Time- and energy-efficient embedded AI for Tensor Processing Units

Stage

Équipes / Services concernés

Responsables

Tomasz Kloda

Date de publication

24.11.24

Prise de poste souhaitée

03.02.25

Context. Tensor Processing Units (TPUs) [2, 3] are application-specific integrated circuits designed specifically for accelerating core machine learning operations like matrix multiplications and convolutions. TPUs are increasingly applied in embedded applications thanks to their low energy consumption and significant acceleration of neural net inference (https://coral.ai/docs/edgetpu/benchmarks). They can accelerate the inference time by 30x compared with embedded CPUs and deliver high performance per watt at a very small footprint. Our preliminary benchmarks [5, 6] on an ASUS AI Accelerator CRL-G18U-P3D1, with 8 Google Edge TPUs, demonstrated that pipeline depth (the number of TPUs used) has a very significant impact on inference times for large and medium-sized networks (see the figure on the left). Running large models on an Edge TPU requires fetching the remaining part of the model parameters from the main memory, which incurs a high memory transaction latency. With pipelining, a model can be divided into multiple smaller segments using the Edge TPU Compiler, and each segment can run on a different TPU fitting its internal memory. Finally, our tests have demonstrated that reloading the neural network models stored on a TPU incurs a high context switch overhead (from 9 to 15 ms).

This internship. Our study will explore various factors that contribute to the power consumption of deep neural networks when executed on multiple tensor processing units (TPUs). Its aim is to identify the interplay between three key factors: inference time, power consumption, and accuracy. We will undertake the following steps:

Testbed for multi-TPU power consumption measurement. The instantaneous power consumption of different neural networks running on a multi-TPU board ASUS CRL-G18U-P3DF can be measured using the current sensors. A microcontroller (e.g., Arduino Uno or STM32 ) can collect the measurements in real-time from the current sensors (e.g., ACS712 series) connected to the TPU board PCI interface and send them to the host computer. Several frameworks can be implemented with the firmware already available (e.g., PowerSensor2, https://gitlab.com/astron-misc/PowerSensor/).
Software optimization techniques and power consumption. Various software optimization methods [1, 4], such as pruning, quantization, or weight sharing, can reduce deep neural networks’ computational requirements and memory footprint. By applying these optimization techniques and measuring power consumption during execution on multi-TPUs, a trade-off between power savings and performance degradation in terms of time and accuracy can be established. This part of the project will be undertaken in collaboration with Laboratoire d’InfoRmatique en Image et Systèmes d’information (LIRIS-CNRS) in Lyon.
TPU pipeline design. Model pipelining allows to execute different segments of the same model on different TPUs to reduce the inference time (https://coral.ai/docs/edgetpu/pipeline/). To create and test the different model segmentations across multiple TPUs, a tool to modify and analyze tflite neural net model files (e.g., Netron) can be used. This part of the project will be undertaken in collaboration with Technical University of Munich.
Benchmarking suite for TPUs. The final step is to create scripts to automate the tests of compressed neural networks with different segmentations and with different numbers of TPUs being used and interpret the benchmarking results (consumption energy, inference time, precision).

Research group. The project is a collaboration among the Laboratory for Analysis and Architecture of Systems (LAAS-CNRS), the Laboratory for Computer Science in Images and Information Systems (LIRIS-CNRS), and the Technical University of Munich (TUM). The host laboratory is the LAAS-CNRS in Toulouse, France. The internship will be co-supervised by Dr. Tomasz Kloda (LAAS-CNRS), Dr. Stefan Duffner (LIRIS-CNRS), and Binqi Sun (TUM).

Contact. For inquiries about the position, please contact Dr. Tomasz Kloda (tomasz.kloda@laas.fr).

Qualifications. A graduate student of electrical engineering or computer engineering or similar. Good C and Python programming skills, knowledge of Linux and embedded systems as well as basic knowledge of machine learning.

Duration. 6 months. The ideal starting period is February/March 2025 with some flexibility.

Salary. 650 euros / month

References.

[1] Anthony Berthelier, Thierry Chateau, Stefan Duffner, Christophe Garcia, and Christophe Blanc. Deep model compression and architecture optimization for embedded systems: A survey. Journal of Signal Processing Systems, 93:863 – 878, 2020.

[2] John L. Hennessy and David A. Patterson. A new golden age for computer architecture. Commun. ACM, 62(2):48–60, jan 2019.

[3] Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson. A domain-specific architecture for deep neural networks. Commun. ACM, 61(9):50–59, aug 2018.

[4] Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021.

[5] Binqi Sun, Tomasz Kloda, Jiyang Chen, Cen Lu, and Marco Caccamo. Schedulability analysis of non-preemptive sporadic gang tasks on hardware accelerators. In 2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 147–160, 2023.

[6] Binqi Sun, Tomasz Kloda, Chu-ge Wu, and Marco Caccamo. Partitioned scheduling and parallelism assignment for real-time DNN inference tasks on multi-TPU. In Proceedings of the 61st ACM/IEEE Design Automation Conference, DAC ’24, New York, NY, USA, 2024. Association for Computing Machinery.

Postuler

Retour aux offres