Time- and energy-efficient embedded AI for Tensor Processing Units

Offer type: Internship
Offer status: Approved
Team or department: VERTICS
Description:

Tensor Processing Units (TPUs) [1,2] are application-specific integrated circuits built around a systolic array [3] dedicated to matrix multiplication. They are increasingly used in embedded applications thanks to their low energy consumption and significant acceleration of neural network inference (https://coral.ai/docs/edgetpu/benchmarks). TPUs can improve inference time by up to 30x compared with embedded CPUs and deliver high performance per watt in a small footprint, enabling rapid and cost-effective deployment of AI platforms. Our preliminary benchmarks [4] on an ASUS AI Accelerator CRL-G18U-P3DF, with 8 Google Edge TPUs, showed that pipeline depth (the number of TPUs used) has a very significant impact on inference times for large and medium-sized networks (see the figure below). Running a large model on a single Edge TPU requires fetching the part of the model parameters that does not fit in the TPU's internal memory from main memory on every inference, which incurs a high memory-transaction latency. With pipelining, the model can instead be divided into multiple smaller segments using the Edge TPU Compiler, each segment running on a different TPU and fitting in its internal memory. Furthermore, our tests showed that reloading the neural network model stored on a TPU incurs a high context-switch overhead (9 to 15 ms).
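
For illustration, the segmentation step can be scripted from Python. The sketch below is a minimal example, not a definitive implementation: the model file name and segment count are placeholders, and the --num_segments option and segment file naming should be checked against the installed Edge TPU Compiler version.

    import subprocess

    # Split a .tflite model into one segment per Edge TPU in the pipeline.
    # "model.tflite" and NUM_SEGMENTS are placeholders; the compiler writes
    # files named model_segment_<i>_edgetpu.tflite, one per segment.
    NUM_SEGMENTS = 8  # e.g., one segment per TPU on the 8-TPU board

    subprocess.run(
        ["edgetpu_compiler", f"--num_segments={NUM_SEGMENTS}", "model.tflite"],
        check=True,
    )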

[Figure: inference times as a function of pipeline depth (number of TPUs used), for large and medium-sized networks]

The study will investigate the factors that contribute to the power consumption of deep neural networks executed on multiple tensor processing units (TPUs). Its aim is to characterize the interplay between three factors: inference time, power consumption, and accuracy. This will involve the following steps:

  • Testbed for multi-TPU power consumption measurement. The instantaneous power consumption of different neural networks running on the multi-TPU ASUS CRL-G18U-P3DF board can be measured with current sensors and a microcontroller. The current sensors (e.g., the ACS712 series) must be wired into the power supply of the TPU board's PCIe interface so that the microcontroller (e.g., an Arduino Uno or STM32) can collect the measurements in real time and stream them to the host computer (see the host-side sketch after this list). Existing frameworks with ready-made firmware can serve as a starting point (e.g., PowerSensor2, https://gitlab.com/astron-misc/PowerSensor/).
  • Software optimization techniques and power consumption. Various software optimization methods [5,6], such as pruning, quantization, or weight sharing, can reduce the computational requirements and memory footprint of deep neural networks. By applying these techniques and measuring power consumption during execution on multiple TPUs, we will assess the trade-off between power savings and the degradation of inference time and accuracy (a quantization sketch follows this list). This part of the project will be carried out in collaboration with LIRIS-CNRS in Lyon; the work on structured pruning (i.e., removing less useful neurons or feature maps according to a criterion) is the objective of another internship (https://liris.cnrs.fr/sites/default/files/emploi/sujet_stage_m2_ia3f.pdf).
  • TPU pipeline design. Model pipelining makes it possible to execute different segments of the same model on different TPUs to reduce inference time (https://coral.ai/docs/edgetpu/pipeline/). To create and test different model segmentations across multiple TPUs, we will adapt a tool for modifying and analyzing tflite neural network model files (e.g., Netron); a minimal runtime sketch is given after this list.
  • Benchmarking suite for TPUs. We will write scripts that automate the testing of compressed neural networks under different segmentations and different numbers of TPUs, and that interpret the benchmarking results (energy consumption, inference time, accuracy); a skeleton of such a script is sketched after this list.
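
A minimal host-side sketch for the measurement testbed (first bullet), assuming the microcontroller firmware prints one raw ADC sample per line over USB serial. The port name, baud rate, ADC resolution, sensor variant, and rail voltage are assumptions to be calibrated on the real setup.

    import serial  # pyserial

    PORT = "/dev/ttyACM0"   # placeholder: serial port of the microcontroller
    VREF = 5.0              # ADC reference voltage (Arduino Uno default)
    ADC_MAX = 1023          # 10-bit ADC
    SENSITIVITY = 0.185     # V per A for the ACS712-05B variant
    ZERO_OFFSET = VREF / 2  # ACS712 output voltage at 0 A
    RAIL_VOLTAGE = 12.0     # assumed supply rail feeding the TPU board

    with serial.Serial(PORT, 115200, timeout=1) as link:
        while True:
            line = link.readline().strip()
            if not line:
                continue
            raw = int(line.decode("ascii"))   # one ADC sample per line
            volts = raw * VREF / ADC_MAX      # ADC counts -> sensor volts
            amps = (volts - ZERO_OFFSET) / SENSITIVITY
            print(f"{amps:.3f} A, {amps * RAIL_VOLTAGE:.2f} W")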
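
For the optimization bullet, post-training full-integer quantization is the baseline compression step, since the Edge TPU only executes integer models. A minimal sketch using the standard TensorFlow Lite converter; the model and the calibration generator are placeholders for the networks and data used in the study.

    import tensorflow as tf

    def representative_data():
        # Calibration batches shaped like the real input (placeholder data).
        for _ in range(100):
            yield [tf.random.uniform((1, 224, 224, 3))]

    model = tf.keras.applications.MobileNetV2(weights=None)  # placeholder model

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8   # Edge TPU expects integer I/O
    converter.inference_output_type = tf.uint8

    with open("model_quant.tflite", "wb") as f:
        f.write(converter.convert())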
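
For the pipeline-design bullet, the PyCoral pipelining API referenced above can run the compiled segments across several Edge TPUs. A minimal sketch; the segment file names, device indices, and input shape are placeholders, and the API details should be checked against the installed PyCoral version.

    import numpy as np
    from pycoral.pipeline.pipelined_model_runner import PipelinedModelRunner
    from pycoral.utils.edgetpu import make_interpreter

    NUM_SEGMENTS = 4
    interpreters = [
        make_interpreter(f"model_segment_{i}_edgetpu.tflite", device=f":{i}")
        for i in range(NUM_SEGMENTS)
    ]
    for interpreter in interpreters:
        interpreter.allocate_tensors()

    runner = PipelinedModelRunner(interpreters)

    # Push one input by tensor name, then pop the pipeline's output.
    name = runner.interpreters()[0].get_input_details()[0]["name"]
    runner.push({name: np.zeros((1, 224, 224, 3), dtype=np.uint8)})
    result = runner.pop()
    runner.push({})  # empty push signals end of input to the pipeline
    print({k: v.shape for k, v in result.items()})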
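
Finally, for the benchmarking bullet, a skeleton of the automation script. Here run_pipeline() is a hypothetical stand-in for the pipelined inference set up in the previous sketch, and the model list and segment counts are placeholders.

    import csv
    import time

    def run_pipeline(model_name, num_segments):
        # Hypothetical helper: build the pipeline for this model and
        # segmentation as in the previous sketch and run one inference.
        ...

    MODELS = ["mobilenet_v2", "resnet50"]  # placeholder model variants
    SEGMENT_COUNTS = [1, 2, 4, 8]          # pipeline depths to compare
    RUNS = 100                             # inferences per configuration

    with open("benchmarks.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "segments", "mean_latency_ms"])
        for model_name in MODELS:
            for n in SEGMENT_COUNTS:
                start = time.perf_counter()
                for _ in range(RUNS):
                    run_pipeline(model_name, n)
                mean_ms = (time.perf_counter() - start) / RUNS * 1e3
                writer.writerow([model_name, n, f"{mean_ms:.2f}"])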


The project is a collaboration among the Laboratory for Analysis and Architecture of Systems (LAAS-CNRS), the Laboratory for Computer Science in Images and Information Systems (LIRIS-CNRS), and the Technical University of Munich (TUM). The host laboratory is the LAAS-CNRS in Toulouse, France. The internship will be co-supervised by Dr. Tomasz Kloda (LAAS-CNRS), Dr. Stefan Duffner (LIRIS-CNRS), and Binqi Sun (TUM).


[1] John L. Hennessy and David A. Patterson. “A New Golden Age for Computer Architecture”. In: Commun. ACM 62.2 (Jan. 2019), pp. 48–60. ISSN: 0001-0782. DOI: 10.1145/3282307.
[2] Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson. “A Domain-Specific Architecture for Deep Neural Networks”. In: Commun. ACM 61.9 (Aug. 2018), pp. 50–59. ISSN: 0001-0782. DOI: 10.1145/3154484.
[3] H. T. Kung. “Why Systolic Architectures?”. In: Computer 15.1 (Jan. 1982), pp. 37–46. DOI: 10.1109/MC.1982.1653825.
[4] Binqi Sun, Tomasz Kloda, Jiyang Chen, Cen Lu, and Marco Caccamo. “Schedulability Analysis of Non-preemptive Sporadic Gang Tasks on Hardware Accelerators”. In: 2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS). 2023, pp. 147–160. DOI: 10.1109/RTAS58335.2023.00019.
[5] Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. “Pruning and quantization for deep neural network acceleration: A survey”. In: Neurocomputing 461 (2021), pp. 370–403. ISSN: 0925-2312. DOI: 10.1016/j.neucom.2021.07.045.
[6] Anthony Berthelier, Thierry Chateau, Stefan Duffner, Christophe Garcia, and Christophe Blanc. “Deep Model Compression and Architecture Optimization for Embedded Systems: A Survey”. In: Journal of Signal Processing Systems (Oct. 2020). DOI: 10.1007/s11265-020-01596-1.


Keywords: TPU, power consumption, neural network, hardware accelerators, real-time
Compensation: €650 per month
Duration: 6 months
Number of positions: 1