The TinyML community researches theories, methods, and approaches to reduce model size and complexity towards their im
Project | Confluence page | Description |
---|---|---|
TinyML - KA25 | SNL Demo | Fully connected model using floating point values |
TinyML - KA25 | SNL Demo - CNN - Floating point (MNIST) | |
TinyML - KA25 | SNL Demo - CNN - Quantized weight and bias | |
TinyML - KA25 | SNL Demo - CNN - Quantized weight and bias and multipliers | |
TinyML - KA25 | SNL Demo - CNN - Binary model and dataset | |
TinyML - KA25 | SNL Demo - CNN - Floating point trained on eMNIST | |
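
The demo pages above progress from floating-point models to quantized weights and biases. As a minimal sketch of what that quantization step involves, the Python below maps a floating-point value onto a signed fixed-point grid. The function name, the 8-bit word with 6 fractional bits, and the choice of round-to-nearest with saturation are assumptions made for illustration, not the settings used in the SNL demos. Note that HLS `ap_fixed` types default to truncation and wrap-around unless configured otherwise, so results from a real design may differ.

```python
import math


def quantize_fixed(value, total_bits=8, frac_bits=6):
    """Map a float onto a signed fixed-point grid (hypothetical helper).

    total_bits and frac_bits are illustrative defaults, not SNL settings.
    Rounds to the nearest representable step and saturates at the range limits.
    """
    scale = 1 << frac_bits                     # steps per unit (64 for 6 fractional bits)
    code_min = -(1 << (total_bits - 1))        # most negative integer code
    code_max = (1 << (total_bits - 1)) - 1     # most positive integer code
    code = math.floor(value * scale + 0.5)     # round to the nearest step
    code = max(code_min, min(code_max, code))  # saturate instead of wrapping
    return code / scale


# Made-up example values; 5.0 and -2.6 fall outside the representable range.
weights = [0.3, -0.71, 1.25, 5.0]
bias = -2.6

quantized_weights = [quantize_fixed(w) for w in weights]
quantized_bias = quantize_fixed(bias)
```

With these assumed settings the representable range is -2.0 to 1.984375 in steps of 1/64, so out-of-range values such as 5.0 are clipped. Comparing model accuracy before and after such a step is one way to choose the word length.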
Generating ASICs from HLS can follow a few possible flows. For example, one can take the HDL code generated by Vitis HLS and import it into the digital flow for ASIC P&R. Alternatively, one can make the HLS code compatible with Stratus or Catapult and use those tools to generate the HDL code.
If using the Vitis tool flow, we have a reference design that can be adopted; it has been used for the eFPGA project:
28nm - https://github.com/slaclab/fabulous-28nm-asic/tree/main/targets
130nm - https://github.com/slaclab/fabulous-28nm/tree/main/asic/targets/digital_top
Application git for the fab28:
https://github.com/slaclab/fabulous-28nm-dev?tab=readme-ov-file#asicfwsw-co-simulation
For the ASIC flow, we can reuse the digital top design, replace the core with an SNL model, and keep all the interfaces the same (see picture below). If we then package the device, one strategy is to make an FMC card that can be reused to test all the digital ASICs without the need for custom wire-bond boards.
Packaging
From previous projects, we estimate that a 1 mm² ASIC in 28 nm costs ~12-15k, plus ~3k for QFN packaging. This may not be part of the current KA25 project but could be added to a request for a second year on this project.
Increasing data rates, image sizes, and the multiple experimental techniques impose several requirements on real-time data processing.
Currently, developing a fully optimized model (pruning, weights, and biases) leads to a long turnaround time whenever a new synthesis is required.
Second, most models developed for offline processing on GPUs and supercomputers are too big to be implemented in FPGAs.
Goals
A successful development would enable LCLS to provide a set of models that can be used in real time to perform the first pass of data analysis and give fast feedback to users during beam time. Demonstrate the viability of this approach with the ePixHR detector (or similar) generating images at 5,000 fps.
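
To make the 5,000 fps target concrete, the short sketch below works out the per-frame time budget that a model must sustain. The frame rate comes from the goal above; the clock frequency, pixel count, and bit depth are placeholder assumptions for illustration only and are not ePixHR or SNL specifications.

```python
def frame_budget_us(frame_rate_fps):
    """Time available per frame, in microseconds, for sustained processing."""
    return 1e6 / frame_rate_fps


def cycles_per_frame(frame_rate_fps, clock_hz):
    """Whole clock cycles available per frame at a given clock frequency."""
    return clock_hz // frame_rate_fps


def input_rate_gbps(frame_rate_fps, pixels_per_frame, bits_per_pixel):
    """Raw input data rate in gigabits per second."""
    return frame_rate_fps * pixels_per_frame * bits_per_pixel / 1e9


FRAME_RATE_FPS = 5_000      # target stated in the goal
CLOCK_HZ = 250_000_000      # assumed FPGA clock, illustration only
PIXELS_PER_FRAME = 100_000  # placeholder, not the real detector geometry
BITS_PER_PIXEL = 16         # placeholder bit depth

budget_us = frame_budget_us(FRAME_RATE_FPS)
cycles = cycles_per_frame(FRAME_RATE_FPS, CLOCK_HZ)
rate_gbps = input_rate_gbps(FRAME_RATE_FPS, PIXELS_PER_FRAME, BITS_PER_PIXEL)

print(f"{budget_us:.0f} us per frame, {cycles} cycles, {rate_gbps:.1f} Gb/s input")
```

At 5,000 fps a new frame arrives every 200 µs, so a pipelined implementation must accept a frame at least that often regardless of its total latency.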
List of possible new features to be included in SNL to enable more models to be implemented and be part of the list of supported ML models in the FPGA context
Project | Confluence page | Description |
---|---|---|
Daniel/Patrick work | Hit, Maybe, Miss | Reproduces the paper on Hit, Maybe, Miss (HMM), keeping the model structure and reducing the image size and filters so that the model fits in an FPGA |
 | Segmented Hit, Maybe, Miss | Reproduce the initial example with classification segmented at the carrier board level, then sum the votes to take the final decision |
 | Quantized Hit, Maybe, Miss | Reproduce the original work (with or without adaptation) but reduce the model using quantization. Implement SNL with AP fixed |
 | Binary Hit, Maybe, Miss | Reproduce the original work with binary weights and biases. (Not sure SNL supports this) |
 | HEP equivalent to Hit, Maybe, Miss | Take advantage of SNL dynamic weights and biases and target the same model at a different set of images. Dog, Maybe, Cat? Need to check with physicists what that could be |
 | ePixHR10k equivalent to Hit, Maybe, Miss | Create a dataset using 3 types of images (laser pointer, cross, and dark). Using the same model, classify the images |
 | Pre-processing - Dark subtraction | In HEP, the Cryo ASIC can be used as an example for channel equalization |
 | Pre-processing - Gain equalization | In HEP, the Cryo ASIC can have its gain equalized (there is a gap in the midscale that should also be corrected) |
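
Three of the rows above (dark subtraction, gain equalization, and summing votes from segmented classifiers) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the per-pixel dark and gain maps, the example values, the label names, and the tie behaviour are made up here, and the midscale gap correction mentioned for the Cryo ASIC is not covered.

```python
from collections import Counter


def dark_subtract(frame, dark):
    """Subtract a per-pixel dark (pedestal) value from a 2D frame."""
    return [[p - d for p, d in zip(f_row, d_row)]
            for f_row, d_row in zip(frame, dark)]


def gain_equalize(frame, gain):
    """Multiply each pixel by its per-pixel gain correction factor."""
    return [[p * g for p, g in zip(f_row, g_row)]
            for f_row, g_row in zip(frame, gain)]


def sum_votes(segment_labels):
    """Return the label with the most votes across segments.

    Ties resolve to the label seen first; a real design would need an
    explicit tie rule.
    """
    return Counter(segment_labels).most_common(1)[0][0]


# Made-up 2x2 example: raw frame, dark map, and gain correction map.
frame = [[110, 205], [98, 300]]
dark = [[10, 5], [-2, 0]]
gain = [[1.0, 0.5], [2.0, 1.0]]

corrected = gain_equalize(dark_subtract(frame, dark), gain)

# One hypothetical label per segment, then a single final decision.
decision = sum_votes(["hit", "miss", "hit", "maybe"])
```

Both corrections are element-wise, which is why they map naturally onto a streaming pre-processing stage ahead of the model.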
Date | Title | link | Comments |
---|---|---|---|
2023 | Implementation of a framework for deploying AI inference engines in FPGAs | https://arxiv.org/abs/2305.19455 | Principles of SNL. Must read for all interns. |
2022 | Performance of a convolutional autoencoder designed to remove electronic noise from p-type point contact germanium detector signals | https://arxiv.org/pdf/2204.06655.pdf | This approach, if successful in HW, can be used to denoise nEXO data. |
2022 | OPEN-SOURCE FPGA-ML CODESIGN FOR THE MLPERF™ TINY BENCHMARK | https://arxiv.org/abs/2206.11791 | |
2020 | Benchmarking TinyML Systems: Challenges and Direction | https://arxiv.org/abs/2003.04821 | "Many traditional ML use cases can be considered futuristic TinyML tasks. As ultra-low-power inference hardware continues to improve, the threshold of viability expands. Tasks like large label space image classification or object counting are well suited for low-power always-on applications but are currently too compute and memory hungry for today’s TinyML hardware." |
2019 | Image Classification on IoT Edge Devices: Profiling and Modeling | https://arxiv.org/pdf/1902.11119.pdf | "In this paper, we show the feasibility and study the performance of image classification using IoT devices. Specifically, we explore the relationships between various factors of image classification algorithms that may affect energy consumption, such as dataset size, image resolution, algorithm type, algorithm phase, and device hardware." |
2018 | Accelerating CNN inference on FPGAs: A Survey | https://arxiv.org/pdf/1806.01683.pdf | Talks about tricks that can be used to minimize computation and maximize performance in CNN applications on FPGAs. Also talks about quantization, so may be interesting to look at in the future. |
2015 | Learning both Weights and Connections for Efficient | https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf | Conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address this limitation, the authors describe a method that reduces the storage and computation required by neural networks by an order of magnitude, without affecting accuracy, by learning only the important connections. |
2021 | Anomaly Detection Based on Tiny Machine | https://oiji.utm.my/index.php/oiji/article/view/148/109 | Using TensorFlow Lite Micro, a TinyML device can be set up to perform anomaly detection. However, the machine learning algorithm has to be exported from TensorFlow, then to TensorFlow Lite, and finally to TensorFlow Lite Micro before it can be uploaded to the TinyML device. This paper highlights the state of the art of current work on TinyML. Some suggestions on research directions are also introduced for potential future endeavors. |
2019 | Squeezing the last MHz for CNN acceleration on FPGAs | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8871619 | Overclocking KCU1500 will decrease runtime while maintaining near-constant accuracy. Tested on 4 different CNN models: LeNet, AlexNet, VGG-16, VGG-19. Provides specific clock numbers and performance as well as accuracy for all 4 models. |
2021 | An Adaptive Row-based Weight Reuse Scheme for FPGA Implementation of Convolutional Neural Networks | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9501490 | Proposes an adaptive row reuse scheme by applying each level of row-reuse for each layer depending on its characteristics. The proposed design is implemented with a Xilinx KCU1500 board with 1.7 times less buffer size than previous works when running VGG-16. |
2018 | A convolutional neural network-based screening tool for X-ray serial crystallography | https://web.eecs.umich.edu/~stellayu/publication/doc/2018xrayJSR.pdf | Opportunity to explore SNL. This could be Daniel's or Patrick's research project. This paper proposes the use of a convolutional network to classify datasets in the X-ray domain. It would be very interesting to reproduce this effort with SNL and demonstrate how it performs with a hardware implementation. The dataset is available and was produced with the now-legacy CSPAD. |
2023 | Artificial neural network on-chip and in-pixel implementation towards pulse amplitude measurement | https://iopscience.iop.org/article/10.1088/1748-0221/18/02/C02048 | This paper presents a tiny model for correcting ADC data while streaming. Can we correct data for CRYO? Also, can this be applied to correct data for pixelated detectors (future gamma TPC)? |