相关文章推荐
爱搭讪的长颈鹿  ·  ACL Paper Decoded: ...·  4 月前    · 
有爱心的消防车  ·  AI-Benchmark·  3 月前    · 
豪情万千的围巾  ·  Accelerating ...·  3 月前    · 
读研的键盘  ·  Underactuated Robotics·  3 月前    · 
开朗的小刀  ·  Release 0.5.0 - ...·  2 月前    · 
爱喝酒的香槟  ·  怎样利用digital ...·  1 年前    · 
稳重的铁链  ·  WPF Prism框架(1) -- ...·  2 年前    · 
大方的柚子  ·  X++ container runtime ...·  2 年前    · 

AI Benchmark v4: Pushing Mobile NPUs to Their Limits

Twice larger number of tests, native hardware acceleration on many mobile platforms, new tasks targeted at multiple model acceleration, the possibility of loading and running custom TFLite models, NPU / DSP throttling tests — this isn't the full list of improvements coming with the 4th version of AI Benchmark. The detailed description of these and other changes introduced in this release are provided below.

Native Hardware Acceleration

One of the most awaited features in this release is native AI hardware acceleration on many mobile chipsets. This became possible with the introduction of TensorFlow Lite delegates working as a middleware between the standard TFLite runtime and vendors' custom deep learning libraries. Unlike the standard SDK-based benchmarking approach that requires each model to be converted and "optimized" for each vendor, thus making the comparison absolutely unfair, this solution allows to run the identical model on every single mobile device, while benefiting from highly optimized proprietary SDKs and avoiding all limitations of the standard Android NNAPI acceleration path.

Qualcomm Hexagon NN delegate: allows to run quantized deep learning models directly on Snapdragon Hexagon DSPs, is compatible with the Hexagon 680 (Snapdragon 821/820, 660, 636), Hexagon 682 (Snapdragon 835), Hexagon 683 (Snapdragon 662, 460), Hexagon 685 (Snapdragon 845, 712, 710, 675, 670), Hexagon 686 (Snapdragon 665), Hexagon 688 (Snapdragon 730/G), Hexagon 690 (Snapdragon 855/Plus), Hexagon 692 (Snapdragon 720/G), Hexagon 696 (Snapdragon 768/G, 765/G), Hexagon 698 (Snapdragon 865) and newer models. Note that the Hexagon DSP might be locked on some devices and inaccessible by anyone except for the corresponding vendor. Samsung Eden delegate: provides access to the Exynos NPU (can currently run only a few models) and contains optimized GPU kernels for Exynos' Mali GPUs. Is compatible with the Exynos 9820, 9825, 980, 990 and newer models. TFLite GPU delegate: a universal OpenCL / OpenGL-based delegate for accelerating floating-point models on mobile GPUs. Is compatible with almost any GPU supporting OpenGL ES 3.0+ and provides excellent results on both new and older-generation devices. More delegates will be added to AI Benchmark as they become available and ready to use on mobile devices.

Running Custom TensorFlow Lite Models

One can now load and run any TensorFlow Lite model directly in the benchmark (in the PRO Mode) with all acceleration options available for the standard benchmark tests. There is no need for using any complicated tools or Android programming — just put your model.tflite file into the Download folder, select the desired acceleration options ( TFLite CPU backend , Android NNAPI , Hexagon NN / Eden / TFLite GPU delegates when available) and get the resulting latency or a detailed error log if the model is incompatible with the corresponding delegator. You can also select your model manually using the prompted File Manager — note that in this case the "Show Internal Storage" option should be enabled as the full path to your model file is required. If you do not have any previous experience of working with the TensorFlow Lite, you can find the detailed instructions on how to convert your pre-trained TensorFlow model to TFLite format here (just three lines of code are needed). If you are using PyTorch — then you first need to export your model to ONNX and then convert the resulting file to TFLite . You can also quantize your model during the conversion to be able to run it on the Hexagon DSP, Google Coral TPU and some other integer-only mobile AI accelerators.

NNAPI-1.2 Tests

Since NNAPI-1.1 implements support for a very limited number of ML operations, it cannot accelerate a large number of new deep learning models presented in the past years. Therefore, in AI Benchmark v4 we introduced 25 new subtests , the majority of which are NNAPI-1.2 compatible. The new benchmark version contains MobileNet-V3 and PyNET architectures, LSTM-based OCR and Text Completion models, neural networks with transposed convolution / depth-to-space / gather ops, etc. The number of accuracy tests was also increased from 10 to 18, while the reported errors became more transparent. In total, the 4th benchmark version consists of the 14 benchmark sections listed below: Section 1. Classification, MobileNet-V2 (INT8 + FP16 + Batch Size=4) Section 2. Classification, Inception-V3 (INT8 + FP16) Section 3. Face Recognition, MobileNet-V3 (INT8 + FP16) Section 4. Classification, MobileNet-V2 x 8 (INT8 + FP16 | Parallel) Section 5. OCR, CRNN  (ResNet-18 + Bi-LSTM) (FP16 + FP32) Section 6. Deblurring, PyNET (INT8 + FP16) Section 7. Super-Resolution, VGG19 (INT8 + FP16) Section 8. Super-Resolution, SRGAN (INT8 + FP16) Section 9. Bokeh Rendering, U-Net (INT8 + FP16) Section 10. Semantic Segmentation, DeepLab-V3+ (INT8 + FP16) Section 11. Semantic Segmentation, DeepLab-V3+ (INT8 + FP16 in Parallel) Section 12. Image Enhancement, DPED ResNet (INT8 + FP16) Section 13. Text Completion, LSTM (FP16) Section 14. Memory limits, SRCNN (FP16 | Various Resolutions) Each section might consist of several subtests running quantized and floating-point models on different accelerators, measuring the initialization time and accuracy of the predicted outputs as well as checking the speed of the pure CPU inference. A more detailed information about each test / model can be found here .

Parallel and NPU / DSP Throttling Tests

There are two test categories in AI Benchmark v4 that should be mentioned separately. The first one is the Parallel Execution category that is aimed at processing multiple data simultaneously. It consists of three different tasks: inference with a large batch size in Section 1 , running a floating-point and a quantized model in parallel to check acceleration support for mixed type inference in Section 11 . Finally, in the last task the device is running up to 8 floating-point / quantized models simultaneously (Section 5) to see if it is capable of asynchronous model inference and acceleration, which is crucial for many real-world problems involving the use of more than one deep learning model. In the PRO mode, one can find a new Throttling Test that allows to run the Inception-V3 model on CPU or with any available acceleration option and monitor the resulting FPS over time. This test does not have a time limit, therefore it is possible to check both the short- and long-term thermal stability of the mobile chipset and all its components. We should note that this test will be significantly expanded in the next benchmark versions, introducing more models and considerably more complex setups.

Scoring System

AI Benchmark v4 measures more than 100 different aspects of AI performance that are grouped in the following categories: INT8,  NNAPI-1.1 Compatible Tests INT8,  NNAPI-1.2 Compatible Tests INT8,  Parallel Model Execution INT8,  Single / Multi-Thread CPU Tests INT8,  Accuracy Tests INT8,  Initialization Time FP16,  NNAPI-1.1 Compatible Tests FP16,  NNAPI-1.2 Compatible Tests FP16,  Parallel Model Execution FP16,  Single / Multi-Thread CPU Tests FP16,  Accuracy Tests FP16,  Initialization Time FP16,  RNN / LSTM Tests FP32,  RNN / LSTM Tests FP16,  Memory Tests The weight of each test category is chosen according to its popularity among the research and development ML community and its relevance for mobile inference, therefore the final benchmark score is reflecting the actual user experience of running the existing AI applications on smartphones. More information about the scoring system will be provided in our next AI Benchmark paper, the previous report with a detailed overview of the last release can be found here .

31 May 2020 Andrey Ignatov | AI Benchmark

From  Kirin  970  to  Snapdragon  855  Plus:   Performance  Review  of  All  SoCs  with  AI  capabilities