AI Benchmark v4:
Pushing Mobile NPUs to Their Limits
Twice larger number of tests, native hardware acceleration on many mobile platforms, new tasks targeted at multiple model acceleration, the possibility of loading and running custom TFLite models, NPU / DSP throttling tests — this isn't the full list of improvements coming with the 4th version of AI Benchmark. The detailed description of these and other changes introduced in this release are provided below.
Native Hardware Acceleration
One of the most awaited features in this release is native AI hardware acceleration on many mobile chipsets. This became possible with the introduction of
TensorFlow Lite delegates
working as a middleware between the standard TFLite runtime and vendors' custom deep learning libraries. Unlike the standard SDK-based benchmarking approach that requires each model to be converted and "optimized" for each vendor, thus making the comparison absolutely unfair, this solution allows to run the
identical model
on every single mobile device, while benefiting from highly optimized proprietary SDKs and avoiding all limitations of the standard Android NNAPI acceleration path.
●
Qualcomm Hexagon NN delegate:
allows to run quantized deep learning models directly on Snapdragon Hexagon DSPs, is compatible with the
Hexagon 680
(Snapdragon 821/820, 660, 636),
Hexagon 682
(Snapdragon 835),
Hexagon 683
(Snapdragon 662, 460),
Hexagon 685
(Snapdragon 845, 712, 710, 675, 670),
Hexagon 686
(Snapdragon 665),
Hexagon 688
(Snapdragon 730/G),
Hexagon 690
(Snapdragon 855/Plus),
Hexagon 692
(Snapdragon 720/G),
Hexagon 696
(Snapdragon 768/G, 765/G),
Hexagon 698
(Snapdragon 865) and newer models. Note that the Hexagon DSP might be locked on some devices and inaccessible by anyone except for the corresponding vendor.
●
Samsung Eden delegate:
provides access to the Exynos NPU (can currently run only a few models) and contains optimized GPU kernels for Exynos' Mali GPUs. Is compatible with the
Exynos 9820, 9825, 980, 990
and newer models.
●
TFLite GPU delegate:
a universal OpenCL / OpenGL-based delegate for accelerating floating-point models on mobile GPUs. Is compatible with almost any GPU supporting
OpenGL ES 3.0+
and provides excellent results on both new and older-generation devices.
More delegates will be added to AI Benchmark as they become available and ready to use on mobile devices.
Running Custom TensorFlow Lite Models
One can now load and
run any TensorFlow Lite model
directly in the benchmark (in the PRO Mode) with all acceleration options available for the standard benchmark tests. There is no need for using any complicated tools or Android programming — just put your
model.tflite
file into the
Download
folder, select the desired acceleration options (
TFLite CPU backend
,
Android NNAPI
,
Hexagon NN
/
Eden
/
TFLite GPU
delegates when available) and get the resulting latency or a detailed error log if the model is incompatible with the corresponding delegator. You can also select your model manually using the prompted File Manager — note that in this case the
"Show Internal Storage"
option should be enabled as the full path to your model file is required.
If you do not have any previous experience of working with the TensorFlow Lite, you can find the detailed instructions on how to convert your pre-trained TensorFlow model to TFLite format
here
(just three lines of code are needed). If you are using PyTorch — then you first need to export your model to
ONNX
and then convert the resulting file to
TFLite
. You can also
quantize
your model during the conversion to be able to run it on the Hexagon DSP, Google Coral TPU and some other integer-only mobile AI accelerators.
NNAPI-1.2 Tests
Since NNAPI-1.1 implements support for a very limited number of ML operations, it cannot accelerate a large number of new deep learning models presented in the past years. Therefore, in AI Benchmark v4 we
introduced 25 new subtests
, the majority of which are NNAPI-1.2 compatible. The new benchmark version contains
MobileNet-V3
and
PyNET
architectures,
LSTM-based
OCR and Text Completion models, neural networks with
transposed convolution
/
depth-to-space
/
gather
ops, etc. The number of accuracy tests was also increased from 10 to 18, while the reported errors became more transparent. In total, the 4th benchmark version consists of the
14 benchmark sections
listed below:
●
Section 1.
Classification,
MobileNet-V2
(INT8 + FP16 + Batch Size=4)
●
Section 2.
Classification,
Inception-V3
(INT8 + FP16)
●
Section 3.
Face Recognition,
MobileNet-V3
(INT8 + FP16)
●
Section 4.
Classification,
MobileNet-V2 x 8
(INT8 + FP16 | Parallel)
●
Section 5.
OCR,
CRNN (ResNet-18 + Bi-LSTM)
(FP16 + FP32)
●
Section 6.
Deblurring,
PyNET
(INT8 + FP16)
●
Section 7.
Super-Resolution,
VGG19
(INT8 + FP16)
●
Section 8.
Super-Resolution,
SRGAN
(INT8 + FP16)
●
Section 9.
Bokeh Rendering,
U-Net
(INT8 + FP16)
●
Section 10.
Semantic Segmentation,
DeepLab-V3+
(INT8 + FP16)
●
Section 11.
Semantic Segmentation,
DeepLab-V3+
(INT8 + FP16 in Parallel)
●
Section 12.
Image Enhancement,
DPED ResNet
(INT8 + FP16)
●
Section 13.
Text Completion,
LSTM
(FP16)
●
Section 14.
Memory limits,
SRCNN
(FP16 | Various Resolutions)
Each section might consist of several subtests running quantized and floating-point models on different accelerators, measuring the initialization time and accuracy of the predicted outputs as well as checking the speed of the pure CPU inference.
A more detailed information about each test / model can be found
here
.
Parallel and NPU / DSP Throttling Tests
There are two test categories in AI Benchmark v4 that should be mentioned separately. The first one is the
Parallel Execution
category that is aimed at processing multiple data simultaneously. It consists of three different tasks: inference with a large batch size in
Section 1
, running a floating-point and a quantized model in parallel to check acceleration support for mixed type inference in
Section 11
. Finally, in the last task the device is running up to 8 floating-point / quantized models simultaneously
(Section 5)
to see if it is capable of asynchronous model inference and acceleration, which is crucial for many real-world problems involving the use of more than one deep learning model.
In the PRO mode, one can find a new
Throttling Test
that allows to run the Inception-V3 model on CPU or with any available acceleration option and monitor the resulting FPS over time. This test does not have a time limit, therefore it is possible to check both the short- and long-term thermal stability of the mobile chipset and all its components. We should note that this test will be significantly expanded in the next benchmark versions, introducing more models and considerably more complex setups.
Scoring System
AI Benchmark v4 measures
more than 100 different aspects
of AI performance that are grouped in the following categories:
●
INT8, NNAPI-1.1 Compatible Tests
●
INT8, NNAPI-1.2 Compatible Tests
●
INT8, Parallel Model Execution
●
INT8, Single / Multi-Thread CPU Tests
●
INT8, Accuracy Tests
●
INT8, Initialization Time
●
FP16, NNAPI-1.1 Compatible Tests
●
FP16, NNAPI-1.2 Compatible Tests
●
FP16, Parallel Model Execution
●
FP16, Single / Multi-Thread CPU Tests
●
FP16, Accuracy Tests
●
FP16, Initialization Time
●
FP16, RNN / LSTM Tests
●
FP32, RNN / LSTM Tests
●
FP16, Memory Tests
The weight of each test category is chosen according to its
popularity
among the research and development ML community and its
relevance
for mobile inference, therefore the final benchmark score is
reflecting the actual user experience
of running the existing AI applications on smartphones. More information about the scoring system will be provided in our next AI Benchmark paper, the previous report with a detailed overview of the last release can be found
here
.
31 May 2020
Andrey Ignatov
|
AI Benchmark
From Kirin 970 to Snapdragon 855 Plus: Performance Review of All SoCs with AI capabilities