I have been trying to use trt.create_inference_graph to convert my Keras-derived TensorFlow SavedModel from FP32 to FP16 and INT8, and then save it in a format that can be used for TensorFlow Serving. Code here - https://colab.research.google.com/drive/16zUmIx0_KxRHLN751RCEBuZRKhWx6BsJ
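For reference, the conversion in the notebook is roughly of this shape (a minimal sketch against the TF 1.x contrib API; the frozen-graph path and output node name are placeholders, not the actual values):

    import tensorflow as tf
    import tensorflow.contrib.tensorrt as trt

    # Load the frozen model (placeholder path).
    frozen_graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_model.pb", "rb") as f:
        frozen_graph_def.ParseFromString(f.read())

    # Rewrite the graph so supported subgraphs run as TensorRT engines.
    trt_graph_def = trt.create_inference_graph(
        input_graph_def=frozen_graph_def,
        outputs=["detections"],             # placeholder output node name
        max_batch_size=1,
        max_workspace_size_bytes=1 << 30,   # 1 GB TensorRT workspace
        precision_mode="FP16")              # "FP32", "FP16" or "INT8"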

However, running this with my test client, I see no change in the timings.

I compared different models on an NVIDIA V100 32 GB and the 8 GB GTX 1070 in my laptop. I tried reducing and increasing the input shape to check for memory effects. Overall, other than the advantage of 32 GB of memory (not just to load models, but to process more frames without running out of memory), the V100 does not seem to give a speedup; I was especially expecting roughly double the speed in FP16 mode. I am not sure whether the Keras-converted TF model, or the model's complexity or design, plays a part.

Here are the test details: https://docs.google.com/spreadsheets/d/1Sl7K6sa96wub1OXcneMk1txthQfh63b0H5mwygyVQlE/edit?usp=sharing

Model 4:  Keras converted to TF Serving
Model 6:  TF graph, simple optimisation
Model 7:  TF graph, simple optimisation + weight quantization
Model 8:  TF graph, simple optimisation + weight + model quantization
Model 9:  Based on Model 4, frozen; NVIDIA TensorRT optimisation, FP32
Model 10: Based on Model 4, frozen; NVIDIA TensorRT optimisation, FP16
Model 11: Based on Model 4, frozen; NVIDIA TensorRT optimisation, INT8
All times in seconds.

No. of runs: 1

    Model   NVIDIA GTX 1070   NVIDIA V100 32 GB
    4       0.13              0.13
    6       0.14              0.15
    7       0.15              0.14
    9       0.13              0.12
    10      0.13              0.12
    11      0.13              0.12

No. of runs: 10

    Model   NVIDIA GTX 1070   NVIDIA V100 32 GB
    4       1.15              0.81
    6       1.34              1.16
    7       1.15              1.27
    9       1.23              0.82
    10      1.22              0.83
    11      1.22              0.85

FP32 - V100 - no optimization

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.968112)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.8355837)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234411)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.7228488922119141)

FP32 with TensorFlow-based optimization (TransformGraph), without weight or model quantization

('Time for ', 10, ' is ', 0.6342859268188477)

FP?? with TensorFlow-based optimization (TransformGraph), with weight quantization

After weight quantization the model size is 39 MB (down from ~149 MB), but the time doubles:

('Time for ', 10, ' is ', 1.201113224029541)
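For context, the TransformGraph runs above can be reproduced roughly like this (a sketch using the TF 1.x Python wrapper; the path and node names are hypothetical):

    import tensorflow as tf
    from tensorflow.tools.graph_transforms import TransformGraph

    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_model.pb", "rb") as f:   # placeholder path
        graph_def.ParseFromString(f.read())

    optimized_graph_def = TransformGraph(
        graph_def,
        inputs=["input"],          # hypothetical input node name
        outputs=["detections"],    # hypothetical output node name
        transforms=[
            "strip_unused_nodes",
            "fold_constants(ignore_errors=true)",
            "fold_batch_norms",
            "quantize_weights",    # omit this for the un-quantized run
        ])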

Model Quantization - Does not work (at least with TF Serving)

Using NVIDIA TensorRT Optimization (colab notebook)

FP16 - v100

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8691568374633789)

INT 8

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8551359176635742)

Optimization snippet: https://colab.research.google.com/drive/1u79vDN4MZuq6gYIOkPmWsbghjunbDq6m

Note: there are slight differences between runs.

Did you find a solution to this issue? I'm encountering a similar issue (no speedup on V100 after converting ResNet-50 from FP32 to FP16 using TensorRT). – Minh Nguyen Sep 8, 2020 at 12:15

I tested with the official TF ResNet-50 model, FP32 and FP16, on an NVIDIA GTX 1070 and an NVIDIA V100. This time I did not use TensorRT or any optimisation. I used the TF model from:

MODEL = https://github.com/tensorflow/models/tree/master/official/resnet
FP32 = http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NCHW.tar.gz
FP16 =  http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp16_savedmodel_NCHW.tar.gz
Model extracted and run:

    docker run --net=host --runtime=nvidia -it --rm -p 8900:8500 -p 8901:8501 \
      -v /home/alex/coding/IPython_neuralnet/:/models/ \
      tensorflow/serving:latest-gpu \
      --model_config_file=/models/resnet50_fp32.json   # or resnet50_fp16.json
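For reference, a client timing loop along these lines reproduces the kind of numbers below (a sketch only; the model name, input tensor name and NCHW input shape are assumptions, not taken from the actual client):

    import time

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    # 8900 is the gRPC port mapped in the docker run above.
    channel = grpc.insecure_channel("localhost:8900")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "resnet50_fp32"   # hypothetical model name
    request.inputs["input"].CopyFrom(           # hypothetical input name
        tf.make_tensor_proto(
            np.random.rand(1, 3, 224, 224).astype(np.float32)))  # NCHW

    start = time.time()
    for _ in range(10):
        stub.Predict(request, timeout=30.0)
    print('Time for ', 10, ' is ', time.time() - start)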
And here are the results. It seems the number of CUDA cores makes no speed difference beyond a certain point, and the FP16 model here does not run twice as fast. Maybe I need to convert it using TensorRT.

https://docs.google.com/spreadsheets/d/1Sl7K6sa96wub1OXcneMk1txthQfh63b0H5mwygyVQlE/edit?usp=sharing

A few things can help to root cause the lack of speedup.

You can check how many nodes actually get converted to TensorRT engines, for example:
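(A quick sketch against the GraphDef returned by create_inference_graph; trt_graph_def is the converted graph:)

    # If few nodes run as TRTEngineOp, most of the model fell back to TF.
    trt_engine_nodes = [n.name for n in trt_graph_def.node
                        if n.op == "TRTEngineOp"]
    print("TRTEngineOp nodes:", len(trt_engine_nodes))
    print("Total nodes:", len(trt_graph_def.node))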

Use the latest version of TF (1.13 or nightly) to take advantage of all the recently added features.

Profile (e.g. with nvprof or the TF profiler) to see what the bottleneck of your inference workload is.
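For the TF profiler route, one TF 1.x option is to dump a Chrome trace (a sketch; sess, fetches and feeds stand in for your inference session and tensors):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(fetches, feed_dict=feeds,
             options=run_options, run_metadata=run_metadata)

    # Open the resulting file in chrome://tracing to see per-op timings.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(tl.generate_chrome_trace_format())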

The TF-TRT user guide might help: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html

There are also a number of examples in this repo: https://github.com/tensorflow/tensorrt

Thanks, I am using the latest binaries, but I was doing the transformation on another machine, on CPU. I will try doing this on the same V100 with GPU and check. – Alex Punnen Mar 16, 2019 at 17:25

I used the latest tensorflow-gpu 1.13.1, ran this on the V100, created the different optimizations and tested. The results are the same. So it seems it makes no difference whether the optimization process happens on the GPU or the CPU. This could be due to the model - a ResNet-50-based RetinaNet converted to TF Serving format and frozen for optimization - or simply to the V100 not being really faster, just having more memory (which does not look right, as FP16 at least should run at double speed - need to check with a simpler model). – Alex Punnen Mar 17, 2019 at 12:57
