TensorRT EP Performance Test Script

This script mainly focuses on benchmarking ORT TensorRT EP performance against the CUDA EP and standalone TensorRT. The metrics include the TensorRT EP performance gain and the percentage of model operators and execution time that run on the TensorRT EP.

Usage

Use the following command to test whether models can be run using TensorRT and then run the benchmark:

./perf.sh

If you only want to run the benchmark, or want to use randomly generated input data instead of input data from the ONNX model zoo, use one of the following commands:

python3 benchmark_wrapper.py -r benchmark -i random -t 100

python3 benchmark.py -r benchmark -i random -t 100

Please note that benchmark_wrapper.py creates one process to execute benchmark.py for every model and every EP. Therefore, when a process runs into a segmentation fault and is forced to exit, the wrapper can catch the error. In contrast, benchmark.py creates only one process to run all the model inferences on all EPs; once that process triggers a segmentation fault, the whole process is forced to exit and can't capture the error or the testing results.
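
The process-isolation idea described above can be sketched as follows. This is a minimal illustration, not the actual code in benchmark_wrapper.py: the function name run_isolated and the job labels are hypothetical, and os.abort() stands in for a hard crash such as a segmentation fault.

```python
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in its own process; return (succeeded, returncode)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # A crashed child yields a non-zero return code (negative on POSIX when
    # it was killed by a signal), while the parent process keeps running.
    return proc.returncode == 0, proc.returncode


# Hypothetical stand-ins for "run one model on one EP".
jobs = {
    ("model_a", "ep_1"): "print('inference ok')",
    ("model_b", "ep_1"): "import os; os.abort()",  # simulates a hard crash
}

results = {}
for job, code in jobs.items():
    ok, returncode = run_isolated(code)
    results[job] = "success" if ok else "failed (exit code %d)" % returncode

for job, status in results.items():
    print(job, status)
```

Because each job runs in a child process, the crash in the second job is recorded as a failure while the result of the first job is kept.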

Options

-r, --running_mode: (default: benchmark) There are two running modes, validate and benchmark. In validate mode, the test script records any runtime error, checks the accuracy of the prediction results using np.testing.assert_almost_equal(), and reports results that don't meet the accuracy requirement. In benchmark mode, it simply runs model inference, assuming the model is correct, and collects the performance metrics. (Note: If you run validation first and then benchmark, the test script knows which models have issues and skips benchmarking them.)
-m, --model_source: (default: model_list.json) There are two ways to specify the list of models to test: (1) explicitly specify a model list file that contains the model information, or (2) specify a directory with the following layout:

    --Directory
      --ModelName1
          --test_data_set_0
              --input0.pb
          --test_data_set_2
              --input0.pb
          --model.onnx
      --ModelName2
          --test_data_set_0
              --input0.pb
          --test_data_set_2
              --input0.pb
          --model.onnx

-i, --input_data: (default: random) Source of the input data. The value is either zoo or random: the input data can come from the ONNX model zoo or be randomly generated by the test script.
-t, --test_times: (default: 1) Number of inference runs in 'benchmark' running mode.
-o, --perf_result_path: (default: result) Directory for the perf results.
--fp16: (default: True) Enable TensorRT/CUDA FP16 and include the performance of this floating point optimization.
--trtexec: Path of the standalone TensorRT executable, for example: trtexec.
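
The directory layout accepted by -m can be walked with a few lines of Python. The sketch below is only an illustration of that layout, not the discovery logic used by benchmark.py; the function name discover_models and the returned dictionary shape are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def discover_models(root):
    """Return {model name: {"model": path, "test_data": [data set names]}}."""
    models = {}
    for model_dir in sorted(Path(root).iterdir()):
        model_file = model_dir / "model.onnx"
        if not model_dir.is_dir() or not model_file.is_file():
            continue  # skip anything that does not match the layout
        data_sets = sorted(d.name for d in model_dir.glob("test_data_set_*")
                           if d.is_dir())
        models[model_dir.name] = {"model": model_file, "test_data": data_sets}
    return models


# Build a throwaway directory that follows the documented layout.
with tempfile.TemporaryDirectory() as root:
    for name in ("ModelName1", "ModelName2"):
        for data_set in ("test_data_set_0", "test_data_set_2"):
            data_dir = Path(root) / name / data_set
            data_dir.mkdir(parents=True)
            (data_dir / "input0.pb").touch()
        (Path(root) / name / "model.onnx").touch()
    found = discover_models(root)

print(sorted(found))
```

Each subdirectory that contains a model.onnx file is treated as one model, and its test_data_set_* folders are listed as the available input sets.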

Results

After running validation and benchmark, the metrics are written into five different CSV files in the 'result' directory or the directory you specified with the -o argument.

benchmark_fail_xxxx.csv: Lists all the models that fail inference on TensorRT/CUDA.
benchmark_success_xxxx.csv: Lists all the models that run inference successfully on TensorRT/CUDA, as well as other related metrics.
benchmark_latency_xxxx.csv: Lists the inference latency of every model on TensorRT/CUDA and the TensorRT Float32/Float16 performance gain compared with CUDA.
benchmark_metrics_xxxx.csv: Lists the number and percentage of model operators that are run by TensorRT and the percentage of execution time spent in TensorRT.
benchmark_system_info_xxxx.csv: Includes the CUDA version, TensorRT version, and CPU information.

These metrics are shown on standard output as well.

The output of running validation:

Total time for running/profiling all models: 0:20:30.761618
['bert-squad', 'faster-rcnn', 'mask-rcnn', 'ssd', 'tiny-yolov2', 'resnet152v1']

Total models: 6
Fail models: 2
Models FAIL/SUCCESS: 2/4

============================================
========== Failing Models/EPs ==============
============================================
{'faster-rcnn': ['CUDAExecutionProvider_fp16'], 'mask-rcnn': ['CUDAExecutionProvider_fp16']}

========================================
========== TRT detail metrics ==========
========================================
{   'BERT-Squad': {   'ratio_of_execution_time_in_trt': 0.9980344366695495,
                      'ratio_of_ops_in_trt': 0.9989451476793249,
                      'total_execution_time': 12719,
                      'total_ops': 948,
                      'total_ops_in_trt': 947,
                      'total_trt_execution_time': 12694},
    'BERT-Squad (FP16)': {   'ratio_of_execution_time_in_trt': 0.9948146725561744,
                             'ratio_of_ops_in_trt': 0.9989451476793249,
                             'total_execution_time': 5207,
                             'total_ops': 948,
                             'total_ops_in_trt': 947,
                             'total_trt_execution_time': 5180},
    'FasterRCNN-10': {   'ratio_of_execution_time_in_trt': 0.881433685003768,
                         'ratio_of_ops_in_trt': 0.8637346791636625,
                         'total_execution_time': 106160,
                         'total_ops': 2774,
                         'total_ops_in_trt': 2396,
                         'total_trt_execution_time': 93573},
    'FasterRCNN-10 (FP16)': {   'ratio_of_execution_time_in_trt': 0.8391227836682785,
                                'total_execution_time': 67623,
                                'total_trt_execution_time': 56744},
    'MaskRCNN-10': {   'ratio_of_execution_time_in_trt': 0.9084868640292711,
                       'ratio_of_ops_in_trt': 0.8557567917205692,
                       'total_execution_time': 147039,
                       'total_ops': 3092,
                       'total_ops_in_trt': 2646,
                       'total_trt_execution_time': 133583},
    'MaskRCNN-10 (FP16)': {   'ratio_of_execution_time_in_trt': 0.8537288833951381,
                              'total_execution_time': 87372,
                              'total_trt_execution_time': 74592},
    'Resnet-152-v1': {   'ratio_of_execution_time_in_trt': 1.0,
                         'ratio_of_ops_in_trt': 1.0,
                         'total_execution_time': 12330,
                         'total_ops': 360,
                         'total_ops_in_trt': 360,
                         'total_trt_execution_time': 12330},
    'Resnet-152-v1 (FP16)': {   'ratio_of_execution_time_in_trt': 1.0,
                                'ratio_of_ops_in_trt': 1.0,
                                'total_execution_time': 3201,
                                'total_ops': 360,
                                'total_ops_in_trt': 360,
                                'total_trt_execution_time': 3201},
    'SSD': {   'ratio_of_execution_time_in_trt': 0.6751571867232051,
               'ratio_of_ops_in_trt': 0.9905660377358491,
               'total_execution_time': 102585,
               'total_ops': 212,
               'total_ops_in_trt': 210,
               'total_trt_execution_time': 69261},
    'SSD (FP16)': {   'ratio_of_execution_time_in_trt': 0.38334507797420264,
                      'ratio_of_ops_in_trt': 0.9905660377358491,
                      'total_execution_time': 32639,
                      'total_ops': 212,
                      'total_ops_in_trt': 210,
                      'total_trt_execution_time': 12512},
    'tiny_yolov2': {   'ratio_of_execution_time_in_trt': 1.0,
                       'ratio_of_ops_in_trt': 1.0,
                       'total_execution_time': 3003,
                       'total_ops': 33,
                       'total_ops_in_trt': 33,
                       'total_trt_execution_time': 3003},
    'tiny_yolov2 (FP16)': {   'ratio_of_execution_time_in_trt': 1.0,
                              'ratio_of_ops_in_trt': 1.0,
                              'total_execution_time': 864,
                              'total_ops': 33,
                              'total_ops_in_trt': 33,
                              'total_trt_execution_time': 864}}
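
The ratio fields in the output above are consistent with simple quotients of the corresponding totals. The sketch below reproduces them under that assumption; the function name trt_ratios is hypothetical and this is not taken from the script's source.

```python
def trt_ratios(total_ops, total_ops_in_trt,
               total_execution_time, total_trt_execution_time):
    """Fraction of operators and of execution time handled by TensorRT."""
    return {
        "ratio_of_ops_in_trt": total_ops_in_trt / total_ops,
        "ratio_of_execution_time_in_trt":
            total_trt_execution_time / total_execution_time,
    }


# Totals copied from the 'BERT-Squad' entry in the sample output.
bert = trt_ratios(total_ops=948, total_ops_in_trt=947,
                  total_execution_time=12719, total_trt_execution_time=12694)
print(bert)
```

For example, 947 of 948 operators in TensorRT gives a ratio of roughly 0.9989, matching the ratio_of_ops_in_trt value reported for BERT-Squad.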

The output of running benchmark:


=========================================
=========== CUDA/TRT latency  ===========
=========================================
{   'BERT-Squad': {   'CUDAExecutionProvider': '28.88',
                      'CUDAExecutionProvider_fp16': '18.08',
                      'TensorrtExecutionProvider': '15.55',
                      'TensorrtExecutionProvider_fp16': '5.00',
                      'Tensorrt_fp16_gain(%)': '72.35 %',
                      'Tensorrt_gain(%)': '46.16 %'},
    'FasterRCNN-10': {   'CUDAExecutionProvider': '161.40',
                         'TensorrtExecutionProvider': '109.24',
                         'TensorrtExecutionProvider_fp16': '66.68',
                         'Tensorrt_gain(%)': '32.32 %'},
    'MaskRCNN-10': {   'CUDAExecutionProvider': '221.93',
                       'TensorrtExecutionProvider': '154.04',
                       'TensorrtExecutionProvider_fp16': '83.78',
                       'Tensorrt_gain(%)': '30.59 %'},
    'Resnet-152-v1': {   'CUDAExecutionProvider': '22.55',
                         'CUDAExecutionProvider_fp16': '24.59',
                         'TensorrtExecutionProvider': '9.82',
                         'TensorrtExecutionProvider_fp16': '3.22',
                         'Tensorrt_fp16_gain(%)': '86.91 %',
                         'Tensorrt_gain(%)': '56.45 %'},
    'SSD': {   'CUDAExecutionProvider': '176.23',
               'CUDAExecutionProvider_fp16': '82.34',
               'TensorrtExecutionProvider': '109.34',
               'TensorrtExecutionProvider_fp16': '40.73',
               'Tensorrt_fp16_gain(%)': '50.53 %',
               'Tensorrt_gain(%)': '37.96 %'},
    'tiny_yolov2': {   'CUDAExecutionProvider': '6.99',
                       'CUDAExecutionProvider_fp16': '5.50',
                       'TensorrtExecutionProvider': '3.15',
                       'TensorrtExecutionProvider_fp16': '1.39',
                       'Tensorrt_fp16_gain(%)': '74.73 %',
                       'Tensorrt_gain(%)': '54.94 %'}}
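
The gain values in the sample output are consistent with the relative latency reduction of TensorRT over CUDA at the same precision, that is (CUDA latency - TensorRT latency) / CUDA latency. The sketch below assumes that formula; it is inferred from the numbers shown here rather than from the script's source, and the function name trt_gain_percent is hypothetical.

```python
def trt_gain_percent(cuda_latency, trt_latency):
    """Relative latency reduction of TensorRT versus CUDA, in percent.

    Returns None when either latency is missing, for example when the
    CUDA FP16 run failed and there is no baseline to compare against.
    """
    if cuda_latency is None or trt_latency is None:
        return None
    return (cuda_latency - trt_latency) / cuda_latency * 100.0


# Latencies copied from the 'BERT-Squad' entry in the sample output.
fp32_gain = trt_gain_percent(28.88, 15.55)
fp16_gain = trt_gain_percent(18.08, 5.00)
print("%.2f %%" % fp32_gain)
print("%.2f %%" % fp16_gain)
```

Under this assumption, a missing CUDA FP16 latency (as for FasterRCNN-10 and MaskRCNN-10 above) leaves no FP16 gain to report, which matches the absence of Tensorrt_fp16_gain(%) for those models.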

Others

ort_build_latest.py: Run this script before running benchmark.py to make sure the latest ORT wheel file is being used.

-o, --ort_master_path: ORT master repo.
-t, --tensorrt_home: TensorRT home directory.
-c, --cuda_home: CUDA home directory.

Dependencies

When running inference with CUDA float16, this script uses the following script to convert nodes in the model graph from float32 to float16. It also modifies the conversion script slightly to cover more model graph conversions. https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/float16.py