TensorRT Perf Tool (#4900)

* Initialize tensorrt perf script
* Add bert-squad dependencies
* Modified code to make ort inference with CUDA/Tensorrt
* Add get CUDA/TRT version
* uncomment bert-squad
* Add BERT-SQUAD inputs.json
* Add FastRCNN
* Make preprocess/validation into common functions
* Add MaskRCNN and SSD and consolidate the code
* Add dependencies for MaskRCNN
* following modifications are made:
  - create common fetch function to get inputs/outputs of model from ONNX model zoo.
  - create common validation function to compare inference outputs with reference outputs from ONNX model zoo.
  - move run/repeat time to argument list. (still working on other arguments, like fp16 or fp32, latency percentile).
  - generate table in csv file to show the latency comparison (TRT vs CUDA) side by side.
* Add approach to analyze profiling file and also update model related settings
* Add models
* Add most of models from ONNX model zoo
* Add model input name and print all the model names at the end of run
* Add system info
* Add TRT fp16 support
* Refine the code
* Handle TRT fall back and modify the way to get input data
* Refine code
* Modify code
* Add more precise approach to measure inference
* Add io-binding
* Add YoLoV4
* Refine the code
* Refine the code
* Add models
* Add yolov4 notebook for jetson device
* Update notebook
* Update notebook
* Add CVS models
* Add missing model
* Add support of float16
* Add new way to get trt version
* Add "validate" and "benchmark" mode
* Add randomly generated input
* Refine perf script
* Refine the code.
* Add README
* Refine the code
* Update README.md
* Refine code
* Update README.md
* Remove all the model related python and instead using model_list.json as models configuration. Refine the benchmark.py
* Refine the code

Co-authored-by: Chi Lo <lochi@microsoft.com>
2026-07-11 17:48:34 +00:00 · 2020-09-15 10:06:01 -07:00 · 2020-09-15 10:06:01 -07:00 · 9f526f45ac
commit 9f526f45ac
parent ef496d36ea
9 changed files with 2113 additions and 0 deletions
--- a/onnxruntime/python/tools/tensorrt/perf/README.md
+++ b/onnxruntime/python/tools/tensorrt/perf/README.md
@ -0,0 +1,160 @@
+# TensorRT Performance Test Script
+This script benchmarks TensorRT EP performance from ONNX Runtime, using the CUDA EP and standalone TensorRT as baselines. The metrics include the TensorRT EP performance gain and the percentage of model operators and execution time that run on the TensorRT EP.
+
+## Usage
+Use the following command to run the benchmark and validate prediction results:
+```
+./perf.sh
+```
+If you only want to run the benchmark, or want to use randomly generated input data instead of input data from the ONNX model zoo, use the following command:
+```
+python3 benchmark.py -r benchmark -i random -t 100
+```
+### Options
+- **-r, --running_mode**: (*default: benchmark*) There are two running modes, *validate* and *benchmark*. In validate mode, the test script records any runtime error, checks the accuracy of the prediction results using *np.testing.assert_almost_equal()*, and reports results that don't meet the accuracy requirement. In benchmark mode, it simply runs model inference, assuming the model is correct, and collects the performance metrics. (Note: If you run validation first and then benchmark, the test script knows which models have issues and skips benchmarking them.)
+- **-m, --model_list_file**: (*default: model_list.json*) The model list for benchmarking as well as information about each model.
+- **-i, --input_data**: (*default: random*) Source of the input data. The value is either *zoo* or *random*: the input data can come from the ONNX model zoo or be randomly generated by the test script.
+- **-t, --test_times**: (*default: 1*) Number of inference runs in 'benchmark' running mode.
+- **--fp16**: (*default: True*) Enable TensorRT/CUDA FP16 and include the performance of this floating point optimization.
+- **--trtexec**: Path of standalone TensorRT executable, for example: trtexec.
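The options above can be sketched as an `argparse` declaration. The diff of `benchmark.py` is not shown in this view, so this is only an illustration built from the README text: `build_parser` and `str2bool` are hypothetical names, and the real script may declare its flags differently.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: accept common spellings of true/false."""
    if isinstance(value, bool):
        return value
    text = value.lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Declare the flags and defaults exactly as the README lists them."""
    parser = argparse.ArgumentParser(description="TensorRT EP perf sketch")
    parser.add_argument("-r", "--running_mode", default="benchmark",
                        choices=["validate", "benchmark"])
    parser.add_argument("-m", "--model_list_file", default="model_list.json")
    parser.add_argument("-i", "--input_data", default="random",
                        choices=["zoo", "random"])
    parser.add_argument("-t", "--test_times", type=int, default=1)
    parser.add_argument("--fp16", type=str2bool, default=True)
    parser.add_argument("--trtexec", default=None)
    return parser


# Same arguments as the README's benchmark-only command line.
args = build_parser().parse_args(["-r", "benchmark", "-i", "random", "-t", "100"])
print(args.running_mode, args.input_data, args.test_times, args.fp16)
```

Restricting `--running_mode` and `--input_data` with `choices` makes a misspelled mode fail at parse time rather than partway through a long run.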
+### Results
+After running validation and benchmark, the metrics are written into five different csv files in the 'result' directory.
+- **benchmark_fail_xxxx.csv**: Lists all the models that fail to be inferenced by TensorRT/CUDA.
+- **benchmark_success_xxxx.csv**: Lists all the models that can be successfully inferenced by TensorRT/CUDA, as well as other related metrics.
+- **benchmark_latency_xxxx.csv**: Lists the inference latency of every model on TensorRT/CUDA, along with the TensorRT Float32/Float16 performance gain compared with CUDA.
+- **benchmark_ratio_xxxx.csv**: Lists the number and percentage of model operators that are run by TensorRT, and the percentage of execution time spent in TensorRT.
+- **benchmark_system_info_xxxx.csv**: Includes CUDA version, TensorRT version and CPU information.
+
+These metrics are also shown on standard output.
+
+The output of running validation:
+```
+Total time for running/profiling all models: 0:20:30.761618
+['bert-squad', 'faster-rcnn', 'mask-rcnn', 'ssd', 'tiny-yolov2', 'resnet152v1']
+
+Total models: 6
+Fail models: 2
+Models FAIL/SUCCESS: 2/4
+
+============================================
+========== Failing Models/EPs ==============
+============================================
+{'faster-rcnn': ['CUDAExecutionProvider_fp16'], 'mask-rcnn': ['CUDAExecutionProvider_fp16']}
+
+========================================
+========== TRT detail metrics ==========
+========================================
+{   'BERT-Squad': {   'ratio_of_execution_time_in_trt': 0.9980344366695495,
+                      'ratio_of_ops_in_trt': 0.9989451476793249,
+                      'total_execution_time': 12719,
+                      'total_ops': 948,
+                      'total_ops_in_trt': 947,
+                      'total_trt_execution_time': 12694},
+    'BERT-Squad (FP16)': {   'ratio_of_execution_time_in_trt': 0.9948146725561744,
+                             'ratio_of_ops_in_trt': 0.9989451476793249,
+                             'total_execution_time': 5207,
+                             'total_ops': 948,
+                             'total_ops_in_trt': 947,
+                             'total_trt_execution_time': 5180},
+    'FasterRCNN-10': {   'ratio_of_execution_time_in_trt': 0.881433685003768,
+                         'ratio_of_ops_in_trt': 0.8637346791636625,
+                         'total_execution_time': 106160,
+                         'total_ops': 2774,
+                         'total_ops_in_trt': 2396,
+                         'total_trt_execution_time': 93573},
+    'FasterRCNN-10 (FP16)': {   'ratio_of_execution_time_in_trt': 0.8391227836682785,
+                                'total_execution_time': 67623,
+                                'total_trt_execution_time': 56744},
+    'MaskRCNN-10': {   'ratio_of_execution_time_in_trt': 0.9084868640292711,
+                       'ratio_of_ops_in_trt': 0.8557567917205692,
+                       'total_execution_time': 147039,
+                       'total_ops': 3092,
+                       'total_ops_in_trt': 2646,
+                       'total_trt_execution_time': 133583},
+    'MaskRCNN-10 (FP16)': {   'ratio_of_execution_time_in_trt': 0.8537288833951381,
+                              'total_execution_time': 87372,
+                              'total_trt_execution_time': 74592},
+    'Resnet-152-v1': {   'ratio_of_execution_time_in_trt': 1.0,
+                         'ratio_of_ops_in_trt': 1.0,
+                         'total_execution_time': 12330,
+                         'total_ops': 360,
+                         'total_ops_in_trt': 360,
+                         'total_trt_execution_time': 12330},
+    'Resnet-152-v1 (FP16)': {   'ratio_of_execution_time_in_trt': 1.0,
+                                'ratio_of_ops_in_trt': 1.0,
+                                'total_execution_time': 3201,
+                                'total_ops': 360,
+                                'total_ops_in_trt': 360,
+                                'total_trt_execution_time': 3201},
+    'SSD': {   'ratio_of_execution_time_in_trt': 0.6751571867232051,
+               'ratio_of_ops_in_trt': 0.9905660377358491,
+               'total_execution_time': 102585,
+               'total_ops': 212,
+               'total_ops_in_trt': 210,
+               'total_trt_execution_time': 69261},
+    'SSD (FP16)': {   'ratio_of_execution_time_in_trt': 0.38334507797420264,
+                      'ratio_of_ops_in_trt': 0.9905660377358491,
+                      'total_execution_time': 32639,
+                      'total_ops': 212,
+                      'total_ops_in_trt': 210,
+                      'total_trt_execution_time': 12512},
+    'tiny_yolov2': {   'ratio_of_execution_time_in_trt': 1.0,
+                       'ratio_of_ops_in_trt': 1.0,
+                       'total_execution_time': 3003,
+                       'total_ops': 33,
+                       'total_ops_in_trt': 33,
+                       'total_trt_execution_time': 3003},
+    'tiny_yolov2 (FP16)': {   'ratio_of_execution_time_in_trt': 1.0,
+                              'ratio_of_ops_in_trt': 1.0,
+                              'total_execution_time': 864,
+                              'total_ops': 33,
+                              'total_ops_in_trt': 33,
+                              'total_trt_execution_time': 864}}
+
+```
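The two ratio fields in the sample output above are consistent with simple quotients of the accompanying totals. The sketch below reproduces them from the printed counts. The function name `trt_ratios` is hypothetical, and the way the real script aggregates its profiling data is not shown here.

```python
import math


def trt_ratios(total_ops, total_ops_in_trt,
               total_execution_time, total_trt_execution_time):
    """Return the share of operators and of execution time handled by TensorRT."""
    return {
        "ratio_of_ops_in_trt": total_ops_in_trt / total_ops,
        "ratio_of_execution_time_in_trt":
            total_trt_execution_time / total_execution_time,
    }


# Totals copied from the 'BERT-Squad' entry of the sample output.
bert = trt_ratios(total_ops=948, total_ops_in_trt=947,
                  total_execution_time=12719, total_trt_execution_time=12694)
print(bert)
assert math.isclose(bert["ratio_of_ops_in_trt"], 0.9989451476793249)
```

In the sample, `SSD (FP16)` has 210 of its 212 operators on TensorRT but only about 38% of its execution time there, so the two ratios are best read together.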
+
+The output of running benchmark:
+```
+
+=========================================
+=========== CUDA/TRT latency  ===========
+=========================================
+{   'BERT-Squad': {   'CUDAExecutionProvider': '28.88',
+                      'CUDAExecutionProvider_fp16': '18.08',
+                      'TensorrtExecutionProvider': '15.55',
+                      'TensorrtExecutionProvider_fp16': '5.00',
+                      'Tensorrt_fp16_gain(%)': '72.35 %',
+                      'Tensorrt_gain(%)': '46.16 %'},
+    'FasterRCNN-10': {   'CUDAExecutionProvider': '161.40',
+                         'TensorrtExecutionProvider': '109.24',
+                         'TensorrtExecutionProvider_fp16': '66.68',
+                         'Tensorrt_gain(%)': '32.32 %'},
+    'MaskRCNN-10': {   'CUDAExecutionProvider': '221.93',
+                       'TensorrtExecutionProvider': '154.04',
+                       'TensorrtExecutionProvider_fp16': '83.78',
+                       'Tensorrt_gain(%)': '30.59 %'},
+    'Resnet-152-v1': {   'CUDAExecutionProvider': '22.55',
+                         'CUDAExecutionProvider_fp16': '24.59',
+                         'TensorrtExecutionProvider': '9.82',
+                         'TensorrtExecutionProvider_fp16': '3.22',
+                         'Tensorrt_fp16_gain(%)': '86.91 %',
+                         'Tensorrt_gain(%)': '56.45 %'},
+    'SSD': {   'CUDAExecutionProvider': '176.23',
+               'CUDAExecutionProvider_fp16': '82.34',
+               'TensorrtExecutionProvider': '109.34',
+               'TensorrtExecutionProvider_fp16': '40.73',
+               'Tensorrt_fp16_gain(%)': '50.53 %',
+               'Tensorrt_gain(%)': '37.96 %'},
+    'tiny_yolov2': {   'CUDAExecutionProvider': '6.99',
+                       'CUDAExecutionProvider_fp16': '5.50',
+                       'TensorrtExecutionProvider': '3.15',
+                       'TensorrtExecutionProvider_fp16': '1.39',
+                       'Tensorrt_fp16_gain(%)': '74.73 %',
+                       'Tensorrt_gain(%)': '54.94 %'}}
+
+```
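The gain figures in the sample output above match the relative latency reduction against the CUDA baseline of the same precision: `(cuda - trt) / cuda * 100`. This formula is inferred from the printed numbers, not taken from `benchmark.py`, and the helper names below are hypothetical. Because the printed latencies are already rounded, recomputing from them may occasionally differ from the tool's own output in the last digit.

```python
def gain_percent(baseline_ms, candidate_ms):
    """Relative latency reduction of candidate versus baseline, in percent."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


def format_gain(baseline_ms, candidate_ms):
    """Format the gain in the same style as the sample output."""
    return "%.2f %%" % gain_percent(baseline_ms, candidate_ms)


# Latencies copied from the 'BERT-Squad' entry of the sample output.
print(format_gain(28.88, 15.55))  # FP32: CUDA EP versus TensorRT EP
print(format_gain(18.08, 5.00))   # FP16: CUDA EP fp16 versus TensorRT EP fp16
```

The FP16 gain compares the FP16 TensorRT latency with the FP16 CUDA latency, which would explain why `FasterRCNN-10` and `MaskRCNN-10`, whose `CUDAExecutionProvider_fp16` runs are listed as failing, have no `Tensorrt_fp16_gain(%)` entry.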
+## Dependencies
+- This test script uses the following script to infer shapes in the model for the TensorRT execution provider.
+https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py
+- When inferencing a model using CUDA float16, this script uses the following script to convert nodes in the model graph from float32 to float16. It also modifies the conversion script slightly to cover more model graph conversions.
+https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/float16.py
--- a/onnxruntime/python/tools/tensorrt/perf/init.py
+++ b/onnxruntime/python/tools/tensorrt/perf/init.py
--- a/onnxruntime/python/tools/tensorrt/perf/benchmark.py
+++ b/onnxruntime/python/tools/tensorrt/perf/benchmark.py
--- a/onnxruntime/python/tools/tensorrt/perf/float16.py
+++ b/onnxruntime/python/tools/tensorrt/perf/float16.py
@ -0,0 +1,225 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License. See License.txt in the project root for
+# license information.
+###########################################################################
+
+import itertools
+import numpy as np
+import onnx
+from onnx import helper
+from onnx import onnx_pb as onnx_proto
+
+
+def _npfloat16_to_int(np_list):
+    '''
+    Convert numpy float16 to python int.
+    :param np_list: numpy float16 list
+    :return int_list: python int list
+    '''
+    return [int(bin(_.view('H'))[2:].zfill(16), 2) for _ in np_list]
+
+
+def convert_tensor_float_to_float16(tensor):
+    '''
+    Convert tensor float to float16.
+    :param tensor: TensorProto object
+    :return tensor_float16: converted TensorProto object
+    Example:
+    ::
+        from onnxmltools.utils.float16_converter import convert_tensor_float_to_float16
+        new_tensor = convert_tensor_float_to_float16(tensor)
+    '''
+    if not isinstance(tensor, onnx_proto.TensorProto):
+        raise ValueError('Expected input type is an ONNX TensorProto but got %s' % type(tensor))
+
+    if tensor.data_type == onnx_proto.TensorProto.FLOAT:
+        tensor.data_type = onnx_proto.TensorProto.FLOAT16
+        # convert float_data (float type) to float16 and write to int32_data
+        if tensor.float_data:
+            int_list = _npfloat16_to_int(np.float16(tensor.float_data))
+            tensor.int32_data[:] = int_list
+            tensor.float_data[:] = []
+        # convert raw_data (bytes type)
+        if tensor.raw_data:
+            # convert n.raw_data to float
+            # (np.frombuffer replaces the deprecated binary mode of np.fromstring)
+            float32_list = np.frombuffer(tensor.raw_data, dtype='float32')
+            # convert float to float16
+            float16_list = np.float16(float32_list)
+            # convert float16 to bytes and write back to raw_data
+            # (tobytes() replaces the deprecated tostring() alias)
+            tensor.raw_data = float16_list.tobytes()
+    return tensor
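The two branches of `convert_tensor_float_to_float16` above can be illustrated without NumPy or ONNX. The `float_data` branch stores each value as the integer whose bits are its IEEE 754 half-precision encoding, and the `raw_data` branch re-encodes 4-byte floats as 2-byte halves. The sketch below uses only the standard library `struct` module. The function names are hypothetical, and little-endian byte order is assumed for `raw_data`.

```python
import struct


def float_to_half_bits(value):
    """Return the 16-bit half-precision pattern of value as a Python int."""
    # 'e' is struct's half-precision format; 'H' reads the same 2 bytes as uint16.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def float32_bytes_to_float16_bytes(raw):
    """Re-encode little-endian float32 bytes as little-endian float16 bytes."""
    count = len(raw) // 4
    values = struct.unpack("<%df" % count, raw)
    # Unlike np.float16, struct raises OverflowError for values beyond the
    # half-precision range instead of producing inf.
    return struct.pack("<%de" % count, *values)


print(hex(float_to_half_bits(1.0)))   # bit pattern of 1.0 in half precision
raw32 = struct.pack("<3f", 1.0, -2.0, 0.5)
raw16 = float32_bytes_to_float16_bytes(raw32)
print(len(raw32), len(raw16))         # the payload shrinks by half
```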
+
+
+def convert_float_to_float16(model):
+    '''
+    Convert tensor float type in the ONNX ModelProto input to tensor float16.
+    :param model: ONNX ModelProto object
+    :return: converted ONNX ModelProto object
+    Examples:
+    ::
+        Example 1: Convert ONNX ModelProto object:
+        from onnxmltools.utils.float16_converter import convert_float_to_float16
+        new_onnx_model = convert_float_to_float16(onnx_model)
+        Example 2: Convert ONNX model binary file:
+        from onnxmltools.utils.float16_converter import convert_float_to_float16
+        from onnxmltools.utils import load_model, save_model
+        onnx_model = load_model('model.onnx')
+        new_onnx_model = convert_float_to_float16(onnx_model)
+        save_model(new_onnx_model, 'new_model.onnx')
+    '''
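+    func_infer_shape = None
+    # compare numerically: a plain string comparison ranks '1.10' below '1.2'
+    if tuple(int(v) for v in onnx.__version__.split('.')[:2]) >= (1, 2):
+        try:
+            from onnx.shape_inference import infer_shapes
+            func_infer_shape = infer_shapes
+        except ImportError:
+            # shape inference is optional; continue without it
+            pass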
+
+    if not isinstance(model, onnx_proto.ModelProto):
+        raise ValueError('Expected model type is an ONNX ModelProto but got %s' % type(model))
+
+    # create black list
+    op_black_list = ['ArrayFeatureExtractor', 'Binarizer', 'CastMap', 'CategoryMapper', 'DictVectorizer',
+                     'FeatureVectorizer', 'Imputer', 'LabelEncoder', 'LinearClassifier', 'LinearRegressor',
+                     'Normalizer', 'OneHotEncoder', 'SVMClassifier', 'SVMRegressor', 'Scaler', 'TreeEnsembleClassifier',
+                     'TreeEnsembleRegressor', 'ZipMap', 'NonMaxSuppression', 'TopK', 'RoiAlign', 'Resize',
+                     'Range', 'CumSum', 'Upsample']
+    input_of_op_black_list = []
+    output_of_op_black_list = []
+    op_need_to_modify = []
+    # create a queue for BFS
+    queue = []
+    value_info_list = []
+    node_list = []
+    # type inference on input model
+    if func_infer_shape is not None:
+        model = func_infer_shape(model)
+    queue.append(model)
+    while queue:
+        next_level = []
+        for q in queue:
+            # if q is model, push q.graph (GraphProto)
+            if isinstance(q, onnx_proto.ModelProto):
+                next_level.append(q.graph)
+            # if q is model.graph, push q.node.attribute (AttributeProto)
+            if isinstance(q, onnx_proto.GraphProto):
+                for n in q.node:
+                    # if n is in the black list (doesn't support float16), no conversion for the node,
+                    # and save the node for further processing
+                    if n.op_type in op_black_list:
+                        input_of_op_black_list += n.input
+                        output_of_op_black_list += n.output
+                        node_list.append(n)
+                    else:
+                        if n.op_type == 'Cast':
+                            for attr in n.attribute:
+                                if attr.name == 'to' and attr.i == 1:
+                                    attr.i = 10
+                                    break
+                        for attr in n.attribute:
+                            next_level.append(attr)
+
+                        # workaround.
+                        # if input of op is output of another op in black list
+                        # need to take care of this situation
+                        if n.op_type == 'Concat':
+                            for input in n.input:
+                                if input in output_of_op_black_list:
+                                    op_need_to_modify.append(n)
+                                    break
+
+            # if q is model.graph.node.attribute, push q.g and q.graphs (GraphProto)
+            # and process node.attribute.t and node.attribute.tensors (TensorProto)
+            if isinstance(q, onnx_proto.AttributeProto):
+                next_level.append(q.g)
+                for n in q.graphs:
+                    next_level.append(n)
+                q.t.CopyFrom(convert_tensor_float_to_float16(q.t))
+                for n in q.tensors:
+                    n = convert_tensor_float_to_float16(n)
+            # if q is graph, process graph.initializer(TensorProto), input, output and value_info (ValueInfoProto)
+            if isinstance(q, onnx_proto.GraphProto):
+                for n in q.initializer:  # TensorProto type
+                    if n.name in input_of_op_black_list:
+                        continue
+                    n = convert_tensor_float_to_float16(n)
+                # for all ValueInfoProto with tensor(float) type in input, output and value_info, convert them to
+                # tensor(float16) except map and seq(map). And save them in value_info_list for further processing
+                for n in itertools.chain(q.input, q.output, q.value_info):
+                    if n.type.tensor_type.elem_type == onnx_proto.TensorProto.FLOAT:
+                        n.type.tensor_type.elem_type = onnx_proto.TensorProto.FLOAT16
+                        value_info_list.append(n)
+        queue = next_level
+
+    # process the nodes in black list that doesn't support tensor(float16)
+    for node in node_list:
+        # if input's name is in the value_info_list meaning input is tensor(float16) type,
+        # insert a float16 to float Cast node before the node,
+        # change current node's input name and create new value_info for the new name
+        for i in range(len(node.input)):
+            input = node.input[i]
+            for value_info in value_info_list:
+                if input == value_info.name:
+                    # create new value_info for current node's new input name
+                    new_value_info = model.graph.value_info.add()
+                    new_value_info.CopyFrom(value_info)
+                    output_name = node.name + '_input_cast_' + str(i)
+                    new_value_info.name = output_name
+                    new_value_info.type.tensor_type.elem_type = onnx_proto.TensorProto.FLOAT
+                    # add Cast node (from tensor(float16) to tensor(float) before current node
+                    node_name = node.name + '_input_cast' + str(i)
+                    new_node = [helper.make_node('Cast', [input], [output_name], to=1, name=node_name)]
+                    model.graph.node.extend(new_node)
+                    # change current node's input name
+                    node.input[i] = output_name
+                    continue
+        # if output's name is in the value_info_list meaning output is tensor(float16) type, insert a float to
+        # float16 Cast node after the node, change current node's output name and create new value_info for the new name
+        for i in range(len(node.output)):
+            output = node.output[i]
+            for value_info in value_info_list:
+                if output == value_info.name:
+                    # create new value_info for current node's new output
+                    new_value_info = model.graph.value_info.add()
+                    new_value_info.CopyFrom(value_info)
+                    input_name = node.name + '_output_cast_' + str(i)
+                    new_value_info.name = input_name
+                    new_value_info.type.tensor_type.elem_type = onnx_proto.TensorProto.FLOAT
+                    # add Cast node (from tensor(float) to tensor(float16) after current node
+                    node_name = node.name + '_output_cast' + str(i)
+                    new_node = [helper.make_node('Cast', [input_name], [output], to=10, name=node_name)]
+                    model.graph.node.extend(new_node)
+                    # change current node's output name
+                    node.output[i] = input_name
+                    continue
+
+        for i in range(len(op_need_to_modify)):
+            target_node = op_need_to_modify[i]
+
+            for j in range(len(target_node.input)):
+                target_input = target_node.input[j]
+
+                if target_input not in output_of_op_black_list:
+                    continue
+
+                if target_input not in node.input and target_input not in node.output:
+                    continue
+
+                # create new value_info for current node's new input name
+                from onnx import TensorProto
+                x = helper.make_tensor_value_info('X', TensorProto.FLOAT, None)
+                new_value_info = model.graph.value_info.add()
+                new_value_info.CopyFrom(x)
+                output_name = target_node.name + '_input_cast_' + str(i)
+                new_value_info.name = output_name
+                new_value_info.type.tensor_type.elem_type = onnx_proto.TensorProto.FLOAT16
+                # add Cast node (from tensor(float) to tensor(float16) after current node
+                node_name = target_node.name + '_output_cast' + str(i)
+                new_node = [helper.make_node('Cast', [node.output[0]], [output_name], to=10, name=node_name)]
+                model.graph.node.extend(new_node)
+                # change current node's input name
+                target_node.input[j] = output_name
+
+                continue
+
+    return model
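The second half of `convert_float_to_float16` keeps blacklisted operators in float32 by surrounding them with `Cast` nodes. A simplified model of that step, using plain dictionaries instead of ONNX protobuf objects, might look like the sketch below. The dictionary layout and the name `wrap_blocked_nodes` are assumptions made for illustration. Only the element type codes (1 for float, 10 for float16) and the `_input_cast_` / `_output_cast_` naming come from the code above.

```python
FLOAT = 1     # ONNX TensorProto.FLOAT
FLOAT16 = 10  # ONNX TensorProto.FLOAT16


def wrap_blocked_nodes(nodes, blocked_ops, float16_names):
    """Insert Cast nodes around blocked ops whose tensors were switched to float16.

    nodes: list of dicts with 'name', 'op_type', 'inputs', 'outputs'.
    Returns the original nodes (rewired in place) followed by the new Cast nodes.
    """
    casts = []
    for node in nodes:
        if node["op_type"] not in blocked_ops:
            continue
        for i, name in enumerate(node["inputs"]):
            if name in float16_names:
                # float16 -> float Cast feeding the blocked node
                cast_out = "%s_input_cast_%d" % (node["name"], i)
                casts.append({"name": cast_out + "_node", "op_type": "Cast",
                              "inputs": [name], "outputs": [cast_out],
                              "to": FLOAT})
                node["inputs"][i] = cast_out
        for i, name in enumerate(node["outputs"]):
            if name in float16_names:
                # float -> float16 Cast restoring the original tensor name
                cast_in = "%s_output_cast_%d" % (node["name"], i)
                casts.append({"name": cast_in + "_node", "op_type": "Cast",
                              "inputs": [cast_in], "outputs": [name],
                              "to": FLOAT16})
                node["outputs"][i] = cast_in
    return nodes + casts


graph = [
    {"name": "conv0", "op_type": "Conv", "inputs": ["x"], "outputs": ["a"]},
    {"name": "topk0", "op_type": "TopK", "inputs": ["a", "k"],
     "outputs": ["values", "indices"]},
]
for n in wrap_blocked_nodes(graph, {"TopK"}, {"x", "a", "values"}):
    print(n["op_type"], n["inputs"], n["outputs"])
```

Restoring the original tensor name on the output side means downstream consumers of `values` need no rewiring.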
--- a/onnxruntime/python/tools/tensorrt/perf/model_list.json
+++ b/onnxruntime/python/tools/tensorrt/perf/model_list.json
@ -0,0 +1,248 @@
+[
+    {
+        "model_name": "BERT-Squad",
+        "working_directory": "./models/bert-squad/",
+        "model_path": "./download_sample_10/bertsquad10.onnx",
+        "test_data_path": "./download_sample_10/"
+    },
+    {
+        "model_name": "FasterRCNN-10",
+        "working_directory": "./models/faster-rcnn/",
+        "model_path": "./faster_rcnn_R_50_FPN_1x.onnx",
+        "test_data_path": "./"
+    },
+    {
+        "model_name": "MaskRCNN-10",
+        "working_directory": "./models/mask-rcnn/",
+        "model_path": "./mask_rcnn_R_50_FPN_1x.onnx",
+        "test_data_path": "./"
+    },
+    {
+        "model_name": "SSD",
+        "working_directory": "./models/ssd/",
+        "model_path": "./model.onnx",
+        "test_data_path": "./"
+    },
+    {
+        "model_name": "TinyYolov2",
+        "working_directory": "./models/tiny-yolov2/",
+        "model_path": "./tiny_yolov2/model.onnx",
+        "test_data_path": "./tiny_yolov2/"
+    },
+    {
+        "model_name": "TinyYolov3",
+        "working_directory": "./models/tiny-yolov3/",
+        "model_path": "./yolov3-tiny.onnx",
+        "test_data_path": "./"
+    },
+    {
+        "model_name": "Yolov3",
+        "working_directory": "./models/yolov3/",
+        "model_path": "./yolov3/yolov3.onnx",
+        "test_data_path": "./yolov3/"
+    },
+    {
+        "model_name": "Yolov4",
+        "working_directory": "./models/yolov4/",
+        "model_path": "./yolov4/yolov4.onnx",
+        "test_data_path": "./custom_test_data/"
+    },
+    {
+        "model_name": "Resnet-152-v1",
+        "working_directory": "./models/resnet152v1/",
+        "model_path": "./resnet152v1/resnet152-v1-7.onnx",
+        "test_data_path": "./resnet152v1/"
+    },
+    {
+        "model_name": "Resnet-152-v2",
+        "working_directory": "./models/resnet152v2/",
+        "model_path": "./resnet152v2/resnet152-v2-7.onnx",
+        "test_data_path": "./resnet152v2/"
+    },
+    {
+        "model_name": "Inception-v1",
+        "working_directory": "./models/inception-v1/",
+        "model_path": "./inception_v1/model.onnx",
+        "test_data_path": "./inception_v1/"
+    },
+    {
+        "model_name": "Inception-v2",
+        "working_directory": "./models/inception-v2/",
+        "model_path": "./inception_v2/model.onnx",
+        "test_data_path": "./inception_v2/"
+    },
+    {
+        "model_name": "Mobilenet-v2-1.0",
+        "working_directory": "./models/mobilenet-v2/",
+        "model_path": "./mobilenetv2-1.0/mobilenetv2-1.0.onnx",
+        "test_data_path": "./mobilenetv2-1.0/"
+    },
+    {
+        "model_name": "Zfnet512",
+        "working_directory": "./models/zfnet512/",
+        "model_path": "./zfnet512/model.onnx",
+        "test_data_path": "./zfnet512/"
+    },
+    {
+        "model_name": "Vgg16",
+        "working_directory": "./models/vgg16/",
+        "model_path": "./vgg16/vgg16.onnx",
+        "test_data_path": "./vgg16/"
+    },
+    {
+        "model_name": "Vgg19-bn",
+        "working_directory": "./models/vgg19-bn/",
+        "model_path": "./vgg19-bn/vgg19-bn.onnx",
+        "test_data_path": "./vgg19-bn/"
+    },
+    {
+        "model_name": "GPT2",
+        "working_directory": "./models/GPT2/",
+        "model_path": "./GPT2/model.onnx",
+        "test_data_path": "./GPT2/"
+    },
+    {
+        "model_name": "GPT2_LM_HEAD",
+        "working_directory": "./models/GPT2-LM-HEAD/",
+        "model_path": "./GPT-2-LM-HEAD/model.onnx",
+        "test_data_path": "./GPT-2-LM-HEAD/"
+    },
+    {
+        "model_name": "mnist",
+        "working_directory": "./models/mnist/",
+        "model_path": "./mnist/model.onnx",
+        "test_data_path": "./mnist/"
+    },
+    {
+        "model_name": "Resnet18-v1",
+        "working_directory": "./models/resnet18v1/",
+        "model_path": "./resnet18-v1-7/resnet18-v1-7.onnx",
+        "test_data_path": "./resnet18-v1-7/"
+    },
+    {
+        "model_name": "Resnet18-v2",
+        "working_directory": "./models/resnet18v2/",
+        "model_path": "./resnet18v2/resnet18-v2-7.onnx",
+        "test_data_path": "./resnet18v2/"
+    },
+    {
+        "model_name": "Resnet34-v1",
+        "working_directory": "./models/resnet34v1/",
+        "model_path": "./resnet34-v1-7/resnet34-v1-7.onnx",
+        "test_data_path": "./resnet34-v1-7/"
+    },
+    {
+        "model_name": "Resnet34-v2",
+        "working_directory": "./models/resnet34v2/",
+        "model_path": "./resnet34v2/resnet34-v2-7.onnx",
+        "test_data_path": "./resnet34v2/"
+    },
+    {
+        "model_name": "Resnet50-v1",
+        "working_directory": "./models/resnet50v1/",
+        "model_path": "./resnet50v1/resnet50v1.onnx",
+        "test_data_path": "./resnet50v1/"
+    },
+    {
+        "model_name": "Resnet50-v2",
+        "working_directory": "./models/resnet50v2/",
+        "model_path": "./resnet50v2/resnet50v2.onnx",
+        "test_data_path": "./resnet50v2/"
+    },
+    {
+        "model_name": "Resnet101",
+        "working_directory": "./models/resnet101/",
+        "model_path": "./resnet101v2/resnet101-v2-7.onnx",
+        "test_data_path": "./resnet101v2/"
+    },
+    {
+        "model_name": "Shufflenet-v1",
+        "working_directory": "./models/shufflenet-v1/",
+        "model_path": "./shufflenet/model.onnx",
+        "test_data_path": "./shufflenet/"
+    },
+    {
+        "model_name": "Shufflenet-v1",
+        "working_directory": "./models/shufflenet-v1/",
+        "model_path": "./shufflenet/model.onnx",
+        "test_data_path": "./shufflenet/"
+    },
+    {
+        "model_name": "Shufflenet-v2",
+        "working_directory": "./models/shufflenet-v2/",
+        "model_path": "./model/test_shufflenetv2/model.onnx",
+        "test_data_path": "./model/test_shufflenetv2"
+    },
+    {
+        "model_name": "Squeezenet1.1",
+        "working_directory": "./models/squeezenet1.1/",
+        "model_path": "./squeezenet1.1/squeezenet1.1.onnx",
+        "test_data_path": "./squeezenet1.1/"
+    },
+    {
+        "model_name": "Emotion-ferplus",
+        "working_directory": "./models/emotion-ferplus/",
+        "model_path": "./emotion_ferplus/model.onnx",
+        "test_data_path": "./emotion_ferplus/"
+    },
+    {
+        "model_name": "bvlc-googlenet",
+        "working_directory": "./models/bvlc-googlenet",
+        "model_path": "./bvlc_googlenet/model.onnx",
+        "test_data_path": "./bvlc_googlenet/"
+    },
+    {
+        "model_name": "bvlc-alexnet",
+        "working_directory": "./models/bvlc-alexnet",
+        "model_path": "./bvlc_alexnet/model.onnx",
+        "test_data_path": "./bvlc_alexnet/"
+    },
+    {
+        "model_name": "bvlc-caffenet",
+        "working_directory": "./models/bvlc-caffenet",
+        "model_path": "./bvlc_reference_caffenet/model.onnx",
+        "test_data_path": "./bvlc_reference_caffenet/"
+    },
+    {
+        "model_name": "bvlc-rcnn-ilsvrc13",
+        "working_directory": "./models/bvlc-rcnn-ilvscr13",
+        "model_path": "./bvlc_reference_rcnn_ilsvrc13/model.onnx",
+        "test_data_path": "./bvlc_reference_rcnn_ilsvrc13/"
+    },
+    {
+        "model_name": "Retinanet",
+        "working_directory": "./models/retinanet",
+        "model_path": "./test_retinanet_resnet101/retinanet-9.onnx",
+        "test_data_path": "./test_retinanet_resnet101/"
+    },
+    {
+        "model_name": "Densenet",
+        "working_directory": "./models/densenet",
+        "model_path": "./densenet121/model.onnx",
+        "test_data_path": "./densenet121/"
+    },
+    {
+        "model_name": "ResNet101-DUC-HDC",
+        "working_directory": "./models/Resnet101-DUC",
+        "model_path": "./ResNet101_DUC_HDC/ResNet101_DUC_HDC.onnx",
+        "test_data_path": "./ResNet101_DUC_HDC/"
+    },
+    {
+        "model_name": "Arc-Face",
+        "working_directory": "./models/arc-face",
+        "model_path": "./resnet100/resnet100.onnx",
+        "test_data_path": "./resnet100/"
+    },
+    {
+        "model_name": "Fast-Neural",
+        "working_directory": "./models/Fast-Neural",
+        "model_path": "./mosaic/mosaic.onnx",
+        "test_data_path": "./mosaic/"
+    },
+    {
+        "model_name": "BiDAF",
+        "working_directory": "./models/BiDAF",
+        "model_path": "./bidaf/bidaf.onnx",
+        "test_data_path": "./bidaf/"
+    }
+]
--- a/onnxruntime/python/tools/tensorrt/perf/models/yolov4/custom_test_data/test_data_set/input0.pb
+++ b/onnxruntime/python/tools/tensorrt/perf/models/yolov4/custom_test_data/test_data_set/input0.pb
--- a/onnxruntime/python/tools/tensorrt/perf/models/yolov4/custom_test_data/test_data_set/output0.pb
+++ b/onnxruntime/python/tools/tensorrt/perf/models/yolov4/custom_test_data/test_data_set/output0.pb
--- a/onnxruntime/python/tools/tensorrt/perf/perf.sh
+++ b/onnxruntime/python/tools/tensorrt/perf/perf.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+python3 benchmark.py -r validate
+python3 benchmark.py -r benchmark -i random -t 100
--- a/onnxruntime/python/tools/tensorrt/perf/perf_utils.py
+++ b/onnxruntime/python/tools/tensorrt/perf/perf_utils.py
@@ -0,0 +1,209 @@
+import subprocess
+import json
+import pprint
+import logging
+import coloredlogs
+import re
+
+debug = False
+debug_verbose = False
+
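+def parse_single_file(f):
+    """Return a provider -> {kernel event name: duration} map for one profile, or None."""
+    try:
+        data = json.load(f)
+    except Exception:  # unreadable or malformed profile
+        return None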
+
+    model_run_flag = False
+    first_run_flag = True
+    provider_op_map = {}  # ep -> {kernel event name: duration} for the second run
+    provider_op_map_first_run = {}  # ep -> {kernel event name: duration} for the first run
+
+    for row in data:
+        if "cat" not in row:
+            continue
+
+        if row["cat"] == "Session":
+            if "name" in row and row["name"] == "model_run":
+                if not first_run_flag:
+                    break
+
+                model_run_flag = True
+                first_run_flag = False
+
+        elif row["cat"] == "Node":
+            if "name" in row and "args" in row and re.search(".*kernel_time", row["name"]):
+                args = row["args"]
+
+                if "op_name" not in args or "provider" not in args:
+                    continue
+
+                provider = args["provider"]
+
+                if first_run_flag:
+                    if provider not in provider_op_map_first_run:
+                        provider_op_map_first_run[provider] = {}
+
+                    op_map = provider_op_map_first_run[provider]
+
+                    if row["name"] in op_map:
+                        provider_op_map[provider] = {}
+                        op_map = provider_op_map[provider]
+                        op_map[row["name"]] = row["dur"]
+                        provider_op_map[provider] = op_map
+                    else:
+                        op_map[row["name"]] = row["dur"]
+                        provider_op_map_first_run[provider] = op_map
+                else:
+                    if provider not in provider_op_map:
+                        provider_op_map[provider] = {}
+
+                    op_map = provider_op_map[provider]
+
+                    # avoid duplicated metrics
+                    if row["name"] not in op_map:
+                        op_map[row["name"]] = row["dur"]
+                        provider_op_map[provider] = op_map
+
+
+    if debug_verbose:
+        # Disable pprint's key sorting so the by-duration ordering below is preserved.
+        pprint.sorted = lambda x, key=None: x
+        pp = pprint.PrettyPrinter(indent=4)
+        print("------First run ops map (START)------")
+        for provider, ops in provider_op_map_first_run.items():
+            print(provider)
+            pp.pprint(dict(sorted(ops.items(), key=lambda item: item[1], reverse=True)))
+
+        print("------First run ops map (END) ------")
+        print("------Second run ops map (START)------")
+        for provider, ops in provider_op_map.items():
+            print(provider)
+            pp.pprint(dict(sorted(ops.items(), key=lambda item: item[1], reverse=True)))
+        print("------Second run ops map (END) ------")
+
+    if model_run_flag:
+        return provider_op_map
+
+    return None
+
+def calculate_cuda_op_percentage(cuda_op_map):
+    """Return the fraction of CUDA + CPU ops that ran on the CUDA execution provider."""
+    if not cuda_op_map:
+        return 0
+
+    cuda_ops = len(cuda_op_map.get("CUDAExecutionProvider", {}))
+    cpu_ops = len(cuda_op_map.get("CPUExecutionProvider", {}))
+    total_ops = cuda_ops + cpu_ops
+
+    # Guard against profiles that contain neither CUDA nor CPU ops.
+    if total_ops == 0:
+        return 0
+
+    return cuda_ops / total_ops
+
+##########################################
+# Return: total ops executed in TRT,
+#         total ops,
+#         ratio of ops executed in TRT
+##########################################
+def calculate_trt_op_percentage(trt_op_map, cuda_op_map):
+    # % of TRT ops
+    total_ops = 0
+    total_cuda_and_cpu_ops = 0
+    for ep in ["CUDAExecutionProvider", "CPUExecutionProvider"]:
+        if ep in cuda_op_map:
+            op_map = cuda_op_map[ep]
+            total_ops += len(op_map)
+
+        if ep in trt_op_map:
+            op_map = trt_op_map[ep]
+            total_cuda_and_cpu_ops += len(op_map)
+
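+    if total_ops == 0:
+        # The ratio below is undefined without any ops in the CUDA profile.
+        raise ValueError("No CUDA or CPU ops found in the CUDA profile; cannot compute the TRT op ratio.")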
+
+    if len(trt_op_map) == 0:
+        total_cuda_and_cpu_ops = total_ops
+
+    #
+    # equation of % TRT ops:
+    # (total ops in cuda json - cuda and cpu ops in trt json)/ total ops in cuda json
+    #
+    ratio_of_ops_in_trt = (total_ops - total_cuda_and_cpu_ops) / total_ops
+    if debug:
+        print("total_cuda_and_cpu_ops: {}".format(total_cuda_and_cpu_ops))
+        print("total_ops: {}".format(total_ops))
+        print("ratio_of_ops_in_trt: {}".format(ratio_of_ops_in_trt))
+
+    return ((total_ops - total_cuda_and_cpu_ops), total_ops, ratio_of_ops_in_trt)
+
+
+##########################################
+# Return: total TRT execution time,
+#         total execution time,
+#         ratio of execution time in TRT
+##########################################
+def calculate_trt_latency_percentage(trt_op_map):
+    # % of TRT execution time
+    total_execution_time = 0
+    total_trt_execution_time = 0
+    for ep in ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]:
+        if ep in trt_op_map:
+            op_map = trt_op_map[ep]
+
+            total_time = 0
+            for key, value in op_map.items():
+                total_time += int(value)
+
+            if ep == "TensorrtExecutionProvider":
+                total_trt_execution_time = total_time
+
+            total_execution_time += total_time
+
+
+
+    if total_execution_time == 0:
+        ratio_of_trt_execution_time = 0
+    else:
+        ratio_of_trt_execution_time = total_trt_execution_time / total_execution_time
+
+    if debug:
+        print("total_trt_execution_time: {}".format(total_trt_execution_time))
+        print("total_execution_time: {}".format(total_execution_time))
+        print("ratio_of_trt_execution_time: {}".format(ratio_of_trt_execution_time))
+
+    return (total_trt_execution_time, total_execution_time, ratio_of_trt_execution_time)
+
+
+
+def get_profile_metrics(path, profile_already_parsed):
+    print("Parsing/Analyzing profiling files in {} ...".format(path))
+    p1 = subprocess.Popen(["find", path, "-name", "onnxruntime_profile*", "-printf", "%T+\t%p\n"], stdout=subprocess.PIPE)
+    p2 = subprocess.Popen(["sort"], stdin=p1.stdout, stdout=subprocess.PIPE)
+    stdout, _ = p2.communicate()
+    stdout = stdout.decode("utf-8").strip()
+    profiling_files = stdout.split("\n") if stdout else []  # "".split("\n") yields [""]
+    print(profiling_files)
+
+    data = []
+    for profile in profiling_files:
+        profile = profile.split('\t')[1]
+        if profile in profile_already_parsed:
+            continue
+        profile_already_parsed.add(profile)
+
+        print("start to parse {} ...".format(profile))
+        with open(profile) as f:
+            op_map = parse_single_file(f)
+            if op_map:
+                data.append(op_map)
+
+    if not data:
+        print("No profile metrics found.")
+        return None
+
+    return data[-1]
+