# ONNX Runtime Performance Tuning

## Why do we need to tune performance?
ONNX Runtime is designed to be open and extensible with its concept of "Execution Provider" to represent different execution kernels. See the [design overview](./HighLevelDesign.md). 

ONNX Runtime supports a variety of execution providers across CPU and GPU: [see the list here](../README.md#high-performance).
No single configuration performs best across all models and hardware. Even within a single execution provider, there are often several knobs to tune (e.g. thread count, wait policy).

This document covers basic tools and knobs that can be leveraged to find the best performance for your model and hardware.

## Is there a tool to help with performance tuning?
Yes, the onnxruntime_perf_test.exe tool (available from the build drop) can be used to test various knobs. Run `onnxruntime_perf_test.exe -h` for usage instructions.

Additionally, the [ONNX Go Live "OLive" tool](https://github.com/microsoft/OLive) provides an easy-to-use pipeline for converting models to ONNX and optimizing performance with ONNX Runtime. The tool can help identify the optimal runtime configuration to get the best performance on the target hardware for the model. For quickstart, check out the notebooks on how to use OLive [here](https://github.com/microsoft/OLive/blob/master/notebook/Convert_Models_and_Tune_Performance_with_OLive_Python_SDK.ipynb) (using Python) and [here](https://github.com/microsoft/OLive/blob/master/notebook/Convert_Models_and_Tune_Performance_with_OLive_Docker_Images.ipynb) (using Docker). 

## Using different execution providers

### Python API
Official Python packages on PyPI only support the default CPU (MLAS) and default GPU (CUDA) execution providers. For other execution providers, you need to build from source. Please refer to the [build instructions](../BUILD.md). The recommended commands below build the wheel with debug info, using a parallel build.

For example: 

`DNNL: ./build.sh --config RelWithDebInfo --use_dnnl --build_wheel --parallel`

`CUDA: ./build.sh --config RelWithDebInfo --use_cuda --build_wheel --parallel`


### C and C# API
Official release (nuget package) supports default (MLAS) and MKL-ML for CPU, and CUDA for GPU. For other execution providers, you need to build from source. Append `--build_csharp` to the instructions to build both C# and C packages.

For example:

`DNNL: ./build.sh --config RelWithDebInfo --use_dnnl --build_csharp --parallel`

`CUDA: ./build.sh --config RelWithDebInfo --use_cuda --build_csharp --parallel`

To use the DNNL, nGraph, CUDA, or TensorRT execution provider, you need to call the matching `OrtSessionOptionsAppendExecutionProvider_*` C API function. Here is an example for the CUDA execution provider:

C API Example:
```c
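  // Error handling is omitted for brevity: each call returns an OrtStatus* that should be checked.
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  OrtEnv* env;
  g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env);
  OrtSessionOptions* session_options;
  g_ort->CreateSessionOptions(&session_options);
  // The CUDA provider factory is a free function, not a member of OrtApi. The second argument is the device id.
  OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
  OrtSession* session;
  g_ort->CreateSession(env, model_path, session_options, &session);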
```

C# API Example:
```c#
SessionOptions so = new SessionOptions();
so.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED;
so.AppendExecutionProvider_CUDA(0);
var session = new InferenceSession(modelPath, so);
```

Python API Example:
```python
import onnxruntime as rt

so = rt.SessionOptions()
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
session = rt.InferenceSession(model, sess_options=so)
session.set_providers(['CUDAExecutionProvider'])
```
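
Because the set of execution providers depends on how the package was built, it can help to check what is available before requesting one. The sketch below is only an illustration: `pick_providers` is a hypothetical helper, not part of the ONNX Runtime API, and the hard-coded `available` list stands in for what your build reports (for example via `rt.get_available_providers()`, assuming your version exposes it).

```python
def pick_providers(available, preferred):
    """Return the preferred providers that are actually available, in preference order.

    Falls back to 'CPUExecutionProvider' when none of the preferred ones are present.
    """
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


# Hard-coded list standing in for the providers reported by the installed build.
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(available, ["TensorrtExecutionProvider", "CUDAExecutionProvider"])
print(providers)  # ['CUDAExecutionProvider']
```

With a real session, the result could then be passed to `session.set_providers(providers)`.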

## How to tune performance for a specific execution provider?
* In general, if ORT is built with OpenMP, use the OpenMP environment variables to control the number of intra-op threads.
* If ORT is not built with OpenMP, use the appropriate ORT API to control the number of intra-op threads.
* The number of inter-op threads (used only when parallel execution is enabled) is not affected by OpenMP settings and should
always be set using the ORT APIs.

### Default CPU Execution Provider (MLAS)
The default execution provider uses different knobs to control the thread number.

For the default CPU execution provider, you can try the following knobs in the Python API:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()

sess_options.intra_op_num_threads = 2
sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
```

* Thread Count
  * `sess_options.intra_op_num_threads = 2` controls the number of threads used to parallelize execution within an operator
* Sequential vs Parallel Execution
  * `sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL` controls whether the operators in the graph run sequentially or in parallel. When a model has many branches, setting this option to `rt.ExecutionMode.ORT_PARALLEL` will usually provide better performance.
  * When `sess_options.execution_mode = rt.ExecutionMode.ORT_PARALLEL`, you can set `sess_options.inter_op_num_threads` to control the
number of threads used to parallelize the execution of the graph (across nodes).

* Graph Optimization Level
  * `sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL` enables all graph optimizations. The default is already `ORT_ENABLE_ALL` (99). See [onnxruntime_c_api.h](../include/onnxruntime/core/session/onnxruntime_c_api.h#L241) (enum `GraphOptimizationLevel`) for the full list of optimization levels. For details on the available optimizations and their usage, refer to the [Graph Optimizations Doc](../docs/ONNX_Runtime_Graph_Optimizations.md).
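
Since the best values for these knobs depend on the model and hardware, one option is to time a few combinations. The following is a minimal sketch, not an official tool: `sweep` and `make_dummy_runner` are hypothetical names, and the dummy runner stands in for a factory that would build an `InferenceSession` from `SessionOptions` and run one inference.

```python
import itertools
import time


def sweep(make_runner, intra_threads, modes, repeats=20):
    """Time each (intra-op thread count, execution mode) combination.

    make_runner(threads, mode) must return a zero-argument callable that
    performs one inference. Returns (avg_seconds, threads, mode) tuples,
    fastest first.
    """
    results = []
    for threads, mode in itertools.product(intra_threads, modes):
        run = make_runner(threads, mode)
        run()  # warm-up call, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            run()
        avg = (time.perf_counter() - start) / repeats
        results.append((avg, threads, mode))
    return sorted(results, key=lambda r: r[0])


# Stand-in runner so the sketch executes without a model or onnxruntime installed.
def make_dummy_runner(threads, mode):
    return lambda: sum(range(1000))


ranking = sweep(make_dummy_runner, intra_threads=[1, 2, 4],
                modes=["sequential", "parallel"], repeats=5)
for avg, threads, mode in ranking:
    print(f"{threads} threads, {mode}: {avg * 1e6:.1f} us")
```

With ONNX Runtime, the factory would set `sess_options.intra_op_num_threads` and `sess_options.execution_mode` before creating the session, and the callable would run the session on representative inputs.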

### MKL_DNN/nGraph/MKL_ML Execution Provider
MKL_DNN, MKL_ML and nGraph all depend on OpenMP for parallelization. For these execution providers, use the OpenMP environment variables to tune performance.

The most widely used environment variables are:

* OMP_NUM_THREADS=n
  * Controls the thread pool size

* OMP_WAIT_POLICY=PASSIVE/ACTIVE
  * Controls whether thread spinning is enabled
  * PASSIVE is also called throughput mode and will yield the CPU after finishing the current task
  * ACTIVE will not yield the CPU; instead, it spins in a loop, checking whether the next task is ready
  * Use PASSIVE if your CPU usage is already high, and use ACTIVE when you want to trade CPU usage for lower latency
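
These variables are normally set in the shell, but they can also be set from Python. The sketch below assumes the OpenMP runtime reads them once at initialization, so they must be set before `onnxruntime` is imported. That is the usual behavior, but it is worth verifying for your build. `configure_openmp` is a hypothetical helper name.

```python
import os


def configure_openmp(num_threads, wait_policy="PASSIVE"):
    """Set the OpenMP variables described above for the current process."""
    if wait_policy not in ("PASSIVE", "ACTIVE"):
        raise ValueError("wait_policy must be 'PASSIVE' or 'ACTIVE'")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    os.environ["OMP_WAIT_POLICY"] = wait_policy


# Call this before `import onnxruntime` so the OpenMP runtime sees the values.
configure_openmp(num_threads=4, wait_policy="PASSIVE")
print(os.environ["OMP_NUM_THREADS"], os.environ["OMP_WAIT_POLICY"])  # 4 PASSIVE
```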


## Profiling and Performance Report

You can enable ONNX Runtime latency profiling in code:

```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
```
If you are using the onnxruntime_perf_test.exe tool, you can add `-p [profile_file]` to enable performance profiling.

In both cases, you will get a JSON file that contains detailed performance data (threading, latency of each operator, etc.). This file is a standard performance tracing file. To view it in a user-friendly way, open it with chrome://tracing:
* Open the Chrome browser
* Type chrome://tracing in the address bar
* Load the generated JSON file
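
The same file can also be summarized in a script. The sketch below assumes the profile is a JSON array of trace events in which complete events carry `"ph": "X"`, a `name`, and a `dur` in microseconds, as in the Chrome trace event format. The `sample` data and its event names are made up for illustration, so check the field names against your own profile output.

```python
import json
from collections import defaultdict


def summarize_profile(events, top=5):
    """Sum the 'dur' field (microseconds) per event name, largest total first."""
    totals = defaultdict(int)
    for event in events:
        # Only complete events ("ph": "X") are assumed to carry a duration.
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical sample standing in for the contents of the generated JSON file.
sample = json.loads("""
[
  {"cat": "Node", "name": "conv1_kernel_time", "ph": "X", "ts": 10, "dur": 120},
  {"cat": "Node", "name": "relu1_kernel_time", "ph": "X", "ts": 130, "dur": 15},
  {"cat": "Node", "name": "conv1_kernel_time", "ph": "X", "ts": 200, "dur": 110}
]
""")

for name, total_us in summarize_profile(sample):
    print(f"{name}: {total_us} us")
```

For a real profile, `sample` would be replaced by the result of `json.load` on the generated file.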

## Performance Tuning for BERT Models

For BERT models, ONNX Runtime sometimes cannot apply the best optimizations, for reasons such as framework version updates. In this case, we recommend trying the [BERT optimization tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert), which reflects the latest changes in graph pattern matching and model conversions, and a set of [notebooks](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert/notebooks) for a quick start.


## Model graph is not optimized even with graph_optimization_level set to ORT_ENABLE_ALL?

In ONNX models from IR_VERSION 4 onward, only initializers that also appear in the graph inputs are treated as non-constant. Such initializers can prevent some graph optimizations, such as constant folding and operator fusion. If there is no need to override them, move the initializers out of the graph inputs, either by re-generating the model with the latest exporter/converter or by using the tool onnxruntime/tools/python/remove_initializer_from_input.py.
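
The idea behind that fix can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation of the tool: `remove_initializers_from_inputs` is a hypothetical helper that relies only on the `graph.input` and `graph.initializer` fields of an ONNX model, and the `SimpleNamespace` objects stand in for a real model so the example runs without the `onnx` package.

```python
from types import SimpleNamespace


def remove_initializers_from_inputs(model):
    """Drop graph inputs that share a name with an initializer; return the removed names."""
    graph = model.graph
    initializer_names = {init.name for init in graph.initializer}
    removed = []
    # Iterate over a copy because the input list is modified in place.
    for graph_input in list(graph.input):
        if graph_input.name in initializer_names:
            graph.input.remove(graph_input)
            removed.append(graph_input.name)
    return removed


# Stand-in for a loaded model: "W" is both a graph input and an initializer.
model = SimpleNamespace(graph=SimpleNamespace(
    input=[SimpleNamespace(name="x"), SimpleNamespace(name="W")],
    initializer=[SimpleNamespace(name="W")],
))
print(remove_initializers_from_inputs(model))  # ['W']
print([i.name for i in model.graph.input])     # ['x']
```

With the `onnx` package, the model would instead come from `onnx.load(path)` and be written back with `onnx.save(model, out_path)`.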