# ONNX Runtime High Level Design

This document outlines the high level design of
ONNX Runtime - a high performance, cross platform engine.

## Key objectives
* Maximally and automatically leverage the custom accelerators and runtimes
available on disparate platforms.
* Provide the right abstraction and runtime support for custom accelerators and
runtimes. We call this abstraction an [execution
provider](../include/onnxruntime/core/framework/execution_provider.h). It defines and exposes a set of
its capabilities to ONNXRuntime: a set of single or fused nodes it can
execute, its memory allocator, and more. Custom accelerators and runtimes are
instances of execution providers.
* We don't expect that an execution provider can always run an ONNX model fully
on its device. This means that ONNXRuntime must be able to execute a single
model in a heterogeneous environment involving multiple execution providers.
* Provide support for high-level optimizations that can be expressed as
model-to-model transformations via a [graph-transformation
API](../include/onnxruntime/core/optimizer/graph_transformer.h). Such
transformations fall into two categories: global transformations, those that
require analysis and transformation of the entire graph, and local
transformations, which can be captured as simple (algebraic) [rewriting
rules](../include/onnxruntime/core/optimizer/rewrite_rule.h).

## High-level system architecture
The flow is quite simple. Starting from an ONNX model, ONNXRuntime first
converts the model graph into its in-memory graph representation. It then
applies a number of graph transformations that a) perform a set of
provider-independent optimizations, such as cast transformations between float16 and float32, and b) partition the
graph into a set of subgraphs based on the available execution providers. Each
subgraph is assigned to an execution provider. We ensure that a subgraph can be
executed by an execution provider by querying the capability of the execution
provider using the GetCapability() API.

![ONNXRuntime high level system architecture](https://azurecomcdn.azureedge.net/mediahandler/acomblog/media/Default/blog/228d22d3-6e3e-48b1-811c-1d48353f031c.png)

### More about partitioning
ONNXRuntime partitions a model graph into subgraphs based on the available execution providers, with each subgraph assigned to a single provider. ONNXRuntime provides
a default execution provider that is used as the fallback for the
operators that cannot be pushed onto the more specialized but more efficient
execution providers. Intuitively, we want to push computation to more
specialized execution providers whenever possible.

We use a simple graph partitioning technique. The available execution providers
will be considered in a specific order, and each will be assigned the maximal
subgraphs (possibly more than one) that it is able to handle. The
ONNXRuntime-provided default execution provider will be the last one
considered, and it ensures completeness. More sophisticated optimizations can be
considered in the future (or can even be implemented as a composite execution
provider).
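
The ordering rule above can be illustrated with a small sketch. It is a toy model under stated assumptions, not ONNXRuntime's implementation: the graph is simplified to a linear chain of operator types, the provider names (`AcceleratorEP`, `CpuEP`) and their supported operator sets are hypothetical, and a membership test stands in for the GetCapability() query.

```python
from itertools import groupby

# Hypothetical providers in priority order. The last entry plays the role of
# the default provider and supports every operator used in this toy model.
PROVIDERS = [
    ("AcceleratorEP", {"Conv", "Relu"}),
    ("CpuEP", {"Conv", "Relu", "MaxPool", "Softmax"}),
]


def assign_providers(op_types, providers):
    """Give each node to the first provider, in priority order, that supports it."""
    assignment = []
    for op in op_types:
        for name, supported in providers:
            if op in supported:  # stand-in for a capability query
                assignment.append(name)
                break
        else:
            raise ValueError(f"no provider supports operator {op!r}")
    return assignment


def partition(op_types, providers):
    """Group consecutive nodes assigned to the same provider into maximal runs."""
    pairs = zip(assign_providers(op_types, providers), op_types)
    return [
        (name, [op for _, op in run])
        for name, run in groupby(pairs, key=lambda pair: pair[0])
    ]


model = ["Conv", "Relu", "MaxPool", "Conv", "Relu", "Softmax"]
subgraphs = partition(model, PROVIDERS)
for provider_name, ops in subgraphs:
    print(provider_name, ops)
```

In this example the hypothetical accelerator receives two separate subgraphs, and the fallback provider picks up the operators the accelerator does not support, which is what keeps the assignment complete.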

Conceptually, each partition is reduced to a single fused operator. It is
created by invoking the execution provider's Compile() method and is wrapped as a
custom operator. Currently, only the synchronous mode of execution is supported. An execution
provider exposes its memory allocator, which is used to allocate the input
tensors for the execution provider. The rewriting and partitioning transform the
initial model graph into a new graph composed of operators assigned to either
the default execution provider or other registered execution
providers. The ONNXRuntime execution engine is responsible for running this graph.
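
As a rough illustration of reducing a partition to a single fused operator, the sketch below composes the per-node functions of one partition into one callable. The function `compile_partition` and the node functions are hypothetical stand-ins; the real Compile() method operates on graph nodes and produces provider-specific code.

```python
from functools import reduce


# Hypothetical per-node functions for one partition; each maps a list to a list.
def relu(xs):
    return [max(0.0, x) for x in xs]


def double(xs):
    return [2.0 * x for x in xs]


def compile_partition(node_fns):
    """Fuse a sequence of node functions into one callable (toy stand-in for Compile())."""
    def fused(xs):
        # Apply each node in order, feeding its output to the next node.
        return reduce(lambda acc, fn: fn(acc), node_fns, xs)
    return fused


fused_op = compile_partition([relu, double])
output = fused_op([-1.0, 0.5, 2.0])
print(output)
```

The caller sees one operator with one input and one output, regardless of how many nodes the partition originally contained.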

## Key design decisions
* Multiple threads can invoke the Run() method on the same
inference session object. See [API doc](C_API.md) for more details.
* To facilitate this, the Compute() function of all kernels is const,
implying that the kernels are stateless.
* Implementations of the operators by execution providers are called
kernels. Each execution provider supports a subset of the (ONNX)
operators/kernels.
* The ONNX Runtime guarantees that all operators are supported by the default
execution provider.
* Tensor representation: ONNXRuntime will utilize a standard representation for
the tensor runtime values. The execution providers can internally use a
different representation if they choose to, but it is their responsibility to
convert the values from/to the standard representation at the boundaries of
their subgraph.
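
The link between stateless kernels and concurrent Run() calls can be sketched as follows. This is a minimal analogy, not the ONNXRuntime API: `ScaleKernel` and `ToySession` are hypothetical classes, and a frozen dataclass is used to approximate a const Compute() function.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleKernel:
    """Hypothetical kernel whose attributes are fixed at construction time."""
    factor: float

    def compute(self, inputs):
        # Reads only the immutable attribute and the per-call inputs;
        # nothing is written to self, so concurrent calls cannot interfere.
        return [self.factor * x for x in inputs]


class ToySession:
    """Hypothetical stand-in for an inference session that shares its kernels."""

    def __init__(self, kernels):
        self._kernels = tuple(kernels)

    def run(self, inputs):
        values = inputs  # per-call state lives in local variables only
        for kernel in self._kernels:
            values = kernel.compute(values)
        return values


session = ToySession([ScaleKernel(2.0), ScaleKernel(0.5), ScaleKernel(3.0)])
requests = [[float(i), float(i + 1)] for i in range(8)]

# Several threads call run() on the same session object at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(session.run, requests))
```

Because all per-request data is held in local variables, no locking is needed around the shared kernels in this sketch.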

## Extensibility Options
* [Add a custom operator/kernel](AddingCustomOp.md)
* [Add an execution provider](AddingExecutionProvider.md)
* [Add a new graph
transform](../include/onnxruntime/core/optimizer/graph_transformer.h)
* [Add a new rewrite rule](../include/onnxruntime/core/optimizer/rewrite_rule.h)

## The ONNX Runtime and Windows OS integration

The ONNX Runtime shipped with the Windows operating system in build 1809 (RS5). The runtime was embedded inside Windows.AI.MachineLearning.dll and was exposed via that DLL's WinRT API (WinML for short). It includes CPU support and a DirectML execution provider for GPU support. Since then, it has continued to ship in every version of Windows.

Starting with the ONNX Runtime 1.2 release, we are bringing a new layered architecture to the ONNX Runtime and Windows ML.
*Note: This feature is in preview as of the 1.2 release.*

The high level design looks like this:

![ONNX + WinML layered architecture](images/layered-architecture.png)

As the diagram shows, the embedded ONNX Runtime has been replaced with the new ONNXRuntime.dll. With this approach, customers have flexibility in which API they use and in how they distribute the binaries.

### API choice

Developers can now choose which API works best for their scenario.

||WinRT|C API|
|--|--|--|
|Type system| Integration with Windows Runtime (WinRT) types| Platform-neutral types|
|Language support| Language support via WinRT projections| Language support via per-language projections|
|Tensorization| Accepts VideoFrames and converts to tensors (support for CPU and GPU)| Accepts tensors|

### Distribution choice

You can also choose to use the runtime included in the Windows OS, or use the redistributable NuGet package to ship the runtime with the app.

|Distribution|Inbox|App NuGet|
|--|--|--|
|Disk footprint| Included in the OS| Included in the App|
|Servicing fixes| Serviced by OS updates| Serviced by the App|
|Execution Providers| CPU & DirectML EP | App chosen EP|
|Compatibility testing| Tested with OS flights against supported GPUs and CPUs | App performs compatibility testing|
|Opset| Refreshed in OS updates| App chooses|

 
### Using the NuGet WinRT API with other C-API distributions
The WinRT API NuGet is distributed with a curated build of the OnnxRuntime engine. App developers may wish to use the WinRT API, but find themselves limited to the functionality provided by the curated OnnxRuntime engine distributed as part of the WinRT API NuGet package. This can happen because the OnnxRuntime engine shipped with the WinRT API NuGet package only contains the CPU and DML execution providers.

App developers may also wish to use a custom build-from-source version of the OnnxRuntime engine, or a prebuilt version of the OnnxRuntime engine from another distribution source such as the Microsoft.ML.OnnxRuntime.MKLML distribution.

To enable this, the WinRT API NuGet is compatible with a set of OnnxRuntime engines that ship in different NuGet packages.

The following distributions contain compatible OnnxRuntime engines:
- [Microsoft.ML.OnnxRuntime](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime)
- [Microsoft.ML.OnnxRuntime.DirectML](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.DirectML/)
- [Microsoft.ML.OnnxRuntime.MKLML](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.MKLML)

Note that compatible distributions must match in release version.

In order to use compatible engines, replace the onnxruntime.dll with the desired engine binary and its associated binaries.