Refactor Python API docs to better explain IO binding scenarios (#10651)

2026-07-14 18:12:05 +00:00 · 2022-03-15 09:40:59 -07:00 · 2022-03-15 09:40:59 -07:00 · 2d961604b1
commit 2d961604b1
parent f385c73058
2 changed files with 119 additions and 70 deletions
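
Note on the docs text added in this commit: it says kernels are chosen from execution providers in the priority order given in the `providers` list. The sketch below is a toy model of that selection rule only, not ONNX Runtime's actual implementation. The `KERNELS` registry, its operator sets, and the `assign_providers` helper are hypothetical names made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical kernel registry: which operators each provider has a kernel for.
# These sets are invented for illustration and do not reflect real coverage.
KERNELS: Dict[str, Set[str]] = {
    "CUDAExecutionProvider": {"Conv", "MatMul", "Relu"},
    "CPUExecutionProvider": {"Conv", "MatMul", "Relu", "NonMaxSuppression"},
}


def assign_providers(ops: List[str], providers: List[str]) -> Dict[str, str]:
    """Map each operator to the first provider in `providers` that has a kernel for it."""
    assignment: Dict[str, str] = {}
    for op in ops:
        for provider in providers:
            if op in KERNELS.get(provider, set()):
                assignment[op] = provider
                break
        else:
            # No provider in the list offers a kernel for this operator.
            raise ValueError(f"no kernel for {op!r} in {providers}")
    return assignment


placement = assign_providers(
    ["Conv", "Relu", "NonMaxSuppression"],
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(placement)
```

In this toy model, `Conv` and `Relu` land on the CUDA provider because it is listed first, while `NonMaxSuppression` falls back to CPU because only the CPU entry lists it.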
--- a/.github/workflows/publish-python-apidocs.yml
+++ b/.github/workflows/publish-python-apidocs.yml
@@ -27,14 +27,13 @@ jobs:
      - name: Set vars
        id: vars
        run: echo "::set-output name=sha_short::$(git rev-parse --short HEAD)"
-      - name: Check outputs
-        run: echo ${{ steps.vars.outputs.sha_short }}
      - uses: actions/checkout@v2
        with:
          ref: gh-pages
          clean: false    
      - name: Move API docs into target area
        run: |
+          ls docs/api
          rm -rf docs/api/python
          mv build/docs/inference/html docs/api/python   
      - name: Create Pull Request
--- a/docs/python/inference/api_summary.rst
+++ b/docs/python/inference/api_summary.rst
@@ -1,65 +1,107 @@

-===========
-API Summary
-===========
-
-Summary of public functions and classes exposed
-in *ONNX Runtime*.
+===
+API
+===

 .. contents::
    :local:

-OrtValue
-=========
+API Overview
+============

-*ONNX Runtime* works with native Python data structures which are mapped into ONNX data formats :
-Numpy arrays (tensors), dictionaries (maps), and a list of Numpy arrays (sequences).
-The data backing these are on CPU.
+*ONNX Runtime* loads and runs inference on a model in ONNX graph format, or ORT format (for memory and disk constrained environments).

-*ONNX Runtime* supports a custom data structure that supports all ONNX data formats that allows users
-to place the data backing these on a device, for example, on a CUDA supported device. This allows for
-interesting *IOBinding* scenarios (discussed below). In addition, *ONNX Runtime* supports directly
-working with *OrtValue* (s) while inferencing a model if provided as part of the input feed.
+The data consumed and produced by the model can be specified and accessed in the way that best matches your scenario.

-Below is an example showing creation of an *OrtValue* from a Numpy array while placing its backing memory
-on a CUDA device:
+Load and run a model
+--------------------
+
+InferenceSession is the main class of ONNX Runtime. It is used to load and run an ONNX model,
+as well as specify environment and application configuration options. 

 .. code-block:: python

-	# X is numpy array on cpu, create an OrtValue and place it on cuda device id = 0
-	ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
-	ortvalue.device_name()  # 'cuda'
-	ortvalue.shape()  # shape of the numpy array X
-	ortvalue.data_type()  # 'tensor(float)'
-	ortvalue.is_tensor()  # 'True'
+	session = onnxruntime.InferenceSession('model.onnx')
+
+	outputs = session.run(output_names, inputs)  # output_names: list of output names; inputs: dict of input name -> data
+
+ONNX and ORT format models consist of a graph of computations, modeled as operators,
+and implemented as optimized operator kernels for different hardware targets.
+ONNX Runtime orchestrates the execution of operator kernels via `execution providers`.
+An execution provider contains the set of kernels for a specific execution target (CPU, GPU, IoT, etc.).
+Execution providers are configured using the `providers` parameter. Kernels from different execution
+providers are chosen in the priority order given in the list of providers. In the example below,
+if there is a kernel in the CUDA execution provider, ONNX Runtime executes that on GPU. If not,
+the kernel is executed on CPU.
+
+.. code-block:: python
+
+	session = onnxruntime.InferenceSession(model,
+	                                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
+
+The list of available execution providers can be found here: `Execution Providers <https://onnxruntime.ai/docs/execution-providers>`_.
+
+Since ONNX Runtime 1.10, you must explicitly specify the execution provider for your target.
+Running on CPU is the only case in which the API allows the `providers` parameter to be omitted.
+In the examples that follow, the `CUDAExecutionProvider` and `CPUExecutionProvider` are used, assuming the application is running on NVIDIA GPUs.
+Replace these with the execution provider specific to your environment.
+
+You can supply other session configurations via the `sess_options` parameter. For example, to enable
+profiling on the session:
+
+.. code-block:: python
+
+	options = onnxruntime.SessionOptions()
+	options.enable_profiling = True
+	session = onnxruntime.InferenceSession('model.onnx', sess_options=options, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
+
+
+Data inputs and outputs
+-----------------------
+
+The ONNX Runtime Inference Session consumes and produces data using its OrtValue class.
+
+Data on CPU
+^^^^^^^^^^^
+
+On CPU (the default), OrtValues can be mapped to and from native Python data structures: numpy arrays, dictionaries and lists of
+numpy arrays. 
+
+.. code-block:: python
+
+	# X is numpy array on cpu
+	ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X)
+	ortvalue.device_name()  # 'cpu'
+	ortvalue.shape()        # shape of the numpy array X
+	ortvalue.data_type()    # 'tensor(float)'
+	ortvalue.is_tensor()    # 'True'
 	np.array_equal(ortvalue.numpy(), X)  # 'True'

 	# ortvalue can be provided as part of the input feed to a model
-	ses = onnxruntime.InferenceSession('model.onnx')
-	res = sess.run(["Y"], {"X": ortvalue})
+	session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
+	results = session.run(["Y"], {"X": ortvalue})

-IOBinding
-=========
+By default, *ONNX Runtime* always places input(s) and output(s) on CPU. Having the data on CPU
+may not be optimal if the input or output is consumed and produced on a device
+other than CPU because it introduces data copy between CPU and the device.

-By default, *ONNX Runtime* always places input(s) and output(s) on CPU, which 
-is not optimal if the input or output is consumed and produced on a device
-other than CPU because it introduces data copy between CPU and the device. 
-*ONNX Runtime* provides a feature, *IO Binding*, which addresses this issue by
-enabling users to specify which device to place input(s) and output(s) on. 
-Here are scenarios to use this feature. 

-(In the following code snippets, *model.onnx* is the model to execute, 
-*X* is the input data to feed, and *Y* is the output data.)
+Data on device
+^^^^^^^^^^^^^^

-Scenario 1:
+*ONNX Runtime* provides a custom data structure that supports all ONNX data formats and allows users
+to place the data backing these on a device, for example, on a CUDA supported device. In ONNX Runtime,
+this is called `IOBinding`.
+
+To use the `IOBinding` feature, replace `InferenceSession.run()` with `InferenceSession.run_with_iobinding()`.

 A graph is executed on a device other than CPU, for instance CUDA. Users can 
-use IOBinding to put input on CUDA as the follows.
+use IOBinding to copy the data onto the GPU.

 .. code-block:: python

 	# X is numpy array on cpu 
-	session = onnxruntime.InferenceSession('model.onnx')
+	session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
 	io_binding = session.io_binding()
 	# OnnxRuntime will copy the data over to the CUDA device if 'input' is consumed by nodes on the CUDA device 
 	io_binding.bind_cpu_input('input', X)
@@ -67,37 +109,32 @@ use IOBinding to put input on CUDA as the follows.
 	session.run_with_iobinding(io_binding)
 	Y = io_binding.copy_outputs_to_cpu()[0]

-Scenario 2:
-
 The input data is on a device, users directly use the input. The output data is on CPU.

 .. code-block:: python

 	# X is numpy array on cpu
 	X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
-	session = onnxruntime.InferenceSession('model.onnx')
+	session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
 	io_binding = session.io_binding()
 	io_binding.bind_input(name='input', device_type=X_ortvalue.device_name(), device_id=0, element_type=np.float32, shape=X_ortvalue.shape(), buffer_ptr=X_ortvalue.data_ptr())
 	io_binding.bind_output('output')
 	session.run_with_iobinding(io_binding)
 	Y = io_binding.copy_outputs_to_cpu()[0]

-Scenario 3:
-
-The input data and output data are both on a device, users directly use the input and also place output on the device.
+The input data and output data are both on a device. Users directly use the input and also place the output on the device.

 .. code-block:: python

 	#X is numpy array on cpu
 	X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
 	Y_ortvalue = onnxruntime.OrtValue.ortvalue_from_shape_and_type([3, 2], np.float32, 'cuda', 0)  # Change the shape to the actual shape of the output being bound
-	session = onnxruntime.InferenceSession('model.onnx')
+	session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
 	io_binding = session.io_binding()
 	io_binding.bind_input(name='input', device_type=X_ortvalue.device_name(), device_id=0, element_type=np.float32, shape=X_ortvalue.shape(), buffer_ptr=X_ortvalue.data_ptr())
 	io_binding.bind_output(name='output', device_type=Y_ortvalue.device_name(), device_id=0, element_type=np.float32, shape=Y_ortvalue.shape(), buffer_ptr=Y_ortvalue.data_ptr())
 	session.run_with_iobinding(io_binding)

-Scenario 4:

 Users can request *ONNX Runtime* to allocate an output on a device. This is particularly useful for dynamic shaped outputs.
 Users can use the *get_outputs()* API to get access to the *OrtValue* (s) corresponding to the allocated output(s).
@@ -107,7 +144,7 @@ Users can thus consume the *ONNX Runtime* allocated memory for the output as an

 	#X is numpy array on cpu
 	X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
-	session = onnxruntime.InferenceSession('model.onnx')
+	session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
 	io_binding = session.io_binding()
 	io_binding.bind_input(name='input', device_type=X_ortvalue.device_name(), device_id=0, element_type=np.float32, shape=X_ortvalue.shape(), buffer_ptr=X_ortvalue.data_ptr())
 	#Request ONNX Runtime to bind and allocate memory on CUDA for 'output'
@@ -117,7 +154,7 @@ Users can thus consume the *ONNX Runtime* allocated memory for the output as an
 	ort_output = io_binding.get_outputs()[0]


-Scenario 5:
+In addition, *ONNX Runtime* supports directly working with *OrtValue* (s) while inferencing a model if provided as part of the input feed.

 Users can bind *OrtValue* (s) directly.

@@ -127,39 +164,52 @@ Users can bind *OrtValue* (s) directly.
 	#X is numpy array on cpu
 	X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
 	Y_ortvalue = onnxruntime.OrtValue.ortvalue_from_shape_and_type([3, 2], np.float32, 'cuda', 0)  # Change the shape to the actual shape of the output being bound
-	session = onnxruntime.InferenceSession('model.onnx')
+	session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
 	io_binding = session.io_binding()
 	io_binding.bind_ortvalue_input('input', X_ortvalue)
 	io_binding.bind_ortvalue_output('output', Y_ortvalue)
 	session.run_with_iobinding(io_binding)

-Device
-======

-The package is compiled for a specific device, GPU or CPU.
-The CPU implementation includes optimizations
-such as MKL (Math Kernel Libary). The following function
-indicates the chosen option:
+You can also bind inputs and outputs directly to a PyTorch tensor.

-.. autofunction:: onnxruntime.get_device
+.. code-block:: python

-Examples and datasets
-=====================
+    # X is a PyTorch tensor on device
+    session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
+    binding = session.io_binding()

-The package contains a few models stored in ONNX format
-used in the documentation. These don't need to be downloaded
-as they are installed with the package.
+    X_tensor = X.contiguous()

-.. autofunction:: onnxruntime.datasets.get_example
+    binding.bind_input(
+        name='X',
+        device_type='cuda',
+        device_id=0,
+        element_type=np.float32,
+        shape=tuple(X_tensor.shape),
+        buffer_ptr=X_tensor.data_ptr(),
+    )

-Load and run a model
-====================
+    ## Allocate the PyTorch tensor for the model output
+    Y_shape = ... # You need to specify the output PyTorch tensor shape
+    Y_tensor = torch.empty(Y_shape, dtype=torch.float32, device='cuda:0').contiguous()
+    binding.bind_output(
+        name='Y',
+        device_type='cuda',
+        device_id=0,
+        element_type=np.float32,
+        shape=tuple(Y_tensor.shape),
+        buffer_ptr=Y_tensor.data_ptr(),
+    )

-*ONNX Runtime* reads a model saved in ONNX format.
-The main class *InferenceSession* wraps these functionalities
-in a single place.
+    session.run_with_iobinding(binding)

-Main class
+
+API Details
+===========
+
+
+InferenceSession
 ----------

 .. autoclass:: onnxruntime.InferenceSession