ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
pengwa 0471f6fbb3
Check type for building gradient graph (#17046)
### Check type for building gradient graph

**Bug1**: 

To fix the error when running the model with ORTModule + Stage 3:

```
Exception happens when running  <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
    wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients

```

This happens because the 3rd input of PythonOpA is int64, yet the check
in the gradient builder decides that it requires a gradient, so we set
its requires_grad = True. PyTorch rejects this and throws the exception
above. So we need to understand why the ORT gradient builder thinks the
3rd input needs a gradient.


`ReverseBFSWithStopGradient` does a reverse BFS from the graph outputs
and collects all nodes that are needed to compute them. It defines a
queue that initially holds all nodes that generate graph outputs, then
visits the nodes one by one, checking each node's inputs: if an input
does not hit a stop edge and its node arg type is an allowed type
(float, etc.), the node producing that input is appended to the queue
for a later iteration.
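The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph encoding, the `DIFFERENTIABLE_TYPES` set, the function name, and every node name other than PythonOpA and PythonOpB are hypothetical, and none of it is ONNX Runtime's actual C++ implementation.

```python
from collections import deque

# Hypothetical set of types allowed to carry gradients (an assumption for this sketch).
DIFFERENTIABLE_TYPES = {"float16", "bfloat16", "float32", "float64"}


def reverse_bfs_with_stop_gradient(output_nodes, producers, stop_edges):
    """Collect the nodes needed to compute the graph outputs.

    producers:  dict mapping a node name to a list of
                (input_index, producer_node_name, arg_type) tuples.
    stop_edges: set of (node_name, input_index) pairs that block gradient flow.
    """
    visited = set(output_nodes)
    queue = deque(output_nodes)  # start from the nodes that generate graph outputs
    while queue:
        node = queue.popleft()
        for input_index, producer, arg_type in producers.get(node, []):
            if (node, input_index) in stop_edges:
                continue  # the input hits a stop edge
            if arg_type not in DIFFERENTIABLE_TYPES:
                continue  # e.g. int64 inputs are not followed
            if producer not in visited:
                visited.add(producer)
                queue.append(producer)
    return visited


# Toy graph: Loss <- PythonOpA <- {Linear (float32), PythonOpB (int64)}
graph = {
    "Loss": [(0, "PythonOpA", "float32")],
    "PythonOpA": [(0, "Linear", "float32"), (2, "PythonOpB", "int64")],
    "Linear": [(0, "Embedding", "float32"), (1, "Labels", "float32")],
}
reachable = reverse_bfs_with_stop_gradient(
    ["Loss"], graph, stop_edges={("Linear", 1)}
)
```

In this toy graph, PythonOpB is not reached through the int64 edge and Labels is not reached through the stop edge, while PythonOpA, Linear and Embedding are.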

PythonOpA is one of the nodes needed to compute the graph outputs, so
IsReachable(PythonOpA) returns True.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/c4c53fb9-15f7-4e8d-9aa2-7fc20555a001)

In the above code snippet, when node is PythonOpB and next_node is
PythonOpA, we did not check the node_arg type on the connection from
PythonOpB's outputs to PythonOpA's 3rd input. So we appended the
int64-typed node arg to the sets that require gradient.


**Fix1**: add the node arg type check before appending it to the
require-grad lists.
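A minimal Python sketch of the idea behind Fix1, assuming a simplified edge list; the function name, the `check_type` switch, the `GRAD_ALLOWED_TYPES` set, and the node arg names are all hypothetical and only mirror the description above, not the real gradient builder code.

```python
# Hypothetical set of types allowed to require gradients (an assumption for this sketch).
GRAD_ALLOWED_TYPES = {"float16", "bfloat16", "float32", "float64"}


def collect_args_needing_grad(edges, reachable, check_type=True):
    """Return the names of node args that should require a gradient.

    edges: list of (producer, consumer, arg_name, arg_type) tuples.
    reachable: set of node names needed to compute the graph outputs.
    check_type=False imitates the behavior before Fix1.
    """
    needs_grad = set()
    for producer, consumer, arg_name, arg_type in edges:
        if consumer not in reachable:
            continue  # the consumer does not contribute to the graph outputs
        if check_type and arg_type not in GRAD_ALLOWED_TYPES:
            continue  # Fix1: skip node args whose type cannot carry a gradient
        needs_grad.add(arg_name)
    return needs_grad


# PythonOpB feeds PythonOpA through a float edge and an int64 edge.
edges = [
    ("PythonOpB", "PythonOpA", "b_out_float", "float32"),
    ("PythonOpB", "PythonOpA", "b_out_index", "int64"),
]
reachable = {"PythonOpA", "PythonOpB"}

before_fix = collect_args_needing_grad(edges, reachable, check_type=False)
after_fix = collect_args_needing_grad(edges, reachable, check_type=True)
```

Without the type check, the int64 node arg ends up in the require-grad set; with it, only the float node arg remains.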


After this fix, a unit test failed:
"orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min]
Fatal Python error: Segmentation fault". Investigation showed that it is
another bug.

**Bug2**: 

Without Fix1 above, the execution graph looks like this:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/b2fd4b03-95c7-414a-b268-2ba6a7300105)

As you can see, the int64 type has a gradient edge built even though no
consumer uses it. Execution runs fine, but on reflection an int type
should not have a grad edge built.

With Fix1, the execution graph looks like this:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/1870d3cc-2fe5-4aa7-ad6b-0d88dcc40f8a)

So the int-typed node arg no longer has a gradient edge built. **Fix1**
fixes this problem.

But another bug appears when the initial "y_node_arg_names" include a
non-float output, e.g. in this case Aten's two outputs: the 1st one is
float, the 2nd one is int. When we check the y_node
(6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)),
we did not check the data type and added it to `y_node_args_`, which is
the list of graph output node args that require gradient. As a result,
`non_differentiable_y_node_arg_names_` did not contain the int-typed
graph output.

Then
6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)
tries to get the grad node arg into `yield_output_node_args`, BUT the
grad node arg is not built for the int-typed node arg (with **Fix1**).
So we inserted a nullptr, and using it later caused a segmentation fault.

**Fix2**

Again, we add the type check when handling y_node_args, and also add a
null check when getting the gradient node arg and appending it to
yield_output_node_args.
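
The two parts of Fix2 can be illustrated with a short Python sketch. The helper names, the `_grad` suffix, the output names, and the choice to skip (rather than raise on) a missing gradient are assumptions made for illustration; the actual C++ code may handle the null case differently.

```python
# Hypothetical set of types allowed to require gradients (an assumption for this sketch).
GRAD_ALLOWED_TYPES = {"float16", "bfloat16", "float32", "float64"}


def split_graph_outputs(y_node_args):
    """Split graph outputs into differentiable and non-differentiable names.

    y_node_args: list of (name, arg_type) tuples.
    """
    differentiable, non_differentiable = [], []
    for name, arg_type in y_node_args:
        if arg_type in GRAD_ALLOWED_TYPES:
            differentiable.append(name)
        else:
            non_differentiable.append(name)  # type check when handling y_node_args
    return differentiable, non_differentiable


def collect_yield_output_grads(output_names, grad_node_args):
    """Gather gradient node args, never storing a missing (None) entry."""
    collected = []
    for name in output_names:
        grad = grad_node_args.get(name + "_grad")  # "_grad" suffix is illustrative
        if grad is None:
            continue  # null check: no gradient was built for this output
        collected.append(grad)
    return collected


# An Aten-like node with one float output and one int output.
outputs = [("aten_values", "float32"), ("aten_indices", "int64")]
differentiable, non_differentiable = split_graph_outputs(outputs)

built_grads = {"aten_values_grad": "aten_values_grad"}  # only the float output has one
yield_grads = collect_yield_output_grads([name for name, _ in outputs], built_grads)
```

Here the int output lands in the non-differentiable list, and looking up its gradient yields nothing instead of a null entry that would be dereferenced later.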
2023-08-10 14:24:42 +08:00
.config Update tsaoptions.json: update the email alias (#13448) 2022-10-26 15:56:16 -07:00
.devcontainer Remove two lines in the Dockerfile for Github Codespace (#12278) 2022-07-21 20:52:17 -07:00
.gdn Update win-ci-pipeline.yml: enable xnnpack tests (#16244) 2023-06-14 19:12:42 -07:00
.github Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062) 2023-08-09 14:01:16 -07:00
.pipelines Workaround to upgrade VS2022 for Windows ARM build (#16826) 2023-07-25 08:35:52 +08:00
.vscode Broadcasting for SLN for CPU and CUDA (#16510) 2023-08-07 09:55:42 -07:00
cgmanifests [TensorRT EP] TRT 8.6 minor version update (#16475) 2023-06-26 10:44:27 -07:00
cmake [ROCm] update header and binary search paths used by cmake (#17083) 2023-08-10 11:05:21 +08:00
csharp Make AzureEP default for python and c# packaging (#17025) 2023-08-09 12:36:52 -07:00
dockerfiles Enable model subgraph execution in OVEP and setting the OpenVINO dll's to the path from the OpenVINO pypi packge in OVEP and fix OVEP windows io buffer sample (#16147) 2023-06-16 19:47:09 -07:00
docs Openvino ep ort 5.1 (#17042) 2023-08-09 11:50:10 -07:00
include/onnxruntime/core Add API for updating CUDA EP provider option user compute stream (#17037) 2023-08-09 09:24:19 -07:00
java [java] Relaxing CoreML test (#16777) 2023-08-09 11:43:05 -07:00
js Fix Resize op input check (#16594) 2023-08-09 15:42:30 -07:00
objectivec Objective-C Add Support to Create and Query String ORTValues (#16764) 2023-07-20 17:39:29 -07:00
onnxruntime GRU Training and GRU Gradient Kernels (#16929) 2023-08-09 21:24:47 -07:00
orttraining Check type for building gradient graph (#17046) 2023-08-10 14:24:42 +08:00
rust Add rust bindings (#12606) 2023-02-08 14:57:15 -08:00
samples Enable pylint and numpy rules (#15218) 2023-03-27 20:37:53 -07:00
swift/OnnxRuntimeBindingsTests Add iOS Swift Package Manager support (#15297) 2023-04-20 16:18:35 +10:00
tools Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062) 2023-08-09 14:01:16 -07:00
winml Format c++ code under winml/ (#16660) 2023-07-25 21:56:50 -07:00
.clang-format Run clang-format in CI (#15524) 2023-04-18 09:26:58 -07:00
.clang-tidy Create clang-tidy CI (#12653) 2022-09-30 08:05:38 -07:00
.dockerignore
.gitattributes
.gitignore remove 'lib/' from .gitignore (#15613) 2023-04-24 18:43:32 -07:00
.gitmodules Update eigen to 3.4 and remove the eigen from git submodule (#15875) 2023-05-11 11:56:59 -07:00
.lintrunner.toml Format c++ code under winml/ (#16660) 2023-07-25 21:56:50 -07:00
build.bat Upgrade old Python version in packaging pipeline (#16667) 2023-07-17 08:24:47 -07:00
build.sh Upgrade old Python version in packaging pipeline (#16667) 2023-07-17 08:24:47 -07:00
CITATION.cff Fix CITATION.cff and add automatic validation of your citation metadata (#10478) 2022-04-13 10:03:52 -07:00
CODEOWNERS Add owners for public facing API files (#15288) 2023-03-30 17:16:15 -07:00
CONTRIBUTING.md Fix link to High Level Design (#11786) 2023-02-28 11:05:54 -08:00
lgtm.yml Fix lgtm C++ error (#13613) 2022-11-10 10:06:22 -08:00
LICENSE
NuGet.config
ort.wprp
ORT_icon_for_light_bg.png
Package.swift Objective-C Add Support to Create and Query String ORTValues (#16764) 2023-07-20 17:39:29 -07:00
packages.config [DML EP] Update DirectML version to 1.12.0 (#16011) 2023-05-18 19:37:12 -07:00
pyproject.toml Updating QDQ to support Float8E4M3FN (#16550) 2023-08-08 12:18:48 +02:00
README.md add third-party pipeline status to README.md (#16155) 2023-05-31 22:14:39 -07:00
requirements-dev.txt Remove codecov from requirements-dev.txt (#15487) 2023-04-12 18:48:02 -07:00
requirements-doc.txt
requirements-lintrunner.txt [Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789) 2023-07-21 12:53:41 -07:00
requirements-training.txt Remove protobuf pin from training requirements (#13695) 2022-11-22 12:27:18 -08:00
requirements.txt.in Add additional python requirements (#11522) 2022-05-20 16:16:18 -07:00
SECURITY.md Microsoft mandatory file (#11619) 2022-05-25 13:56:10 -07:00
setup.py Add mac and windows python packages for onnxruntime-training (#16993) 2023-08-07 20:32:55 -07:00
ThirdPartyNotices.txt Support SmoothQuant for ORT static quantization (#16288) 2023-07-26 18:56:45 -07:00
VERSION_NUMBER Update VERSION_NUMBER (#15773) 2023-05-03 15:07:34 -07:00

ONNX Runtime is a cross-platform inference and training machine-learning accelerator.

ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms. Learn more →

ONNX Runtime training can accelerate the model training time on multi-node NVIDIA GPUs for transformer models with a one-line addition for existing PyTorch training scripts. Learn more →

Get Started & Resources

Builtin Pipeline Status

[Build status badges by system (Windows, Linux, Mac, Android, iOS, Web, Other), with Inference and Training columns]

Third-party Pipeline Status

[Build status badge by system (Linux), with Inference and Training columns]

Data/Telemetry

Windows distributions of this project may collect usage data and send it to Microsoft to help improve our products and services. See the privacy statement for more details.

Contributions and Feedback

We welcome contributions! Please see the contribution guidelines.

For feature requests or bug reports, please file a GitHub Issue.

For general discussion or questions, please use GitHub Discussions.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

License

This project is licensed under the MIT License.