saymrwulf/onnxruntime: ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-25 19:48:11 +00:00

pengwa 0471f6fbb3 Check type for building gradient graph (#17046)

### Check type for building gradient graph

Bug1: Running a model with ORTModule + Stage 3 fails with the following error:

```
Exception happens when running <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
    wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients
```

This happens because, when running PythonOpA, the 3rd input is int64. The check in the gradient builder decides that it requires a gradient, so we set its requires_grad = True, which PyTorch rejects with the exception above.

So we need to understand why the ORT gradient builder thinks the 3rd input needs a gradient. `ReverseBFSWithStopGradient` does a reverse BFS from the graph outputs and collects all nodes that are needed to compute them. It defines a queue that initially holds all nodes that generate graph outputs, then visits the nodes one by one and checks each node's inputs. If an input did not hit a stop edge and its node arg type is an allowed type (float, etc.), the node producing that input is appended to the queue for a later iteration. PythonOpA is a node that is needed to compute graph outputs, so IsReachable(PythonOpA) returns True.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/c4c53fb9-15f7-4e8d-9aa2-7fc20555a001)

In the code snippet above, when node is PythonOpB and next_node is PythonOpA, we did not check the node arg type on the connection from PythonOpB's outputs to PythonOpA's 3rd input. As a result, the int64-typed node arg was appended to the set of node args that require gradient.

Fix1: Add the node arg type check before appending a node arg to the require-grad lists.

After this fix, a unit test failed: "orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min] Fatal Python error: Segmentation fault". Investigation showed that this is a separate bug.

Bug2: Without Fix1, the execution graph looks like this:

![image](https://github.com/microsoft/onnxruntime/assets/10530022/b2fd4b03-95c7-414a-b268-2ba6a7300105)

As you can see, a gradient edge is built for the int64 node arg even though no consumer uses it, and execution runs fine. On reflection, though, an int type should not have a gradient edge built at all.

With Fix1, the execution graph looks like this:

![image](https://github.com/microsoft/onnxruntime/assets/10530022/1870d3cc-2fe5-4aa7-ad6b-0d88dcc40f8a)

The int-typed node arg no longer has a gradient edge built, which is the problem Fix1 addresses. But another bug appears through the initial "y_node_arg_names", in this case ATen's two outputs: the 1st is float and the 2nd is int. When we check the y_node (`6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)`), we did not check the data type before adding it to `y_node_args_`, the list of graph output node args that require gradient. Consequently, `non_differentiable_y_node_arg_names_` did not contain the int-typed graph output. Then `6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)` tries to put the grad node arg into `yield_output_node_args`, but with Fix1 no grad node arg is built for the int-typed node arg. So a nullptr is inserted, and using it later causes a segmentation fault.

Fix2: Add the same type check when handling y_node_args, and add a null check when getting the gradient node arg and appending it to yield_output_node_args.

2023-08-10 14:24:42 +08:00
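The traversal described in the commit message above can be illustrated with a small, self-contained Python sketch. This is not the code from `gradient_graph_builder.cc`: the `Node` class, the `GRAD_ALLOWED_TYPES` allow-list, the `(consumer name, input index)` encoding of stop edges, and the toy graph are all assumptions made for illustration. It only shows the idea behind Fix1 and Fix2: the node arg type is checked on every edge, and on the graph outputs themselves, before anything is marked as requiring a gradient.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical allow-list; the real builder keeps its own set of types.
GRAD_ALLOWED_TYPES = {"float16", "float32", "float64"}


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)   # node arg names consumed
    outputs: list = field(default_factory=list)  # node arg names produced


def reverse_bfs_with_type_check(nodes, arg_types, graph_outputs,
                                stop_edges=frozenset()):
    """Walk backwards from the graph outputs.

    Returns (reachable node names, node arg names that need a gradient).
    stop_edges holds (consumer node name, input index) pairs to skip.
    """
    producer = {out: n for n in nodes for out in n.outputs}
    reachable, needs_grad = set(), set()
    queue = deque()

    # Seed the queue, skipping non-differentiable graph outputs
    # (the analogue of the y_node_args type check in Fix2).
    for out in graph_outputs:
        if arg_types[out] not in GRAD_ALLOWED_TYPES:
            continue
        needs_grad.add(out)
        node = producer[out]
        if node.name not in reachable:
            reachable.add(node.name)
            queue.append(node)

    while queue:
        node = queue.popleft()
        for index, arg in enumerate(node.inputs):
            if (node.name, index) in stop_edges:
                continue
            # The analogue of Fix1: check the type on this edge
            # before marking the arg as requiring a gradient.
            if arg_types[arg] not in GRAD_ALLOWED_TYPES:
                continue
            needs_grad.add(arg)
            source = producer.get(arg)  # None for graph inputs
            if source is not None and source.name not in reachable:
                reachable.add(source.name)
                queue.append(source)
    return reachable, needs_grad


# Toy graph: PythonOpB feeds a float and an int64 tensor into PythonOpA.
nodes = [
    Node("PythonOpB", inputs=["x"], outputs=["b_float", "b_int"]),
    Node("PythonOpA", inputs=["b_float", "w", "b_int"], outputs=["y"]),
]
arg_types = {"x": "float32", "w": "float32", "b_float": "float32",
             "b_int": "int64", "y": "float32"}

reachable, needs_grad = reverse_bfs_with_type_check(nodes, arg_types, ["y"])
print(sorted(reachable))
print(sorted(needs_grad))
```

In this toy graph both nodes are reachable, while the int64 arg `b_int` (PythonOpA's 3rd input) is left out of the require-grad set, which is the behavior the commit aims for.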
.config	Update tsaoptions.json: update the email alias (#13448 )	2022-10-26 15:56:16 -07:00
.devcontainer	Remove two lines in the Dockerfile for Github Codespace (#12278 )	2022-07-21 20:52:17 -07:00
.gdn	Update win-ci-pipeline.yml: enable xnnpack tests (#16244 )	2023-06-14 19:12:42 -07:00
.github	Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062 )	2023-08-09 14:01:16 -07:00
.pipelines	Workaround to upgrade VS2022 for Windows ARM build (#16826 )	2023-07-25 08:35:52 +08:00
.vscode	Broadcasting for SLN for CPU and CUDA (#16510 )	2023-08-07 09:55:42 -07:00
cgmanifests	[TensorRT EP] TRT 8.6 minor version update (#16475 )	2023-06-26 10:44:27 -07:00
cmake	[ROCm] update header and binary search paths used by cmake (#17083 )	2023-08-10 11:05:21 +08:00
csharp	Make AzureEP default for python and c# packaging (#17025 )	2023-08-09 12:36:52 -07:00
dockerfiles	Enable model subgraph execution in OVEP and setting the OpenVINO dll's to the path from the OpenVINO pypi packge in OVEP and fix OVEP windows io buffer sample (#16147 )	2023-06-16 19:47:09 -07:00
docs	Openvino ep ort 5.1 (#17042 )	2023-08-09 11:50:10 -07:00
include/onnxruntime/core	Add API for updating CUDA EP provider option user compute stream (#17037 )	2023-08-09 09:24:19 -07:00
java	[java] Relaxing CoreML test (#16777 )	2023-08-09 11:43:05 -07:00
js	Fix Resize op input check (#16594 )	2023-08-09 15:42:30 -07:00
objectivec	Objective-C Add Support to Create and Query String ORTValues (#16764 )	2023-07-20 17:39:29 -07:00
onnxruntime	GRU Training and GRU Gradient Kernels (#16929 )	2023-08-09 21:24:47 -07:00
orttraining	Check type for building gradient graph (#17046 )	2023-08-10 14:24:42 +08:00
rust	Add rust bindings (#12606 )	2023-02-08 14:57:15 -08:00
samples	Enable pylint and numpy rules (#15218 )	2023-03-27 20:37:53 -07:00
swift/OnnxRuntimeBindingsTests	Add iOS Swift Package Manager support (#15297 )	2023-04-20 16:18:35 +10:00
tools	Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062 )	2023-08-09 14:01:16 -07:00
winml	Format c++ code under `winml/` (#16660 )	2023-07-25 21:56:50 -07:00
.clang-format	Run clang-format in CI (#15524 )	2023-04-18 09:26:58 -07:00
.clang-tidy	Create clang-tidy CI (#12653 )	2022-09-30 08:05:38 -07:00
.dockerignore	Update dockerfiles (#5929 )	2020-11-25 15:38:22 -08:00
.gitattributes
.gitignore	remove 'lib/' from .gitignore (#15613 )	2023-04-24 18:43:32 -07:00
.gitmodules	Update eigen to 3.4 and remove the eigen from git submodule (#15875 )	2023-05-11 11:56:59 -07:00
.lintrunner.toml	Format c++ code under `winml/` (#16660 )	2023-07-25 21:56:50 -07:00
build.bat	Upgrade old Python version in packaging pipeline (#16667 )	2023-07-17 08:24:47 -07:00
build.sh	Upgrade old Python version in packaging pipeline (#16667 )	2023-07-17 08:24:47 -07:00
CITATION.cff	Fix CITATION.cff and add automatic validation of your citation metadata (#10478 )	2022-04-13 10:03:52 -07:00
CODEOWNERS	Add owners for public facing API files (#15288 )	2023-03-30 17:16:15 -07:00
CONTRIBUTING.md	Fix link to High Level Design (#11786 )	2023-02-28 11:05:54 -08:00
lgtm.yml	Fix lgtm C++ error (#13613 )	2022-11-10 10:06:22 -08:00
LICENSE	Remove year from license (#6658 )	2021-02-12 00:25:56 -08:00
NuGet.config
ort.wprp
ORT_icon_for_light_bg.png
Package.swift	Objective-C Add Support to Create and Query String ORTValues (#16764 )	2023-07-20 17:39:29 -07:00
packages.config	[DML EP] Update DirectML version to 1.12.0 (#16011 )	2023-05-18 19:37:12 -07:00
pyproject.toml	Updating QDQ to support Float8E4M3FN (#16550 )	2023-08-08 12:18:48 +02:00
README.md	add third-party pipeline status to README.md (#16155 )	2023-05-31 22:14:39 -07:00
requirements-dev.txt	Remove codecov from requirements-dev.txt (#15487 )	2023-04-12 18:48:02 -07:00
requirements-doc.txt
requirements-lintrunner.txt	[Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789 )	2023-07-21 12:53:41 -07:00
requirements-training.txt	Remove protobuf pin from training requirements (#13695 )	2022-11-22 12:27:18 -08:00
requirements.txt.in	Add additional python requirements (#11522 )	2022-05-20 16:16:18 -07:00
SECURITY.md	Microsoft mandatory file (#11619 )	2022-05-25 13:56:10 -07:00
setup.py	Add mac and windows python packages for onnxruntime-training (#16993 )	2023-08-07 20:32:55 -07:00
ThirdPartyNotices.txt	Support SmoothQuant for ORT static quantization (#16288 )	2023-07-26 18:56:45 -07:00
VERSION_NUMBER	Update VERSION_NUMBER (#15773 )	2023-05-03 15:07:34 -07:00

README.md

ONNX Runtime is a cross-platform inference and training machine-learning accelerator.

ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and improves performance by using hardware accelerators where applicable, alongside graph optimizations and transforms. Learn more →

ONNX Runtime training can reduce model training time for transformer models on multi-node NVIDIA GPUs with a one-line addition to existing PyTorch training scripts. Learn more →

Get Started & Resources

General Information: onnxruntime.ai
Usage documentation and tutorials: onnxruntime.ai/docs
YouTube video tutorials: youtube.com/@ONNXRuntime
Upcoming Release Roadmap
Companion sample repositories:
- ONNX Runtime Inferencing: microsoft/onnxruntime-inference-examples
- ONNX Runtime Training: microsoft/onnxruntime-training-examples

Builtin Pipeline Status

System	Inference	Training
Windows
Linux
Mac
Android
iOS
Web
Other

Third-party Pipeline Status

System	Inference	Training
Linux

Data/Telemetry

Windows distributions of this project may collect usage data and send it to Microsoft to help improve our products and services. See the privacy statement for more details.

Contributions and Feedback

We welcome contributions! Please see the contribution guidelines.

For feature requests or bug reports, please file a GitHub Issue.

For general discussion or questions, please use GitHub Discussions.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

License

This project is licensed under the MIT License.