### Description
1. Enable VCPKG flag in Windows CPU CI build pipelines.
2. Increased the minimum supported CMake version from 3.26 to 3.28. As a
result, dropped support for the old way of finding Python via
"find_package(PythonLibs)"; therefore, build.py no longer sets the
"PYTHON_EXECUTABLE" CMake variable when running the CMake configure step.
3. Added "xnnpack-ep" as a feature for ORT's vcpkg config.
4. Added asset cache support for ORT's vcpkg build
5. Added VCPKG triplet files for Android build.
6. Set VCPKG triplet to "universal2-osx" if CMAKE_OSX_ARCHITECTURES was
found in cmake extra defines.
7. Removed a small piece of code in build.py that supported CUDA
versions < 11.8.
8. Fixed an issue that CMAKE_OSX_ARCHITECTURES sometimes got specified
twice when build.py invoked cmake.
9. Added more model tests to Android build. After this change, we will
test all ONNX versions instead of just the latest one.
10. Fixed issues related to build.py's "--build_nuget" parameter. Also,
enabled the flag in most Windows CPU CI build jobs.
11. Removed a restriction in build.py that disallowed cross-compiling
Windows ARM64 nuget package on Windows x86.
### Motivation and Context
Adopt vcpkg.
### Description
This PR will enable python dlpack interface by default.
### Motivation and Context
The dlpack Python interface is useful in inference mode, not only in
training mode, since some inference result processing may be written in
torch and unnecessary device transfers should be avoided in those cases.
closes https://github.com/microsoft/onnxruntime/issues/15963 closes
https://github.com/microsoft/onnxruntime/issues/22061
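A hedged sketch of the zero-copy hand-off this enables; the `to_dlpack()`
method on `OrtValue` is assumed to be what this PR exposes by default:
```python
import torch
import onnxruntime as ort
from torch.utils.dlpack import from_dlpack

# Run a model on CUDA and keep the output on the device as an OrtValue.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
io_binding = sess.io_binding()
# ... bind inputs/outputs on the CUDA device and call sess.run_with_iobinding(io_binding) ...
ort_value = io_binding.get_outputs()[0]             # OrtValue on the CUDA device

# Hand the buffer to torch for post-processing without a host round-trip
# (assumes the dlpack interface enabled by this PR).
torch_tensor = from_dlpack(ort_value.to_dlpack())
```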
TODOs:
- [x] Add tests like
5407c69028/orttraining/orttraining/test/python/orttraining_test_ortvalue.py
that are unrelated to the training feature
---------
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
### Description
- Makes QNN EP a shared library **by default** when building with
`--use_qnn` or `--use_qnn shared_lib`. Generates the following build
artifacts:
- **Windows**: `onnxruntime_providers_qnn.dll` and
`onnxruntime_providers_shared.dll`
- **Linux**: `libonnxruntime_providers_qnn.so` and
`libonnxruntime_providers_shared.so`
- **Android**: Not supported. Must build QNN EP as a static library.
- Allows QNN EP to still be built as a static library with `--use_qnn
static_lib`. This is primarily for the Android QNN AAR package.
- Unit tests run for both the static and shared QNN EP builds.
### Detailed changes
- Updates Java bindings to support both shared and static QNN EP builds.
- Provider bridge API:
- Adds logging sink ETW to the provider bridge. Allows EPs to register
ETW callbacks for ORT logging.
- Adds a variety of methods for onnxruntime objects that are needed by
QNN EP.
- QNN EP:
- Adds `ort_api.h` and `ort_api.cc`, which encapsulate the API provided
by ORT in a manner that allows the EP to be built as either a shared or
static library.
- Adds a custom function to transpose weights for Conv and Gemm (instead
of adding a util to the provider bridge API).
- Adds a custom function to quantize data for LeakyRelu (instead of
adding a util to the provider bridge API).
- Adds custom ETW tracing for QNN profiling events:
- shared library: defines its own TraceLogging provider handle
- static library: uses ORT's TraceLogging provider handle and existing
telemetry provider.
- ORT-QNN Packages:
- **Python**: Pipelines build QNN EP as a shared library by default.
Users can build a local Python wheel with QNN EP as a static library by
passing `--use_qnn static_lib`. (See the usage sketch after this list.)
- **NuGet**: Pipelines build QNN EP as a shared library by default.
`build.py` currently enforces QNN EP to be built as a shared library.
Can add support for building a QNN NuGet package with static later if
deemed necessary.
- **Android**: Pipelines build QNN EP as a **static library**.
`build.py` enforces QNN EP to be built as a static library. Packaging
multiple shared libraries into an Android AAR package is not currently
supported due to the added need to also distribute a shared libcpp.so
library.
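A hedged usage sketch (not part of this PR) showing that loading the QNN
EP from the Python wheel looks the same whether the EP was built as a
shared or static library; the `backend_path` value is an assumption for
the Windows HTP backend:
```python
import onnxruntime as ort

# Select the QNN EP; the shared-vs-static EP build is transparent at this level.
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),  # assumed backend library name
        "CPUExecutionProvider",
    ],
)
print(sess.get_providers())
```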
### Motivation and Context
### Description
- Add split/pad/neg/not/ceil/round/min/max op support
- Fix conv2d op default pads value issue
- Add VSINPU EP to support python bindings
### Motivation and Context
- New OPs support for VSINPU EP
---------
Signed-off-by: Kee <xuke537@hotmail.com>
### Description
Refactor the CMake code related to delay loading. Provide a CMake option
to control whether delay loading should be enabled. The option is
disabled when Python is enabled, due to a known issue.
### Motivation and Context
ONNX Runtime's Python package depends on DirectML.dll, but the DLL is
supposed to be delay loaded.
This PR only refactors the code; it doesn't change the behavior.
### Description
This change introduces the WebGPU EP into ONNX Runtime.
To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, the WebGPU EP
is a stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+), which hopefully makes it easier to review.
There will be separate PRs for each of the items mentioned above.
Current working branch: #21904
### Description
Enabling python binding and gcc support for AIX.
### Motivation and Context
Code changes in this PR contain:
1. python binding enablement
2. gcc building support
Below is the list of files and their descriptions.
1. cmake/CMakeLists.txt
[gcc building support] add the -no-unused-function compiler flag for
IBMClang
2. cmake/external/eigen.cmake
[gcc building support] AIX check for applying the AIX patch
3. cmake/onnxruntime_python.cmake
[python binding] add a NOT AIX check for -Xlinker
4. cmake/onnxruntime_unittests.cmake
[gcc building support] fix for gtest behavior (see the comment).
[python binding] use -Wl,-brtl for linking
onnxruntime_providers_shared in test_execution_provider
5. cmake/patches/eigen/eigen-aix.patch
[gcc building support] With AIX gcc we hit
__builtin_cpu_supports("mma"), which is not supported yet, so the code
for this method is patched. The patched code checks for a P10 processor
at run time and selects the routine accordingly.
6. onnxruntime/python/onnxruntime_validation.py
[python binding] add an AIX check in check_distro_info()
7. onnxruntime/test/providers/cpu/generator/random_test.cc
[gcc building support] update the previous AIX check to also cover
clang, so that in the gcc case the else block is taken.
8. onnxruntime/test/python/onnxruntime_test_python.py
[python binding] powerpc check on platform.processor()
9. setup.py
[python binding] add an AIX check for the list of libs.
### Description
Added cuDNN Frontend and used it for NHWC convolutions, optionally
fusing the activation.
#### Backward compatible
- Models that already contain FusedConv can still run.
- If ORT is built with cuDNN 8, the cuDNN frontend will not be built into
the binary; the old kernels (using cuDNN backend APIs) are used.
#### Major Changes
- For cuDNN 9, we enable the cuDNN frontend to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set (see the sketch
after this list).
- Remove the FusedConv fusion from the graph transformer for the CUDA
provider, so FusedConv will no longer be added to the graph for the CUDA
EP in the future.
- Update cmake files regarding cuDNN settings. The search order for the
cuDNN installation during build is as follows:
* environment variable `CUDNN_PATH`
* the `onnxruntime_CUDNN_HOME` cmake extra define. If a build starts from
build.py/build.sh, the user can pass it through the `--cudnn_home`
parameter, or via the environment variable `CUDNN_HOME` if
`--cudnn_home` is not used.
* cudnn python package installation directory like
python3.xx/site-packages/nvidia/cudnn
* CUDA installation path
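A minimal sketch (not taken from this PR) of enabling the new fusion and
the NHWC path through CUDA execution provider options; the option names
follow the description above:
```python
import onnxruntime as ort

cuda_options = {
    "prefer_nhwc": "1",     # use NHWC convolution layout when supported
    "fuse_conv_bias": "1",  # let the cuDNN frontend fuse Conv + bias (cuDNN 9 builds)
}
sess = ort.InferenceSession(
    "resnet50.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```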
#### Potential Issues
- If ORT is built with cuDNN 8, the FusedConv fusion is no longer done
automatically, so some models might see a performance regression. Users
who still want the FusedConv operator for performance reasons have
several workarounds: use an older version of onnxruntime, or use an
older version of ORT to save the optimized onnx model and then run it
with the latest version of ORT. We believe the majority of users will
have moved to cuDNN 9 by the 1.20 release (since cuDNN 9 has been the
default in ORT and PyTorch for 3 months by then), so the impact is
small.
- The cuDNN graph uses TF32 by default, and the user cannot disable TF32
through the use_tf32 cuda provider option. If a user encounters an
accuracy issue (like in testing), they have to set the environment
variable `NVIDIA_TF32_OVERRIDE=0` to disable TF32. The documentation of
use_tf32 needs to be updated later.
#### Follow ups
This is one of several PRs that aim to enable NHWC convolution in the
CUDA EP by default when the device supports it. Other changes will
follow to make it possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70.
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
* Fix fallback setting (cuda still falls back to cuda).
* Fix cuda provider fallback being inconsistent with/without the
CUDA_PATH environment variable.
* Add cuda and cudnn major version requirement in error message.
Example result in Windows:
```
>>> import onnxruntime
>>> ort_session = onnxruntime.InferenceSession("model.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
2024-07-19 17:43:44.2260019 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 onnxruntime::TryGetProviderInfo_CUDA] D:\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1636 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\.conda\envs\py310\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"
2024-07-19 17:43:44.2312351 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:970 onnxruntime::python::CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*, and the latest MSVC runtime. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
>>> ort_session
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x0000016BB2DF7D60>
>>> ort_session.get_providers()
['CPUExecutionProvider']
```
Example result in Linux:
```
>>> import onnxruntime
>>> ort_session = onnxruntime.InferenceSession("resnet50-v2-7.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
2024-07-20 20:33:26.486974543 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 TryGetProviderInfo_CUDA] /work/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.12: cannot open shared object file: No such file or directory
2024-07-20 20:33:26.487034646 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:961 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
>>> ort_session.get_providers()
['CPUExecutionProvider']
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/21424
### Description
As suggested by SciPy's docs, we will:
> Build against NumPy 2.0.0, then it will work for all NumPy versions
> with the same major version number (NumPy does maintain backwards ABI
> compatibility), and as far back as NumPy 1.19 series at the time of
> writing.
I think it works because in
[numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64)
there is a macro NPY_FEATURE_VERSION, which is set to
NPY_1_19_API_VERSION by default and controls the ABI.
This PR only upgrades the build-time dependency; when a user installs
ONNX Runtime, they can still use numpy 1.x.
### Motivation and Context
Recently numpy published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.
### Description
Remove the "--enable_language_interop_ops" build flag, because the code
is incompatible with the latest numpy, and the build flag is not used
anywhere except a macOS CI pipeline. It does not seem to have a ship
plan.
### Motivation and Context
The build error was:
```
onnxruntime/core/language_interop_ops/pyop/pyop.cc:122:85: error: no member named 'elsize' in '_PyArray_Descr'
static_cast<int64_t>(PyArray_DescrFromType(type)->elsize),
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
```
### Description
<!-- Describe your changes. -->
- Add check for CoreML MLProgram supported ops
- Only check usability with ORT Mobile package if requested
  - this package will be deprecated so info is a) of minimal value and
b) can be confusing.
- Output more things at INFO level
  - a lot of meaningful info was only output at DEBUG level. The default
INFO level is more useful
  - dump full partition info at DEBUG level
- Check subgraphs fully
  - CoreML can handle a subgraph
  - TBD if we want to add support for adding a subgraph to the parent
graph for Loop and If nodes
    - most likely will be required for simple If nodes to be performant
- Check 5D CoreML limitation
### Motivation and Context
Improve helper tools
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
This PR supports building onnxruntime.xcframework for xros/xrsimulator
(visionOS) via the build command
`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.
Officially including visionOS in the iOS CocoaPods package and testing
it in CI would require separate work to upgrade the Xcode version and
upgrade the macOS CI agent to macos-13-arm64 or higher.
### Motivation and Context
visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
Introducing a new class ORTPipelineModule to handle wrapping layers in
DeepSpeed pipeline parallel.
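A hedged usage sketch; the import path and constructor arguments below
are assumptions that mirror DeepSpeed's own PipelineModule API, only the
class name ORTPipelineModule comes from this PR:
```python
import torch.nn as nn
# Assumed import path for the new class introduced by this PR.
from onnxruntime.training.ortmodule.experimental.pipe import ORTPipelineModule

# Wrap plain sequential nn.Module layers so each pipeline stage runs through ORT.
layers = [nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10)]
model = ORTPipelineModule(layers=layers, num_stages=2, loss_fn=nn.CrossEntropyLoss())
# The wrapped model is then handed to deepspeed.initialize(...) like a regular PipelineModule.
```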
### Motivation and Context
To support pipeline parallelism on ORTModule.
This PR includes initial support for DeepSpeed pipeline parallelism.
- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
Copy QNN deps when building the Python bindings as well.
Tweak the wildcard to only copy QNN-related files; the latest SDK from
Qualcomm (>= 2.21) also includes SNPE DLLs, which we don't want to
include.
### Description
Answers issue #19640.
More details are in the issue; basically, I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets.
### Description
This PR adds an ONNX conversion script for Dynamo-exported phi2, an
optimization script, and an inference example script.
A readme file is added as documentation:
https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme
### Motivation and Context
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Refactor the VAIEP to use MSFT's standalone API
### Motivation and Context
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
### Description
- Adds graph fusions to preprocessing step that can be called before
creating a QDQ model for QNN EP.
- Fuse Erf sequence to Gelu (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)).
Required by QNN EP.
- Fuse ReduceMean sequence to LayerNormalization (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)).
Not required by QNN EP.
- Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP).
Required by QNN EP.
Example use:
```python3
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model
# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"
# Quantize model ...
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context
Allow more models to be quantized for use with QNN EP
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
per-channel quantization, the list contains a dictionary for each channel in the tensor.
Each dictionary contains optional overrides with the following keys and values.
'quant_type' = QuantType : The tensor's quantization data type.
'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set.
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also
set `scale` or `zero_point`.
'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also
set `scale` or `zero_point`.
'rmax' = Float : Override the maximum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
'rmin' = Float : Override the minimum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
```
- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are invalid if `zero_point` and `scale` are
also specified.
Example for per-tensor quantization overrides:
```Python3
extra_options = {
"TensorQuantOverrides": {
"SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
"WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
"BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
},
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
"TensorQuantOverrides": {
"WGT": [
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.0,
"rmax": 2.5,
"reduce_range": True,
},
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.2,
"rmax": 2.55,
"reduce_range": False,
},
],
"BIAS": [
{"zero_point": 0, "scale": 0.000621},
{"zero_point": 0, "scale": 0.23},
],
},
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.
Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config
qnn_config = get_qnn_qdq_config(input_model_path,
data_reader,
activation_type=QuantType.QUInt16,
weight_type=QuantType.QUInt8)
quantize(input_model_path,
output_model_path,
qnn_config)
```
### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
Motivation for this PR is code cleanup.
1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton Flash attention, which requires
triton==2.0.0.dev20221202. Need A100 or H100.
ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A Python transformer tool to match sub-graphs by config and write
transformers quickly.
The current transformers support the attention mask for both efficient
attn and flash attn, and dropout for efficient attn only. To support
more training scenarios (such as the causal mask in GPT2), more
transformers need to be added.
The feature is guarded by system environment variables; it won't affect
any current behavior if not enabled (a minimal sketch of enabling it is
shown below). Since it requires specific PyTorch/Triton versions,
related tests are not added for now.
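A minimal sketch (not from this PR) of turning the feature on through
the environment variables described above before wrapping a model with
ORTModule; the model here is a placeholder:
```python
import os
# Requires PyTorch 2.2.0 dev or newer for the ATen efficient-attention path.
os.environ["ORTMODULE_USE_EFFICIENT_ATTENTION"] = "1"
# os.environ["ORTMODULE_USE_FLASH_ATTENTION"] = "1"  # Triton flash attention; needs A100/H100

import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4)  # placeholder model
model = ORTModule(model)  # eligible attention sub-graphs may be rewritten when the flags are set
out = model(torch.randn(8, 2, 64))
```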
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)
### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
### Motivation and Context
When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common to flatten structured data into a 1-D tensor list
(required by libraries such as torch.onnx.export,
torch.autograd.Function.forward, or an ORT inference session), do the
subsequent work, then unflatten back to the original hierarchy for the
returned values.
The DeepSpeed stage 3 hooks support work also needs such a library to do
similar things, so I am proposing to extract this pair of APIs into
training/utils/, where they can be used more generally. A comprehensive
set of test data is also used for testing unflatten/flatten in unit
tests.
Let me know if you have any other suggestions.
### Refactor schema extraction and output unflattening
Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (the deepspeed
stage 3 hook rewrite). A sketch of the new APIs is shown below.
There is still some duplicated logic that handles flattening for
different tasks by recursively looping over the data structure; it will
be changed step by step to avoid heavy review effort.
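A hedged sketch of the extracted helpers; the function names come from
this PR, while the top-level import path is an assumption:
```python
import torch
# Assumed re-export from training/utils/; the functions live in torch_io_helper.py.
from onnxruntime.training.utils import extract_data_and_schema, unflatten_data_using_schema

data = {"input_ids": torch.ones(2, 3), "extras": [torch.zeros(4), torch.arange(5)]}

# Flatten the nested structure into a 1-D tensor list plus a schema describing the hierarchy.
flat_tensors, schema = extract_data_and_schema(data)

# ... pass flat_tensors to torch.onnx.export / autograd.Function / an ORT session ...

# Rebuild the original nested structure from the flat tensor list.
restored = unflatten_data_using_schema(flat_tensors, schema)
```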
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
### Description
This PR enables execution of subgraphs in OVEP. Currently, when OVEP
developers install the onnxruntime-openvino package on Windows from
PyPI, they have to additionally download the OpenVINO Windows binaries
and run the setupvars.bat script, which sets the environment PATH so the
OpenVINO DLLs can be located. This PR also fixes issues in the OVEP
Windows IO buffer sample.
### Motivation and Context
Fix: We want to make the user experience easy for OVEP Python developers
on the Windows platform.
This fix introduces a function add_openvino_libs_to_path at the location
tools/python/util/add_openvino_win_libs.py.
The function can be called by OVEP Python users in their application
code, and it takes care of adding the OpenVINO DLLs from the installed
OpenVINO PyPI package (openvino) to the path.
This change also makes sure that the add_openvino_libs_to_path()
function is added to the onnxruntime Python package only when it is
built for the OpenVINO Execution Provider for ONNX Runtime, and not for
default ORT Python package builds.
New user experience for Python OVEP developers on the Windows platform:
1. `pip install onnxruntime-openvino`
2. `pip install openvino`
3. Add these 2 lines in the application code:
   `import onnxruntime.tools.add_openvino_win_libs as utils`
   `utils.add_openvino_libs_to_path()`
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Description
The DML_PACKAGE_DIR CMake variable is not set properly when the dml_path
build option is used.
### Motivation and Context
It is required for DML Perf dashboard.
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
### Description
This change allows us to build the CUDA EP without installing the CUDA
SDK on Windows.
### Motivation and Context
Nvidia's CUDA installer comes with a VS extension. In the past, we
required installing the extension, which is a little inconvenient since:
1. Visual Studio must be installed before the CUDA SDK. CUDA's installer
will not install the extension if your machine doesn't have Visual
Studio.
2. We need to install the CUDA SDK on our build machines, instead of
just downloading it and using it.
After this change, we will not need to install the CUDA SDK on our build
machines, so it will be easier to add support for a different CUDA
version.
Also, fix two PreFast warnings.
### Description
While building ORT for the DML EP with the `dml_EXTERNAL_PROJECT` flag,
the values of 2 variables (`DML_SHARED_LIB`, `DML_PACKAGE_DIR`) are not
set properly.
### Motivation and Context
### Description
This PR adds the Whisper custom export scripts to the wheel.
### Motivation and Context
This enables access to the custom export scripts in the wheel.
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to
element-wise add
    - Separate present key-value output into present key and present
value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state
(for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did
not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence
length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or
present state (for decoder with past)
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline
import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/
directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'
# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
directory,
use_io_binding=(device == 'cuda'),
provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=(-1 if device == 'cpu' else 0),
)
# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
### Statistics tool for ORTModule convergence parity
As ORTModule gets more and more validated, it is pretty fast to
integrate a PyTorch-based model with ORT.
At the same time, we need to make sure that once there is a convergence
issue, we don't spend months investigating it. As part of this effort,
this PR introduces a tool to dump activation statistics without much
involvement from users. The dump results contain only some statistic
numbers plus sampled data, which is not big; compared with dumping all
the tensors, it is much faster and more space efficient.
To use it, only two lines are needed before wrapping with ORTModule. For
the baseline run, the same trick also needs to be applied.
```
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```
Once you run the steps, the following command can be used to merge the
results into per-step summaries for the ORT and baseline runs,
respectively.
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```
Docs are added as part of this PR: [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md).
Based on the generated merged files, we can compare them with tools.
### Design and Implementation
This PR introduces a common mechanism for registering custom logic in
nn.Module's post-forward hooks. Statistics for activations
(StatisticsSubscriber) is one of the implementations. If there are other
needs, we can define another XXSubscriber to do the customized things.
### Description
This is a follow-up of
https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion
CUDA optimizations:
(1) use NchwConv to replace Conv in the onnx graph and add Transpose
nodes accordingly;
(2) reduce sequential Transpose nodes to at most one;
(3) symbolic shape inference of NchwConv;
(4) fix add bias transpose, which causes a CUDA error (launching more
than 1024 threads per block) when inferencing fp32 models;
(5) add models (bert, bart, stable_diffusion subdirectories) to the
package;
(6) remove the option --disable_channels_last
Note that
(1) We could add a few more graph transformations to reduce Transpose
nodes further. It is not done in this PR due to time limits.
(2) The Stable Diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue; however, it is much
slower to use float32 Attention.
### Motivation and Context
### Description
If a user installs the Python debug libraries on Windows, the ORT python
project file attempts to use the debug python lib, which conflicts with
a pragma in pyconfig.h that wants the release lib (due to pybind11
undefining _DEBUG).
Explicitly use the release lib instead of Python::Module so the build
doesn't break.
### Motivation and Context
Fix obtuse build break.