onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 17:15:29 +00:00

Author	SHA1	Message	Date
Yulong Wang	7a8fa12850	Add implementation of WebGPU EP (#22591 ) ### Description This PR adds the actual implementation of the WebGPU EP based on https://github.com/microsoft/onnxruntime/pull/22318. This change includes the following: <details> <summary><b>core framework of WebGPU EP</b></summary> - WebGPU EP factory classes for: - handling WebGPU options - creating WebGPU EP instance - creating WebGPU context - WebGPU Execution Provider classes - GPU Buffer allocator - data transfer - Buffer management classes - Buffer Manager - BufferCacheManager - DisabledCacheManager - SimpleCacheManager - LazyReleaseCacheManager - BucketCacheManager - Program classes - Program (base) - Program Cache Key - Program Manager - Shader helper classes - Shader Helper - ShaderIndicesHelper - ShaderVariableHelper - Utils - GPU Query based profiler - compute context - string utils - Miscs - Python binding webgpu support (basic) </details> <details> <summary><b>Kernel implementation</b></summary> - onnx.ai (default opset): - Elementwise (math): Abs, Neg, Floor, Ceil, Reciprocal, Sqrt, Exp, Erf, Log, Sin, Cos, Tan, Asin, Acos, Atan, Sinh, Cosh, Asinh, Acosh, Atanh, Tanh, Not, Cast - Elementwise (activation): Sigmoid, HardSigmoid, Clip, Elu, Relu, LeakyRelu, ThresholdedRelu, Gelu - Binary (math): Add, Sub, Mul, Div, Pow, Equal, Greater, GreaterOrEqual, Less, LessOrEqual - (Tensors): Shape, Reshape, Squeeze, Unsqueeze - Where - Transpose - Concat - Expand - Gather - Tile - Range - LayerNormalization - com.microsoft - FastGelu - MatMulNBits - MultiHeadAttention - RotaryEmbedding - SkipLayerNormalization - LayerNormalization - SimplifiedLayerNormalization - SkipSimplifiedLayerNormalization </details> <details> <summary><b>Build, test and CI pipeline integration</b></summary> - build works for Windows, macOS and iOS - support onnxruntime_test_all and python node test - added a new unit test for `--use_external_dawn` build flag. - updated MacOS pipeline to build with WebGPU support - added a new pipeline for WebGPU Windows </details> This change does not include: - Node.js binding support for WebGPU (will be a separate PR)	2024-10-29 18:29:40 -07:00
Indy Zhu	e2e837584f	[DML EP] Update DML to 1.15.4 (#22635 ) ### Description [DML EP] Update DML to 1.15.4 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> We want the customer to use the latest DirectML.	2024-10-29 17:13:57 -07:00
Changming Sun	88676e62b9	Remove nsync (#20413 ) ### Description 1. Remove the onnxruntime::OrtMutex class and replace it with ~absl::Mutex~ std::mutex. 2. After this change, most source files will not include <Windows.h> indirectly. ### Motivation and Context To reduce the number of deps we have, and address some Github issues that are related to build ONNX Runtime from source. In PR #3000 , I added a custom implementation of std::mutex . It was mainly because at that time std::mutex's default constructor was not trivial on Windows. If you had such a mutex as a global var, it could not be initialized at compile time. Then VC++ team fixed this issue. Therefore we don't need this custom implementation anymore. This PR also removes nsync. I ran several models tests on Linux. I didn't see any perf difference. This PR also reverts PR #21005 , which is no longer needed since conda has updated its msvc runtime DLL. This PR unblocks #22173 and resolves #22092 . We have a lot of open issues with nsync. This PR can resolve all of them.	2024-10-21 15:32:14 -07:00
Ted Themistokleous	572e43c5d7	[MIGraphX EP/ ROCm EP] add gfx1200, gfx1201 to CMAKE_HIP_ARCHITECTURES (#22348 ) ### Description Add additonal gfx targets for AMD GPU support ### Motivation and Context Required to integrate mainline onnxruntime support for AMD GPUs --------- Co-authored-by: Stefan Sokolovic <stsokolo@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2024-10-11 17:31:36 -07:00
Changming Sun	2bef89c171	Upgrade absl to the latest released version (#22365 ) ### Description Resolve #21976 . ABSL generally does not have forward/backward compatibility. Our code is only compatible with one fixed LTS version. So it's important to fix the version number there when using find_package to detect an installed version.	2024-10-09 20:21:40 -07:00
Yulong Wang	c5d28cac4d	Initial WebGPU EP checkin (#22318 ) ### Description This change introduces the WebGPU EP into ONNX Runtime. To make the PR as simple as possible, this PR excluded the following: - C API changes for WebGPU EP - actual implementation of WebGPU EP. Currently in this PR, WebGPU is a stub implementation that does not register any kernel. - Python IO Binding update - Node.js IO Binding update This PR now contains only 43 file changes (while the working branch contains 130+) and hopefully this makes it easier to review. There is going to be separated PRs for each mentioned above. Current working branch: #21904	2024-10-08 16:10:46 -07:00
Tianlei Wu	f3f33bfa05	Upgrade cutlass to 3.5.1 and cudnn frontend to 1.7.0 (#22316 ) ### Description Upgrade cutlass to 3.5.1 Upgrade cudnn_frontend to 1.7.0	2024-10-04 11:48:50 -07:00
Edward Chen	f1be92faf0	Patch fp16 to fix Xcode 16 builds with XNNPACK EP targeting x86_64. (#22294 )	2024-10-03 14:17:15 -07:00
Yufeng Li	96e9c99dce	remove neural-speed (#22236 ) ### Description <!-- Describe your changes. --> NS is not developed anymore and ORT doesn't use it for int4 inference either. Remove it to clean up the code ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-10-01 09:50:44 -07:00
Sumit Agarwal	529835cc46	[DML EP] Update DML to 1.15.2 (#22247 ) ### Description Update DML binary to the current latest redist version [1.15.2](https://www.nuget.org/packages/Microsoft.AI.DirectML/1.15.2).	2024-09-27 13:20:29 -07:00
Edward Chen	209ff86d52	Get build working on Xcode 16 (#22168 )	2024-09-24 08:33:03 -07:00
Yi Zhang	8d2d40781c	set CMAKE_SYSTEM_PROCESSOR in xnnpack.cmake (#22155 ) ### Description <!-- Describe your changes. --> ### Motivation and Context By default, CMAKE_SYSTEM_PROCESSOR is same CMAKE_HOST_SYSTEM_PROCESSOR https://cmake.org/cmake/help/latest/variable/CMAKE_SYSTEM_PROCESSOR.html KleidiAI uses CMAKE_SYSTEM_PROCESSOR to determine whether to include some arm64 ukernels. https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CMakeLists.txt#L134 We use Mac with Intel CPU to cross compile MAC with ARM in ios packaging pipeline So we need to make CMAKE_SYSTEM_PROCESSOR same with ORT_TARGET_PROCESSOR	2024-09-20 15:19:26 -07:00
Yi Zhang	b94ba09e4f	Upgrade XNNPACK to latest version (#22012 ) ### Description Update XNNPack to latest version (Sep 4) - Some op outputs are changed, channel or stride paras are moved into reshape func. e.g. `96962a602d` - input params of xnnpack's resize related function are changed a lot - KleidiAI is added as a dependency in ARM64 - The latest XNNPACK includes 2 static libs microkernels-prod and xnnpack. Without microkernels-prod, it throws the exception of Undefined symbols. - Add ORT_TARGET_PROCESSOR to get the real processor target in CMake	2024-09-17 10:12:16 -07:00
PARK DongHa	f633caa0b1	Create CMake option `onnxruntime_USE_VCPKG` (#21348 ) ### Changes 1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg port * Unit test may fail because this option leads to a mixture of unexpected external library versions. Especially ONNX, Protobuf, and Flatbuffers version can be different 2. Overhaul of `onnxruntime_external_deps.cmake` * Make `FetchContent_Declare` to try `find_package`. See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable`(or `onnxruntime_fetchcontent_makeavailable`) to closer lines. It was too hard to navigate the entire file to search related sections... * Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` --> `onnx`) ```cmake # The script uses `find_package` with the changes. # In this case, use vcpkg to search dependencies # See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html include(external/onnxruntime_external_deps.cmake) ``` 3. Create CMakePresets.json and presets to [run vcpkg in manifest mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode) * Currently, it's NOT for training build * Main triplets are `x64-windows` and `x64-osx` ```pwsh Push-Location "cmake" cmake --preset "x64-windows-vcpkg" cmake --build --preset "x64-windows-vcpkg-debug" Pop-Location ``` ```bash pushd "cmake" cmake --preset "x64-osx-vcpkg" cmake --build --preset "x64-osx-vcpkg-debug" popd ``` 4. Updated tools/ci_build/build.py * `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with [vcpkg.cmake toolchain script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake) * `--compile_no_warning_as_error` is recommended because library version differences will cause unexpected compiler warnings ```bash python ./tools/ci_build/build.py \ --compile_no_warning_as_error \ --use_vcpkg \ --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \ --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..." ``` 5. Created Job `Vcpkg` for Windows and macOS * Show how to setup and use vcpkg. Similar to the CMakePresets.json usage ### Motivation and Context * Help #7150 * Help https://github.com/microsoft/vcpkg/pull/36850 * https://github.com/luncliff/vcpkg-registry/pull/212 * https://github.com/microsoft/vcpkg/pull/39881 * https://github.com/luncliff/vcpkg-registry/pull/215 * https://github.com/luncliff/vcpkg-registry/pull/216 * https://github.com/luncliff/vcpkg-registry/pull/227 * https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake ### Future Works? More feature coverage with the vcpkg supported libraries * CUDA feature support * Training feature support	2024-09-10 16:39:27 -07:00
Ranjit Ranjan	02e3a430af	[AIX] Python binding enablement and gcc support (#21934 ) ### Description Enabling python binding and gcc support for AIX. ### Motivation and Context Code changes in this PR contains: 1. python binding enablement 2. gcc building support Below are list of files and the description. 1. cmake/CMakeLists.txt [gcc building support] -no-unused-function compiler flag addition for IBMClang 2. cmake/external/eigen.cmake [gcc building support] AIX check for applying the AIX patch 3. cmake/onnxruntime_python.cmake [python binding ] putting NOT AIX check for -Xlinker 4. cmake/onnxruntime_unittests.cmake [gcc building support] Fix for gtest behavior. Check the comment . [python binding ] using -Wl,-brtl for linking onnxruntime_providers_shared in test_execution_provider 5. cmake/patches/eigen/eigen-aix.patch [gcc building support] In AIX gcc, we are hitting __builtin_cpu_supports("mma") which is not supported yet. So patching code for this method . Patched code will check for P10 Processor at run-time and based on that routine will be set. 6. onnxruntime/python/onnxruntime_validation.py [python binding ] Adding AIX check in check_distro_info() 7. onnxruntime/test/providers/cpu/generator/random_test.cc [gcc building support] updating previous check for AIX , along with clang. So in case of gcc, else block will hit. 8. onnxruntime/test/python/onnxruntime_test_python.py [python binding ] powerpc check on platform.processor() 9. setup.py [python binding ] Adding AIX check for list of libs.	2024-08-30 12:17:26 -07:00
Guenther Schmuelling	ba7baae994	Revert "Upgrade emsdk from 3.1.59 to 3.1.62" (#21817 ) Reverts microsoft/onnxruntime#21421 Users are seeing chrome memory grow to 16GB before it crashes: https://github.com/microsoft/onnxruntime/issues/21810 Revert for now so we have time to debug.	2024-08-22 11:21:00 -07:00
Tianlei Wu	fbc3927231	[CUDA] cuDNN Flash Attention (#21629 ) ### Description - [x] Add cuDNN flash attention using cudnn frontend, and enable it in MultiHeadAttention operator. - [x] Support attention mask. - [x] Support attention bias. - [x] Update tests and benchmark script. The cuDNN SDPA is disabled by default. To enable it, need the following: (1) Requires cuDNN 9.3 or newer version installed. (2) Set an environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or set `sdpa_kernel=8` cuda provider option to enable it. (3) Only works on devices with compute capability >= 8.0. Note that some combinations of parameters might be rejected due to limited support of head dimension or sequence lengths. Future Works: (1) FP8 and BF16 APIs. Currently, only API for FP16 are exposed. (2) Add API to support ragged batching (padding removed in inputs). (3) Support other input formats (like QKV_BS3NH). (4) Currently, q are converted to BSNH, k/v are converted to either BSNH or BNSH format. May do some experiment to see whether converting q to BNSH could be better in some case. ### Example Benchmark Results on H100 The following tests are on FP16 MultiHeadAttention operator without attention mask and attention bias. #### Test Setting 1 batch_size \| sequence_length \| past_sequence_length \| num_heads \| head_size -- \| -- \| -- \| -- \| -- 16 \| 256 \| 0 \| 32 \| 128 format \| average_latency \| tflops \| kernel -- \| -- \| -- \| -- Q,K,V (BNSH) \| 0.000075 \| 229.5 \| torch:flash Q,K,V (BNSH) \| 0.000119 \| 144.8 \| torch:efficient Q,K,V (BNSH) \| 0.000224 \| 76.5 \| torch:math Q,K,V (BSNH) \| 0.000075 \| 227.8 \| ort:cudnn Q,K,V (BSNH) \| 0.000094 \| 182.8 \| ort:flash Q,K,V (BSNH) \| 0.000138 \| 124.7 \| ort:efficient Q,K,V (BSNH) \| 0.000438 \| 39.3 \| ort:math Q,KV \| 0.000129 \| 133.0 \| ort:cudnn Q,KV \| 0.000151 \| 114.1 \| ort:flash Q,KV \| 0.000194 \| 88.5 \| ort:efficient QKV \| 0.000154 \| 111.8 \| ort:cudnn QKV \| 0.000175 \| 98.0 \| ort:flash QKV \| 0.000217 \| 79.0 \| ort:efficient #### Test Setting 2 batch_size \| sequence_length \| past_sequence_length \| num_heads \| head_size -- \| -- \| -- \| -- \| -- 16 \| 512 \| 0 \| 16 \| 64 format \| average_latency \| tflops \| kernel -- \| -- \| -- \| -- Q,K,V (BNSH) \| 0.000069 \| 249.2 \| torch:flash Q,K,V (BNSH) \| 0.000141 \| 121.7 \| torch:efficient Q,K,V (BNSH) \| 0.000294 \| 58.5 \| torch:math Q,K,V (BSNH) \| 0.000077 \| 221.7 \| ort:cudnn Q,K,V (BSNH) \| 0.000087 \| 196.6 \| ort:flash Q,K,V (BSNH) \| 0.000163 \| 105.6 \| ort:efficient Q,K,V (BSNH) \| 0.000651 \| 26.4 \| ort:math Q,KV \| 0.000103 \| 167.1 \| ort:cudnn Q,KV \| 0.000117 \| 146.3 \| ort:flash Q,KV \| 0.000192 \| 89.6 \| ort:efficient QKV \| 0.000113 \| 151.5 \| ort:cudnn QKV \| 0.000128 \| 134.7 \| ort:flash QKV \| 0.000201 \| 85.3 \| ort:efficient	2024-08-20 08:50:22 -07:00
Scott McKay	c97cc5c1b0	Put all external project targets under the 'External' folder in VS (#21765 ) ### Description <!-- Describe your changes. --> Handle targets in subdirectories for external projects. All targets will now go in a per-project folder under 'External' e.g. gmock and gtest now get handled correctly and are under External/googletest vs. existing setup where they ended up as top-level projects. ![image](https://github.com/user-attachments/assets/99ec259c-47cd-44f3-954d-58569c941cc2) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve developer experience.	2024-08-16 15:51:50 +10:00
Satya Kumar Jandhyala	6d8de1f7b8	Upgrade emsdk from 3.1.59 to 3.1.62 (#21421 ) ### Description Upgrade EM SDK to 3.1.62. ### Motivation and Context The changes are required to clear wasm64 errors.	2024-08-14 12:38:52 -07:00
Sumit Agarwal	c5592fdcef	[DML EP] Update DML to 1.15.1 (#21695 ) ### Description Update DML runtime binary to 1.15.1 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-08-12 14:16:43 -07:00
Julius Tischbein	1391354265	Adding CUDNN Frontend and use for CUDA NN Convolution (#19470 ) ### Description Added CUDNN Frontend and used it for NHWC convolutions, and optionally fuse activation. #### Backward compatible - For model existed with FusedConv, model can still run. - If ORT is built with cuDNN 8, cuDNN frontend will not be built into binary. Old kernels (using cudnn backend APIs) are used. #### Major Changes - For cuDNN 9, we will enable cudnn frontend to fuse convolution and bias when a provider option `fuse_conv_bias=1`. - Remove the fusion of FusedConv from graph transformer for CUDA provider, so there will not be FusedConv be added to graph for CUDA EP in the future. - Update cmake files regarding to cudnn settings. The search order of CUDNN installation in build are like the following: * environment variable `CUDNN_PATH` * `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from build.py/build.sh, user can pass it through `--cudnn_home` parameter, or by environment variable `CUDNN_HOME` if `--cudnn_home` not used. * cudnn python package installation directory like python3.xx/site-packages/nvidia/cudnn * CUDA installation path #### Potential Issues - If ORT is built with cuDNN 8, FusedConv fusion is no longer done automatically, so some model might have performance regression. If user still wants FusedConv operator for performance reason, they can still have multiple ways to walkaround: like use older version of onnxruntime; or use older version of ORT to save optimized onnx, then run with latest version of ORT. We believe that majority users have moved to cudnn 9 when 1.20 release (since the default in ORT and PyTorch is cudnn 9 for 3 months when 1.20 release), so the impact is small. - cuDNN graph uses TF32 by default, and user cannot disable TF32 through the use_tf32 cuda provider option. If user encounters accuracy issue (like in testing), user has to set environment variable `NVIDIA_TF32_OVERRIDE=0` to disable TF32. Need update the document of use_tf32 later. #### Follow ups This is one of PRs that target to enable NHWC convolution in CUDA EP by default if device supports it. There are other changes will follow up to make it possible. (1) Enable `prefer_nhwc` by default for device with sm >= 70. (2) Change `fuse_conv_bias=1` by default after more testing. (3) Add other NHWC operators (like Resize or UpSample). ### Motivation and Context The new CUDNN Frontend library provides the functionality to fuse operations and provides new heuristics for kernel selection. Here it fuses the convolution with the pointwise bias operation. On the [NVIDIA ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/) we get a performance boost from 49.1144 ms to 42.4643 ms per inference on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100-d 1 -i 'prefer_nhwc\|1' resnet50.onnx`). --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>	2024-08-02 15:16:42 -07:00
Ranjit Ranjan	82b2955268	[AIX]test failure fix using gtest-1.15.0 for AIX (#21497 ) ### Description Local CI setup for AIX reported tests failure after the gtest 1.15.0 upgrade. ### Motivation and Context Below tests failure is observed after gtest upgrade. The following tests FAILED: 1 - onnxruntime_test_all (ILLEGAL) 7 - onnxruntime_logging_apis_test (Subprocess aborted) To fix this, I am enabling pthread support under gtest. This was disabled with previous version of gtest for some reason. Now by enabling this, above tests are getting passed with gtest 1.15.0.	2024-07-27 11:17:22 -07:00
Changming Sun	f70215d4e6	Update C++ dependencies (#21410 ) 1. Update google benchmark from 1.8.3 to 1.8.5 2. Update google test from commit in main branch to tag 1.15.0 3. Update pybind11 from 2.12.0 to 2.13.1 4. Update pytorch cpuinfo to include the support for Arm Neoverse V2, Cortex X4, A720 and A520. 5. Update re2 from 2024-05-01 to 2024-07-02 6. Update cmake to 3.30.1 7. Update Linux docker images 8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error: implicit conversion loses integer precision: 'streamoff' (aka 'long long') to 'const std::streamsize' (aka 'const long') [-Werror,-Wshorten-64-to-32]	2024-07-23 10:00:36 -07:00
Sheil Kumar	dd010edb37	Update DirectML from 1.14.1 to 1.15.0 (#21323 ) Update DirectML from 1.14.1 to 1.15.0 --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-07-22 16:59:03 -07:00
Ranjit Ranjan	6c7562b097	Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133 ) ### Description Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. ### Motivation and Context changes in this PR contains: 1. Enablement code for building onnxruntime on AIX operating system. 2. while testing the build on AIX, we found issues related to big endian platform . More details about few of those issues can be found in [Big endian issue: Graph Transformation Attention Fusion tests are failing #12921](https://github.com/microsoft/onnxruntime/issues/12921) Below are list of files and the description about the change. 1. cmake/CMakeLists.txt [BUILDING on AIX issue] check for "IBMClang" is added for handling -Wno-unused-parameter 2. cmake/external/onnxruntime_external_deps.cmake [BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX 3. cmake/onnxruntime.cmake [BUILDING on AIX issue] o Blocking codes for AIX which generates generated_source.c and further requires some symbol files. o Putting NO AIX check for non-supported linker flags like --Xlinker o iconv linking 4. cmake/onnxruntime_framework.cmake [BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN' 5. cmake/onnxruntime_mlas.cmake [BUILDING on AIX issue]POWER10 releated macro/function definition . 6. cmake/onnxruntime_providers_cpu.cmake [BUILDING on AIX issue]Putting NO AIX check for non-supported linker flags like --Xlinker 7. cmake/onnxruntime_unittests.cmake [BUILDING on AIX issue] o Putting NO AIX check for non-supported linker flags like --Xlinker o Adding required libraries for AIX linker under applicatiion like onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc 8. cmake/patches/flatbuffers/flatbuffers.patch [BUILDING on AIX issue] Handling of TypeCode in include/flatbuffers/flatbuffers.h under AIX + clang 9. onnxruntime/contrib_ops/cpu/murmur_hash3.cc [Big endian issue] Byte-Conversion handlling in compute() and getblock() routines 10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc [Big endian issue] Handling of test failures . Byte swapping for quant_value. 11. onnxruntime/core/framework/tensorprotoutils.cc [Big endian issue] Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto . o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling ConvertRawDataInTensorProto() in big-endian system o ConvertRawDataInTensorProto : function used mainly on big-endian system for byte-swapping of tensor raw_data 12. onnxruntime/core/framework/tensorprotoutils.h [Big endian issue] Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto 13. onnxruntime/core/graph/graph.cc [Big endian issue] o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type o Call ConvertRawDataInTensorProto for SaveToOrtFormat 14. onnxruntime/core/mlas/lib/platform.cpp [BUILDING on AIX issue] POWER10 released enablement for AIX 15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp [BUILDING on AIX issue]Handling of __vector under AIX+clang 16. onnxruntime/core/mlas/lib/qgemm.h [BUILDING on AIX issue] Adding _AIX flag 17. onnxruntime/core/mlas/lib/qlmul.cpp [BUILDING on AIX issue] Handling of __vector under AIX+clang 18. onnxruntime/core/optimizer/attention_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 20. onnxruntime/core/optimizer/constant_folding.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 22. onnxruntime/core/optimizer/nchwc_transformer.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 26. onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 27. onnxruntime/core/optimizer/reshape_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 28. onnxruntime/core/optimizer/stft_decomposition.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 29. onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 30. onnxruntime/core/platform/path_lib.h [BUILDING on AIX issue] Moving to normal function call, instead of template 31. onnxruntime/core/platform/posix/env.cc [BUILDING on AIX issue]Blocking syscall.h in AIX 32. onnxruntime/core/session/inference_session.cc [Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN 33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc [Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer and ExternalWriteReadWithLoadInitializers 34. onnxruntime/test/framework/sparse_kernels_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 35. onnxruntime/test/framework/tensorutils_test.cc [Big endian issue] Helper method ConvertEndianessForVector and call this from required place. 36. onnxruntime/test/framework/test_tensor_loader.cc o. [BUILDING on AIX issue] Handling of getcwd for AIX o. [Big endian issue] Bytes Swapping in run_external_data_test 37. onnxruntime/test/onnx/main.cc [Big endian issue] including <thread> for AIX 38. onnxruntime/test/onnx/tensorprotoutils.cc [Big endian issue] Bytes swapping in UnpackTensorWithRawData 39. onnxruntime/test/optimizer/graph_transform_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 40. onnxruntime/test/optimizer/graph_transform_test_builder.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 41. onnxruntime/test/optimizer/graph_transform_test_builder.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 42. onnxruntime/test/optimizer/initializer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 44. onnxruntime/test/providers/base_tester.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 45. onnxruntime/test/providers/cpu/generator/random_test.cc [BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase --------- Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>	2024-07-17 12:37:06 -07:00
cloudhan	ddd4ce3cb7	[ROCm] Update ck to use ck_tile (#21030 )	2024-06-19 14:06:10 +08:00
Tianlei Wu	b3fc9b5a0e	[CUDA] upgrade cutlass to 3.5.0 (#20940 ) ### Description Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in Windows - [x] Upgrade cutlass to 3.5.0. - [x] Fix flash attention build error with latest cutlass header files and APIs. This fix is provided by @wangyems. - [x] Update efficient attention to use new cutlass fmha interface. - [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53. - [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test build error for cuda 11.8 to 12.3. - [x] Disable TRT 10 deprecate warnings. The following are not included in this PR: * TRT provider replaces the deprecated APIs. * Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This test is not built by default unless you add `--cmake_extra_defines onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command. To integrate to rel-1.18.1: Either bring in other changes (like onnx 1.16.1), or generate manifest and upload a new ONNX Runtime Build Time Deps artifact based on rel-1.18.1. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/19891 https://github.com/microsoft/onnxruntime/issues/20924 https://github.com/microsoft/onnxruntime/issues/20953	2024-06-11 13:32:15 -07:00
Changming Sun	f8b5c2805e	Update abseil-cpp.cmake: add version check (#20962 ) Some dev environments come with a preinstalled abseil. For example, conda users often do that. If the preinstalled abseil version is incompatible with what we have in cmake/deps.txt, it could result in a hard-to-understand build error. This PR adds a version check to improve that.	2024-06-06 19:42:31 -07:00
liqun Fu	51bc53580d	Update to onnx 1.16.1 (#20702 )	2024-06-04 11:06:28 -07:00
Changming Sun	b522df0ae4	Update RE2 to the latest (#20775 ) Update RE2 to the latest. To keep the components up to date.	2024-05-23 14:30:15 -07:00
Yulong Wang	036fcd93d4	[js/web] optimize module export and deployment (#20165 ) ### Description This PR make numbers of optimizations to onnxruntime-web's module export and deployment. See each section below for more details. #### Preview > [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21) > ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~ > ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~ > ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~ <details> <summary><h4>Breaking changes</h4></summary> There is no code change required, but there are a few differences regarding code import, flags, bundler config and deployment steps. #### Importing: Import table is changed. See following for details. <details> <summary><h5>Current import table:</h5></summary> \| Target Name \| Path for "import" or "require" \| WebGL \| JSEP \| wasm \| Proxy \| Training \| \|------\|-----\|-----\|-----\|-----\|-----\|-----\| \| `ort` (default) \| `onnxruntime-web` \| ✔️ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| `ort.all` \| `onnxruntime-web/experimental` \| ✔️ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| \| `ort.node` \| `onnxruntime-web` \| ❌ \| ❌ \| ✔️ \| ❌ \| ❌ \| \| `ort.training` \| `onnxruntime-web/training` \| ❌ \| ❌ \| ✔️ \| ✔️<sup>\[1]</sup> \| ✔️ \| \| `ort.wasm` \| `onnxruntime-web/wasm` \| ❌ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| `ort.wasm-core` \| `onnxruntime-web/wasm-core` \| ❌ \| ❌ \| ✔️ \| ❌ \| ❌ \| \| `ort.webgl` \| `onnxruntime-web/webgl` \| ✔️ \| ❌ \| ❌ \| ✔️<sup>\[2]</sup> \| ❌ \| \| `ort.webgpu` \| `onnxruntime-web/webgpu` \| ❌ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| * [1] didn't test. may not actually work. * [2] not working. this is a mistake in build config. </details> <details> <summary><h5>Proposed update:</h5></summary> \| Target Name \| Path for "import" or "require" \| WebGL \| JSEP \| wasm \| Proxy \| Training \| \|------\|-----\|-----\|-----\|-----\|-----\|-----\| \| `ort` (default) \| `onnxruntime-web` \| ✔️ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| `ort.all` \| ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` \| ✔️ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| \| `ort.node` \| `onnxruntime-web` \| ❌ \| ❌ \| ✔️ \| ❌ \| ❌ \| \| `ort.training` \| `onnxruntime-web/training` \| ❌ \| ❌ \| ✔️ \| ✔️ \| ✔️ \| \| `ort.wasm` \| `onnxruntime-web/wasm` \| ❌ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| ~~`ort.wasm-core`~~ \| ~~`onnxruntime-web/wasm-core`~~ \| ~~❌~~ \| ~~❌~~ \| ~~✔️~~ \| ~~❌~~ \| ~~❌~~ \| \| `ort.webgl` \| `onnxruntime-web/webgl` \| ✔️ \| ❌ \| ❌ \| ~~✔️~~ ❌ \| ❌ \| \| `ort.webgpu` \| `onnxruntime-web/webgpu` \| ❌ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| </details> #### Flags: The following flags are deprecated: - `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in build. The following flags changed their type: - `env.wasm.wasmPaths`: When using this flag as a string ( for the URL prefix ), nothing is changed. When using this flag as an object ( for per-file path override ), the type changed: ```diff - export interface Old_WasmFilePaths{ - 'ort-wasm.wasm'?: string; - 'ort-wasm-threaded.wasm'?: string; - 'ort-wasm-simd.wasm'?: string; - 'ort-training-wasm-simd.wasm'?: string; - 'ort-wasm-simd-threaded.wasm'?: string; - }; + export interface New_WasmFilePaths { + /** + * Specify the override path for the main .wasm file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .wasm file is: + * - `ort-wasm-simd-threaded.wasm` for default build + * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.wasm` for training build + / + wasm?: URL\|string; + /* + * Specify the override path for the main .mjs file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .mjs file is: + * - `ort-wasm-simd-threaded.mjs` for default build + * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.mjs` for training build + / + mjs?: URL\|string; + } ``` #### Bundler compatibility: Config changes are need for bundlers. See usage example in /js/web/test/e2e/ for Webpack, parcel and rollup. #### Deployment: - if consuming from a CDN, there is no breaking change. - if consuming from a local server, need to copy all `ort-.wasm` and `ort-.mjs` files (totally 6 files) in the dist folder. (previously only need to copy `ort-.wasm` files.) </details> <details> <summary><h4>Problems</h4></summary> There are a few problems with the current module export and deployment: - Script URL cannot be correctly inferred when imported as ESM. - Workers are forcefully encoded using Blob URL, which makes onnxruntime-web not working in CSP environment and Node.js, when using proxy or multi-threading feature. - Generated JS code (by Emscripten) is encoded using `function.toString()`, which is unstable and error-prone. - When running with a different Emscripten build, always need the build step. Making it difficult to swap artifacts in deveopment/debug. </details> <details> <summary><h4>Goals</h4></summary> - Full ESM support - Support variances of ways to import. Including: - import from HTML's `<script>` tag (IIFE format, exporting to global variable `ort`) ```html <script src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script> ``` - import from source code inside `<script type="module">` tag (ESM) ```html <script type="module"> import * as ort from "https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs"; // using 'ort' </script> ``` - import in a CommonJS project (CJS format, resolve from package.json "exports" field) ```js // myProject/main.js const ort = require('onnxruntime-web'); ``` - import in an ESM project (ESM format, resolve from package.json "exports" field) ```js // myProject/main.js (or main.mjs) import * as ort from 'onnxruntime-web'; ``` - Support popular bundlers when importing onnxruntime-web into a CJS/ESM project. - webpack (esm requires extra post-process step) - rollup - parcel (esm requires extra post-process step) - More bundlers TBD - Multi-threading support for Node.js NOTE: keeping single JavaScript file (the all-in-one bundle) is no longer a goal. This is because technically there is a conflict with the other requirements. </details> <details> <summary><h4>Important Design Decisions</h4></summary> - Drop support of single JavaScript output. - The current onnxruntime-web distribution uses a single JavaScript file to include all code. While there are a few benefits, it also creates problems as mentioned above. Since ESM is being used more and more widely, and browsers are making more restricted security checks and requirement, the old Blob based solution is going to be replaced. - To achieve the requirement, specifically, the CSP environment support, we have to offer a non Blob based solution. Therefore, we have to distribute multiple files and drop the single file solution. - Do not run parser/postprocess on Emscripten generated JavaScript. - Emscripten is evolving quickly so we should only depends on what's in its documentation instead of a certain implementation details. (for example, currently we patch on its code to deal with a special variable `_scriptDir`) - Keep the generated files as-is also helps to: - reduce the size of ort.min.js - make it easier to replace build artifacts when in development/debug - Drop support for non-SIMD and non-MultiThread. This helps to reduce the number of artifacts in distribution. - (fixed-sized) SIMD is supported in any mainstream JS environment. - Multi-thread as WebAssembly feature is supported in any mainstream JS environment. In some environment the feature is guarded with cross origin policy, but it can still work if not trying to create any worker. - Use ESM output for Emscripten generated JavaScript. - There are 2 ways to dynamically import classic (umd) modules and neither of them are recommended: - dynamically creating a <script> tag. This changes the HTML structure and have quite a lot of compatibility issue - use `fetch()` and `eval()`. However `eval` is strongly suggested to be avoid because there is a great perf hit. - importing ESM is super easy - just use the `import()` call. Considering ESM is widely supported in modern browsers and Node.js this is the better option. - Add Blob based solution as a fallback for cross-origin workers. - There are still wide use case of importing onnxruntime-web from CDN. In this usage, make it able create worker by using `fetch()`+`Blob` to create a same-origin Blob URL. </details> <details> <summary><h4>Distribution File Manifest</h4></summary> The distribution folder contains the following files: - WebAssembly artifacts. These files are the result of compiling the ONNX Runtime C++ code to WebAssembly by Emscripten. \| File Name \| Build Flags \| \|------\|-----\| \| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm \| `--enable_wasm_simd` <br/> `--enable_wasm_threads` \| \| ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm \| `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` \| \| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm \| `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` \| - onnxruntime-web JavaScript artifacts. These files are generated by ESBuild as the entry point for onnxruntime-web. There are multiple build targets for different use cases: \| Target Name \| Path for "import" or "require" \| Description \| \|------\|-----\|-----\| \| `ort` \| `onnxruntime-web` \| The default target. \| \| `ort.all` \| `onnxruntime-web/all` \| The target including webgl. \| \| `ort.node` \| `onnxruntime-web` \| The default target for Node.js. \| \| `ort.training` \| `onnxruntime-web/training` \| The target including training APIs \| \| `ort.wasm` \| `onnxruntime-web/wasm` \| The target including only WebAssembly (CPU) EP \| \| `ort.webgl` \| `onnxruntime-web/webgl` \| The target including only WebGL EP \| For each target, there are multiple files generated: \| File Name \| Description \| \|------\|-----\| \| [target].js \| The entry point for the target. IIFE and CommonJS format. \| \| [target].mjs \| The entry point for the target. ESM format. \| \| [target].min.js <br/> [target].min.js.map \| The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. \| \| [target].min.mjs <br/> [target].min.mjs.map \| The entry point for the target. Minimized with sourcemap. ESM format. \| \| [target].proxy.mjs \| (if appliable) The proxy ESM module for the target. \| \| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map \| (if appliable) The proxy ESM module for the target. Minimized with sourcemap. \| </details> <details> <summary><h4>Dynamic Import Explained</h4></summary> - Local Served \| No Proxy: ``` [Bundle or ort.min.js] \| + import()--> [ort-wasm-simd-threaded.mjs] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Local Served \| Proxy: ``` [Bundle or ort.min.js] \| + import()--> [ort.proxy.min.mjs] \| + new Worker()--> [ort.proxy.min.mjs (worker)] \| + import()--> [ort-wasm-simd-threaded.mjs] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin \| No Proxy: ``` [Bundle or ort.min.js] \| + fetch('ort-wasm-simd-threaded.mjs') \| + URL.createObjectURL(res.blob()) \| + import()--> [blob:... (ort-wasm-simd-threaded)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin \| Proxy ``` [Bundle or ort.min.js] \| + fetch('ort.proxy.min.mjs') \| + URL.createObjectURL(res.blob()) \| + import()--> [blob:... (ort.proxy)] \| + new Worker()--> [blob:... (ort.proxy) (worker)] \| + fetch('ort-wasm-simd-threaded.mjs') \| + URL.createObjectURL(res.blob()) \| + import()--> [blob:... (ort-wasm-simd-threaded)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` </details>	2024-05-20 09:51:16 -07:00
Scott McKay	8d09baf49f	Clarify when protobuf dependency builds protoc (#20542 ) ### Description <!-- Describe your changes. --> Currently figuring out if the protobuf dependency is building protoc it is a little obtuse and inconsistent * in some places we directly set protobuf_BUILD_PROTOC_BINARIES to OFF to indicate the protobuf dependency is not building protoc * e.g. macOS/iOS/visionOS builds * for a user provided protoc path we don't set protobuf_BUILD_PROTOC_BINARIES, and inside protobuf_function.cmake that determines if `protobuf::protoc` is added as a dependency or not * `0dda8b0c44/cmake/external/protobuf_function.cmake (L40-L45)` To be more consistent/explicit, set protobuf_BUILD_PROTOC_BINARIES to OFF when ONNX_CUSTOM_PROTOC_EXECUTABLE set and valid. Remove outdated script that built and external protoc binary which was used in later builds. The build setup will fetch a pre-built protoc so there's no need for this additional build. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Make it easier to figure out if protoc is coming from the protobuf dependency.	2024-05-08 08:30:11 +10:00
Yulong Wang	a457c1df80	upgrade emsdk to 3.1.57 (#20295 ) ### Description upgrade emsdk to 3.1.57	2024-04-19 23:05:18 -07:00
Patrice Vignola	12569626cb	Update DML to 1.14.1 (#20380 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-04-18 22:43:41 -07:00
Patrice Vignola	745b426c60	[DML] Update DML to 1.14 (#20304 ) I am prefiring this change to pre-run the non-dml checks, and also to give folks the time to review it before DML gets released. When DML 1.14 officially releases, we'll only need to run the DML pipeline to automatically pick up the nuget package. This should save us some valuable time. Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15 will come soon after.	2024-04-18 16:22:57 -07:00
liqun Fu	cd7112f800	Integration with ONNX 1.16.0 (#19745 ) ### Description update with ONNX 1.16.0 branch according to https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md ONNX 1.16.0 release notes: https://github.com/onnx/onnx/releases/tag/v1.16.0 #### Updated ops for CPU EP: - DequantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block dequantization support - QuantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block quantization support - Cast(21) - Missing int4 and uint4 support - CastLike(21) - Missing int4 and uint4 support - ConstantOfShape(21) - Missing int4 and uint4 support - Identity(21) - Missing int4 and uint4 support - If(21) - Missing int4 and uint4 support - Loop(21) - Missing int4 and uint4 support - Reshape(21) - Missing int4 and uint4 support - Scan(21) - Missing int4 and uint4 support - Shape(21) - Missing int4 and uint4 support - Size(21) - Missing int4 and uint4 support - Flatten(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Pad(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Squeeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Transpose(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Unsqueeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support #### Unimplemented opset 21 features/ops - int4 and uint4 data type - QLinearMatMul(21) - GroupNormalization(21) - ai.onnx.ml.TreeEnsemble(5) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> ### Disabled tests #### ORT Training orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py - test_ort_custom_ops: Potential shape inference bug for custom ops #### Python quantization unit tests test/onnx/python/quantization (shape inference bug) - test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16 - test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16 - test_op_gemm.py: test_quantize_qop_gemm_s8s8 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3 - test_op_matmul.py: test_quantize_matmul_u8u8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy - test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile - test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution - test_op_relu.py: test_quantize_qop_relu_s8s8 #### ONNX tests - test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a maxpool output size bug and added this test. Enable this test when [ORT PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged. Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741). - test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op ai.onnx.ml.TreeEnsemble - test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same - test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same - test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same - test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4 yet - test_cast_INT4_to_INT8_cpu: same - test_cast_UINT4_to_FLOAT_cpu: same - test_cast_UINT4_to_UINT8_cpu: same - test_cast_INT4_to_FLOAT_cuda - test_cast_INT4_to_INT8_cuda - test_cast_UINT4_to_FLOAT_cuda - test_cast_UINT4_to_UINT8_cuda - test_constantofshape_float_ones_cuda: ConstantOfShape(21) not implemented for cuda - test_constantofshape_int_shape_zero_cuda: same - test_constantofshape_int_zeros_cuda: same - test_flatten_axis0_cuda: Flatten(21) not implemented for cuda - test_flatten_axis1_cuda: same - test_flatten_axis2_cuda: same - test_flatten_axis3_cuda: same - test_flatten_default_axis_cuda: same - test_flatten_negative_axis1_cuda: same - test_flatten_negative_axis2_cuda: same - test_flatten_negative_axis3_cuda: same - test_flatten_negative_axis4_cuda: same - test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not implemented in ORT yet - test_qlinearmatmul_2D_int8_float32_cpu: same - test_qlinearmatmul_2D_uint8_float16_cpu: same - test_qlinearmatmul_2D_uint8_float32_cpu: same - test_qlinearmatmul_3D_int8_float16_cpu: same - test_qlinearmatmul_3D_int8_float32_cpu: same - test_qlinearmatmul_3D_uint8_float16_cpu: same - test_qlinearmatmul_3D_uint8_float32_cpu: same - test_qlinearmatmul_2D_int8_float16_cuda: same - test_qlinearmatmul_2D_int8_float32_cuda: same - test_qlinearmatmul_2D_uint8_float16_cuda: same - test_qlinearmatmul_2D_uint8_float32_cuda: same - test_qlinearmatmul_3D_int8_float16_cuda: same - test_qlinearmatmul_3D_int8_float32_cuda: same - test_qlinearmatmul_3D_uint8_float16_cuda: same - test_qlinearmatmul_3D_uint8_float32_cuda: same - test_size_cuda: Size(21) not implemented for cuda - test_size_example_cuda: same - test_dequantizelinear_blocked: Missing implementation for block dequant for DequantizeLinear(21) - test_quantizelinear_blocked_asymmetric: Missing implementation for block quant for QuantizeLinear(21) - test_quantizelinear_blocked_symmetric: Missing implementation for block quant for QuantizeLinear(21) --------- Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: George Wu <jywu@microsoft.com> Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-04-12 09:46:49 -07:00
Jeff Bloomfield	2f31560430	Enable generic feature level devices in DML EP (#20114 ) ### Description Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX headers through the DirectX-Headers repo. Note that this includes an update to cgamanifest.json for onnx-tensorrt which is triggered during re-generation due to a prior changes to deps.txt. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-29 14:37:30 -07:00
Changming Sun	dafbef3a21	CMake: support reading dependency zip files from a local mirror (#20005 ) ### Description To test this feature, run ```bat python cmake\deps_update_and_upload.py --root-path mirror ``` Then run build.py as usual. The zip files will be cached local. To avoid being downloaded again and again.	2024-03-21 17:58:59 -07:00
Yufeng Li	15219e2e71	turn on neural_speed by default (#19627 ) ### Description <!-- Describe your changes. --> the crash caused by the neural_speed turns out to be a very corn case. Turn it on by default. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-20 12:49:58 -07:00
Adam Louly	32558134a9	[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770 ) ### Description Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers ### Motivation and Context This PR includes changes that will make ort handle 2GB+ checkpoints. To do that we need to upgrade flatbuffers to 23.5.9 - https://github.com/google/flatbuffers/pull/7945 - Modified the commitHash and the hash for the new version - Removed the patch for rust generator's unused variable warning as it is no longer producing this - [Check it out here](`d121e09d89/src/idl_gen_rust.cpp`) - Updated the VerifyField calls with alignment values that were introduced in the new version. --------- Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>	2024-03-14 16:36:24 -07:00
Changming Sun	1fb6cbddee	Add a build patch for Windows ARM64EC (#19898 ) ### Description Add a patch for Windows ARM64EC ### Motivation and Context Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc	2024-03-14 08:50:42 -07:00
Scott McKay	978c40d853	Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723 ) ### Description <!-- Describe your changes. --> If the EP handles QDQ node units, we need to make sure we do not split those into different partitions. Update the partitioning utils to be QDQ aware. If there are node units we process the logical nodes they represent instead of individual nodes. This ensure we process all nodes in a QDQ node unit at the same time so that they are always in the same partition. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix one of the issues in #19590 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-12 10:55:49 +10:00
Changming Sun	a0521f899e	Enable CPUINFO for all Windows build (#19655 ) ### Description It was disabled in PR #9065. And the reason was: " api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8 and was added in Windows 10, so cpuinfo breaks our Windows 8 support. I'm disabling it again." We no longer support Windows 8. Therefore we can add CPUINFO back. ### Motivation and Context To make the code simpler. If in any case the library doesn't work as expected, we can submit a PR to their code base and fix it.	2024-03-01 16:23:20 -08:00
Maximilian Müller	c20ced4132	Use CMake's find package for CUDA libs (#19673 ) ### Description Answers issue #19640 More details are in the issue, basically I am changing all the include directory and link directory usage to CMake's `CUDA::*` targets	2024-02-27 11:26:48 -08:00
Changming Sun	1007d8f3d1	Revert "Revert NeuralSpeed code for x64 MatMulNBits (#19382 )" (#19474 ) This reverts commit `0d10c7f3c1`.	2024-02-09 09:24:54 -08:00
luoyu-intel	0d10c7f3c1	Revert NeuralSpeed code for x64 MatMulNBits (#19382 ) ### Description <!-- Describe your changes. --> Revert PR#19016 https://github.com/microsoft/onnxruntime/pull/19016 Revert PR#17669 https://github.com/microsoft/onnxruntime/pull/17669	2024-02-07 13:04:37 -08:00
Scott McKay	debd1cab10	Add coremltools 7.1 as a dependency (#19389 ) ### Description <!-- Describe your changes. --> Setup usage of coremltools via dependencies instead of copying files. Pull in some changes from https://github.com/microsoft/onnxruntime/pull/19347 in preparation for supporting ML Program and enabling building the ML Model on all platforms to make development and testing of CoreML EP code easier. - Update to coremltools 7.1 - Add patch for changes required for cross platform build of ML Program related code - Generate coreml proto files on all platforms - mainly to test these changes work everywhere, as the proto files will be used on all platforms when #19347 is checked in - rename onnxruntime_coreml_proto target to coreml_proto as it contains purely coreml protobuf code with no ORT related chagnes ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve setup.	2024-02-03 09:42:21 +10:00
He Li	1bdd7d9499	Update oneDNN to v3.0.1 in order to support gcc 13 (#19344 ) ### Description Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug hindering gcc 13. ### Motivation and Context Referring to [oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548). - When building with `--use_dnnl` using gcc 13.x, it will fail due to this upstream issue. - This is fixed in `v3.0.1` [tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this commit](`1d7971ce48`).	2024-02-01 15:39:03 -08:00
Tianlei Wu	8b4517218b	Remove USE_CUTLASS flag (#19271 ) ### Description Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for onnxruntime CUDA build), there is no need to have a flag to disable cutlass. Changes: (1) Reverted https://github.com/microsoft/onnxruntime/pull/18761 (2) remove the condition to build cutlass. (3) Fix a few build errors or warnings during testing CUDA 11.4 build. Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later. Flash attention and cutlass fused multihead attention will not be built for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if you want to support latest GPUs. It is better to include it in 1.17.0 (otherwise, the release branch might encounter build failure with CUDA 11.4). Tests: (1) Build with flash attention and efficient attention off: passed (2) Build with CUDA 11.4: passed Example build command used in Ubuntu 20.04: ``` export CUDA_HOME=/usr/local/cuda-11.4 export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/ export CUDACXX=/usr/local/cuda-11.4/bin/nvcc sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \ --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \ --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \ --disable_types float8 ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-25 16:57:58 -08:00
Phoebe Chen	4477f57ee3	Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238 ) ### Description This pull request introduces the necessary changes to enable RISC-V 64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V architecture has gained popularity as an open standard instruction set architecture, and this contribution aims to extend ONNX Runtime's compatibility to include RISC-V, thereby broadening the reach of ONNX models to a wider range of devices. ### Motivation and Context RISC-V is a free and open-source instruction set architecture (ISA) based on established RISC principles. It is provided under open licenses without fees. Due to its extensibility and freedom in both software and hardware, RISC-V is poised for widespread adoption in the future, especially in applications related to AI, parallel computing, and data centers. ### Example Build Command ``` ./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests ``` ### Documentation Updates Relevant sections of the documentation will be updated to reflect the newly supported RISC-V 64-bit cross-compilation feature. https://github.com/microsoft/onnxruntime/pull/19239 --------- Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-24 16:27:05 -08:00

1 2 3 4 5 ...

472 commits