onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-23 22:13:38 +00:00

Author	SHA1	Message	Date
kailums	1b38c05544	change ci docker image to rocm6.1 (#21296 ) ### Description Kernels hit a bug when running on rocm6.0, so change the CI docker image to rocm6.1. For the torch installed in the docker image, change to the rocm repo when the version is not 6.0.	2024-07-18 14:50:01 +08:00
Ranjit Ranjan	6c7562b097	Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133 ) ### Description Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. ### Motivation and Context changes in this PR contains: 1. Enablement code for building onnxruntime on AIX operating system. 2. while testing the build on AIX, we found issues related to big endian platform . More details about few of those issues can be found in [Big endian issue: Graph Transformation Attention Fusion tests are failing #12921](https://github.com/microsoft/onnxruntime/issues/12921) Below are list of files and the description about the change. 1. cmake/CMakeLists.txt [BUILDING on AIX issue] check for "IBMClang" is added for handling -Wno-unused-parameter 2. cmake/external/onnxruntime_external_deps.cmake [BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX 3. cmake/onnxruntime.cmake [BUILDING on AIX issue] o Blocking codes for AIX which generates generated_source.c and further requires some symbol files. o Putting NO AIX check for non-supported linker flags like --Xlinker o iconv linking 4. cmake/onnxruntime_framework.cmake [BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN' 5. cmake/onnxruntime_mlas.cmake [BUILDING on AIX issue]POWER10 releated macro/function definition . 6. cmake/onnxruntime_providers_cpu.cmake [BUILDING on AIX issue]Putting NO AIX check for non-supported linker flags like --Xlinker 7. cmake/onnxruntime_unittests.cmake [BUILDING on AIX issue] o Putting NO AIX check for non-supported linker flags like --Xlinker o Adding required libraries for AIX linker under applicatiion like onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc 8. cmake/patches/flatbuffers/flatbuffers.patch [BUILDING on AIX issue] Handling of TypeCode in include/flatbuffers/flatbuffers.h under AIX + clang 9. 
onnxruntime/contrib_ops/cpu/murmur_hash3.cc [Big endian issue] Byte-Conversion handlling in compute() and getblock() routines 10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc [Big endian issue] Handling of test failures . Byte swapping for quant_value. 11. onnxruntime/core/framework/tensorprotoutils.cc [Big endian issue] Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto . o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling ConvertRawDataInTensorProto() in big-endian system o ConvertRawDataInTensorProto : function used mainly on big-endian system for byte-swapping of tensor raw_data 12. onnxruntime/core/framework/tensorprotoutils.h [Big endian issue] Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto 13. onnxruntime/core/graph/graph.cc [Big endian issue] o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type o Call ConvertRawDataInTensorProto for SaveToOrtFormat 14. onnxruntime/core/mlas/lib/platform.cpp [BUILDING on AIX issue] POWER10 released enablement for AIX 15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp [BUILDING on AIX issue]Handling of __vector under AIX+clang 16. onnxruntime/core/mlas/lib/qgemm.h [BUILDING on AIX issue] Adding _AIX flag 17. onnxruntime/core/mlas/lib/qlmul.cpp [BUILDING on AIX issue] Handling of __vector under AIX+clang 18. onnxruntime/core/optimizer/attention_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 20. onnxruntime/core/optimizer/constant_folding.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 22. 
onnxruntime/core/optimizer/nchwc_transformer.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 26. onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 27. onnxruntime/core/optimizer/reshape_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 28. onnxruntime/core/optimizer/stft_decomposition.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 29. onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 30. onnxruntime/core/platform/path_lib.h [BUILDING on AIX issue] Moving to normal function call, instead of template 31. onnxruntime/core/platform/posix/env.cc [BUILDING on AIX issue]Blocking syscall.h in AIX 32. onnxruntime/core/session/inference_session.cc [Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN 33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc [Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer and ExternalWriteReadWithLoadInitializers 34. onnxruntime/test/framework/sparse_kernels_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 35. onnxruntime/test/framework/tensorutils_test.cc [Big endian issue] Helper method ConvertEndianessForVector and call this from required place. 36. 
onnxruntime/test/framework/test_tensor_loader.cc o. [BUILDING on AIX issue] Handling of getcwd for AIX o. [Big endian issue] Bytes Swapping in run_external_data_test 37. onnxruntime/test/onnx/main.cc [Big endian issue] including <thread> for AIX 38. onnxruntime/test/onnx/tensorprotoutils.cc [Big endian issue] Bytes swapping in UnpackTensorWithRawData 39. onnxruntime/test/optimizer/graph_transform_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 40. onnxruntime/test/optimizer/graph_transform_test_builder.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 41. onnxruntime/test/optimizer/graph_transform_test_builder.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 42. onnxruntime/test/optimizer/initializer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 44. onnxruntime/test/providers/base_tester.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 45. onnxruntime/test/providers/cpu/generator/random_test.cc [BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase --------- Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>	2024-07-17 12:37:06 -07:00
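The byte swapping that the big-endian fixes above rely on can be illustrated in a few lines. This is a minimal Python sketch, not the C++ `ConvertRawDataInTensorProto` implementation: the helper names `swap_element_bytes` and `raw_data_to_native` are hypothetical, and the sketch assumes `raw_data` holds fixed-size elements stored in little-endian order, as ONNX specifies.

```python
import sys


def swap_element_bytes(raw: bytes, element_size: int) -> bytes:
    """Reverse the byte order of every fixed-size element in `raw`."""
    if element_size <= 0 or len(raw) % element_size != 0:
        raise ValueError("raw length must be a multiple of element_size")
    swapped = bytearray(len(raw))
    for start in range(0, len(raw), element_size):
        end = start + element_size
        swapped[start:end] = raw[start:end][::-1]
    return bytes(swapped)


def raw_data_to_native(raw: bytes, element_size: int) -> bytes:
    """Convert little-endian raw tensor bytes to the host byte order."""
    # Only big-endian hosts need a swap; little-endian hosts use the bytes as-is.
    if sys.byteorder == "big":
        return swap_element_bytes(raw, element_size)
    return raw
```

Swapping per element (rather than reversing the whole buffer) keeps the element order intact, which is why the element size has to be known.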
Tianlei Wu	0f4c39ec47	[ROCM] adjust test_flash_attn_rocm test tolerance (#21379 ) The test_flash_attn_rocm.py from https://github.com/microsoft/onnxruntime/pull/21032 failed frequently. For example, I saw two failed jobs today: E Max absolute difference: 0.002167 E Max absolute difference: 0.002686 Adjust the abs threshold from 0.002 to 0.005, and use default relative tolerance rtol=0.001.	2024-07-17 07:35:12 -07:00
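To see why the two reported differences fail at the old threshold and pass at the new one, here is a small sketch. It assumes the test combines the tolerances in the usual `allclose` style, `|actual - expected| <= atol + rtol * |expected|`; the helper `within_tolerance` and the reference value `0.1` are hypothetical, and the old relative tolerance is not stated in the commit, so `rtol=0.001` is reused for both cases.

```python
def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Return True when |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical reference value; the two differences come from the failed jobs.
expected = 0.1
for diff in (0.002167, 0.002686):
    actual = expected + diff
    old_ok = within_tolerance(actual, expected, atol=0.002, rtol=0.001)
    new_ok = within_tolerance(actual, expected, atol=0.005, rtol=0.001)
    print(f"diff={diff}: atol=0.002 ok={old_ok}, atol=0.005 ok={new_ok}")
```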
vraspar	fa287042ca	Add ML Program support for transpose op (#21364 ) ### Description Add support for transpose op ### Motivation and Context Enable support for Autodesk model	2024-07-16 16:34:58 -07:00
Tianlei Wu	760a31c848	Exclude blkq4_fp16_gemm_sm80_test in cuda 12.5 build (#21373 ) There are build errors when building with CUDA 12.5 and `--cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON`. Temporarily exclude blkq4_fp16_gemm_sm80_test to unblock the CUDA 12.5 build.	2024-07-16 15:58:11 -07:00
Yueqing Zhang	dcc04367b7	[VitisAI] fix graph save (#21293 ) ### Description Revert the wrong change in https://github.com/microsoft/onnxruntime/pull/20920 ### Motivation and Context That change would save the data at the wrong position.	2024-07-16 13:48:29 -07:00
Jing Fang	5df4ddd1c3	matmul 4bit tool chain support qdq (#21362 ) ### Description This is a partial change ported from fajin/qdqmatmulnbitstoolchain. That branch has issues resolving the web CI. MatMulNBits is a heavily optimized matmul operation. Currently a MatMul can be converted to MatMulNBits to speed up model inference. However, MatMulNBits is an ORT-only op. To make the graph compatible with ONNX ops and utilize MatMulNBits at the same time, we introduce Q/DQ support for MatMulNBits. To convert MatMul ops in a model to MatMulNBits: 1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using QDQ mode. 2. In the ORT session, DQ + MatMul is fused to MatMulNBits. #### Note MatMulNBits assumes the B weight is uint4. When no zp is provided, zp defaults to 8, which is different from DQ. DQ defaults zp to 0 when no zp is provided, and DQ supports int4. Therefore some conversions are introduced during the DQ + MatMul --> MatMulNBits step. #### Perf Using QDQ format will increase the model initialization time and memory consumption. With the current implementation, model init time increased from ~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB. The memory increase is due to: 1. In the optimizer, after transposing the B weight, an in-memory tensor proto is created using protobuf's arena. 2. In the finalize step, when saving initializers and prepacking, the ORT arena is used to create buffers for initializers. The memory allocated by arenas cannot be fully deallocated. If ORT arena memory allocation is disabled, the memory consumption of both the QDQ format and the original format is ~2.2GB. The time increase is mainly due to multiple memory copies, but can be further optimized. ### Motivation and Context Please see description for details.	2024-07-16 10:34:19 -07:00
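The zero-point mismatch described in the note can be made concrete with a small sketch. It shows one way a signed 4-bit weight with DQ's default zero point of 0 can be re-expressed as an unsigned 4-bit weight with zero point 8 without changing the dequantized value. The helpers `dequantize` and `int4_to_uint4` are hypothetical illustrations and are not taken from the fusion code, which may perform the conversion differently.

```python
from typing import Tuple


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Standard affine dequantization: (q - zero_point) * scale."""
    return (q - zero_point) * scale


def int4_to_uint4(q_signed: int, zp_signed: int = 0) -> Tuple[int, int]:
    """Shift a signed 4-bit value and its zero point into the unsigned range."""
    if not -8 <= q_signed <= 7:
        raise ValueError("int4 values lie in [-8, 7]")
    # Adding 8 to both the value and the zero point leaves (q - zp) unchanged.
    return q_signed + 8, zp_signed + 8


scale = 0.5
for q in (-8, 0, 7):
    q_u, zp_u = int4_to_uint4(q)
    print(q, dequantize(q, scale, 0), q_u, zp_u, dequantize(q_u, scale, zp_u))
```

Because the shift cancels inside `(q - zero_point)`, the signed and unsigned representations dequantize to the same number.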
Changming Sun	8568a67673	Fix a build error when CUDA is enabled and onnxruntime_DISABLE_CONTRIB_OPS is ON (#21285 ) Resolve #21204 To reproduce the issue, build the code with ``` python3 tools/ci_build/build.py --build_dir /tmp/build13 --config Debug --skip_submodule_sync --build_shared_lib --parallel --use_binskim_compliant_compile_flags --build_csharp --enable_onnx_tests --update --build --build_wheel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --cmake_extra_defines onnxruntime_DISABLE_CONTRIB_OPS=ON onnxruntime_BUILD_UNIT_TESTS=OFF --skip_tests ``` Then run the following python script: ```python #!/usr/bin/python3 import onnxruntime as ort providers = [("CUDAExecutionProvider")] ort_sess = ort.InferenceSession('/data/onnx/opset17/test_gemm_default_no_bias/model.onnx', providers=providers) ``` Without this fix, you will see an error: Failed to load library libonnxruntime_providers_cuda.so with error: /tmp/build18/Debug/onnxruntime/capi/libonnxruntime_providers_cuda.so: undefined symbol: _ZN11onnxruntime4cuda21BuildKernelCreateInfoINS0_57kCudaExecutionProvider_GridSample_kOnnxDomain_ver16_floatEEENS_16KernelCreateInfoEv	2024-07-16 10:05:33 -07:00
vraspar	218301403d	Add ML Program support for basic activation ops (#21326 ) ### Description Add support for: - Sigmoid - Relu - Tanh ### Motivation and Context Enable support for Autodesk model	2024-07-15 22:30:20 -07:00
George Wu	4005d12ed4	add vitisai ep build stage to Windows CPU Pipeline (#21361 ) To prevent VitisAI EP build breaks, add a stage to the Windows CPU CI Pipeline that builds the Vitis AI EP on Windows. There are no external dependencies for builds. Tests have to be disabled, though, as the EP has external SW/HW dependencies. This will at least allow us to prevent build breaks, which have happened on multiple occasions recently. Tested at https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1432346&view=results and it seems to run fine.	2024-07-15 19:34:08 -07:00
Adrian Lizarraga	cf565e955d	Revert "Fix ETW Sink Initialize unproperly locking" (#21360 ) Reverts microsoft/onnxruntime#21226 Causes any onnxruntime app to hang on Windows ARM64. Our pipelines do not have the same ETW environment, so we couldn't catch it. ![image](https://github.com/user-attachments/assets/80edbf7d-be50-4cb0-a016-f390b81dc798) The call to TraceLoggingRegisterEx() recursively calls back into LazyInitialize(): LazyInitialize() -> TraceLoggingRegisterEx() -> ORT_TL_EtwEnableCallback() -> Instance() -> LazyInitialize() The original code got out of the recursive loop by checking the `initialized_` flag.	2024-07-15 17:56:08 -07:00
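The re-entrant call chain in the revert above can be modelled with a toy class. This is a hedged sketch only: `EtwSink`, `_register`, and the decision to set the flag before registering are assumptions made for illustration, and the real C++ code may order these steps differently. It shows how an early check of an `initialized` flag lets the nested `lazy_initialize()` call return instead of recursing (or blocking on a lock that the outer call already holds).

```python
class EtwSink:
    """Toy stand-in for a lazily initialized logging sink (hypothetical)."""

    def __init__(self) -> None:
        self.initialized = False
        self.register_calls = 0

    def lazy_initialize(self) -> None:
        # Checking the flag first lets a re-entrant call return immediately.
        if self.initialized:
            return
        # Assumption: the flag is set before registering, because registration
        # synchronously calls back into instance().
        self.initialized = True
        self._register(callback=self.instance)

    def instance(self) -> "EtwSink":
        self.lazy_initialize()
        return self

    def _register(self, callback) -> None:
        # Mimics a registration API that invokes the enable callback right away.
        self.register_calls += 1
        callback()


sink = EtwSink()
sink.instance()
print(sink.register_calls)  # prints 1
```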
Jing Fang	50170c697e	[Optimizer] DQ + MatMul to MatMulNBits support: kernel changes (#21342 ) ### Description This is a partial change ported from fajin/qdqmatmulnbitstoolchain. That branch has issues resolving the web CI. MatMulNBits is a heavily optimized matmul operation. Currently a MatMul can be converted to MatMulNBits to speed up model inference. However, MatMulNBits is an ORT-only op. To make the graph compatible with ONNX ops and utilize MatMulNBits at the same time, we introduce Q/DQ support for MatMulNBits. To convert MatMul ops in a model to MatMulNBits: 1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using QDQ mode. 2. In the ORT session, DQ + MatMul is fused to MatMulNBits. #### Note MatMulNBits assumes the B weight is uint4. When no zp is provided, zp defaults to 8, which is different from DQ. DQ defaults zp to 0 when no zp is provided, and DQ supports int4. Therefore some conversions are introduced during the DQ + MatMul --> MatMulNBits step. #### Perf Using QDQ format will increase the model initialization time and memory consumption. With the current implementation, model init time increased from ~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB. The memory increase is due to: 1. In the optimizer, after transposing the B weight, an in-memory tensor proto is created using protobuf's arena. 2. In the finalize step, when saving initializers and prepacking, the ORT arena is used to create buffers for initializers. The memory allocated by arenas cannot be fully deallocated. If ORT arena memory allocation is disabled, the memory consumption of both the QDQ format and the original format is ~2.2GB. The time increase is mainly due to multiple memory copies, but can be further optimized. ### Motivation and Context Please see description for details.	2024-07-15 15:25:40 -07:00
Jian Chen	c03e6fff4c	Combining android build and test step into one job (#21340 ) ### Description Combining android build and test step into one job ### Motivation and Context Reduce runtime by removing additional machine allocation, and artifact uploading and downloading. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-07-15 14:44:03 -07:00
Yifan Li	db9ee35963	[TensorRT EP] c4996 suppression to build with trt10.2ga on Windows (#21358 ) ### Description Suppress the C4996 deprecated-API warning (otherwise treated as an error) as a workaround to build ORT with TRT10.2GA on Windows ### Motivation and Context Four APIs that the core code of the TRT EP uses were recently declared deprecated. Temporarily suppress the deprecated-API warnings until these APIs are updated	2024-07-15 14:30:02 -07:00
Changming Sun	e5f18ba2c1	Change libonnxruntime.so's SONAME: remove the minor and patch version. (#21339 ) ### Description Resolve #21281 and #10589 . 1. Change libonnxruntime.so's SONAME: remove the minor and patch version. By default, when creating an ELF shared object, the linker sets the file's internal DT_SONAME field to the specified name, which is the file name plus SOVERSION. For example, the file name for our library is libonnxruntime.so, and by default SOVERSION is the lib's VERSION number, which is something like 1.19.0. So the DT_SONAME field in libonnxruntime.so is something like libonnxruntime.so.1.18.0. You can use the readelf tool to examine it. ``` readelf -d libonnxruntime.so | grep SONAME 0x000000000000000e (SONAME) Library soname: [libonnxruntime.so.1.18.0] ``` When an executable is linked with a shared object that has a DT_SONAME field, then at run time the dynamic linker will attempt to load the shared object specified by the DT_SONAME field rather than using the file name (which is libonnxruntime.so) given to the linker. After this change, the SONAME is shortened to "libonnxruntime.so.1" instead. 2. Set default version strings for Windows DLLs, to resolve #10589	2024-07-15 14:21:34 -07:00
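The naming rule in the SONAME commit above (file name plus SOVERSION, with SOVERSION reduced to the major version after the change) can be written out as a tiny sketch. The function `soname` is a hypothetical helper for illustration; it only mirrors the string construction and does not inspect or modify any ELF file.

```python
def soname(file_name: str, version: str, major_only: bool) -> str:
    """Build an SONAME-style string from a library file name and a version."""
    # Keep only the major component when major_only is set.
    suffix = version.split(".")[0] if major_only else version
    return f"{file_name}.{suffix}"


print(soname("libonnxruntime.so", "1.18.0", major_only=False))  # libonnxruntime.so.1.18.0
print(soname("libonnxruntime.so", "1.18.0", major_only=True))   # libonnxruntime.so.1
```

With the major-only form, an executable linked against one 1.x build records `libonnxruntime.so.1` and can therefore load a later 1.x build without relinking.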
Edward Chen	9c2b85ad58	Fix Android build on Windows (#21304 ) - Pass a list of files instead of path separator-delimited string to project.files(). See this issue: https://github.com/gradle/gradle/issues/19817 - Check for host (instead of target) being Windows when using fallback patch program.	2024-07-15 12:29:02 -07:00
Changming Sun	dfaf18928a	Fix a path problem in onnxruntime_perf_test (#21341 ) ### Description Resolve #21267 . onnxruntime_perf_test does not work properly if the input model path url is just a single filename without any path separator. For example, ``` ./onnxruntime_perf_test -t 10 model.onnx ``` The problem was introduced in #19196 by me.	2024-07-15 10:47:02 -07:00
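The commit message above does not say how the bare-filename case broke, so the following is only an illustration of a common pitfall of that kind, not the C++ fix itself: the directory part of a path with no separator is empty, and code that later joins or opens that directory needs a fallback. `model_directory` is a hypothetical helper.

```python
import os


def model_directory(model_path: str) -> str:
    """Return the directory that contains `model_path`."""
    # os.path.dirname returns "" for a bare filename, so fall back to ".".
    return os.path.dirname(model_path) or "."


print(model_directory("model.onnx"))
print(model_directory(os.path.join("models", "model.onnx")))
```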
glen-amd	281ed8c12d	VitisAI EP Context Model (#20926 ) # Why so many commits - Runtime debugging - which is necessary - Three different approaches to EP context model - as a result testing back and forth - Windows compatibility issues - this development has been done on Linux for convenience # "Open" (?) questions - Full offloading to a specific EP - Dumping EP context models by EPs vs [by ONNXRT](`e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L725)`) - [Node name to pick nodes](`e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L654)`) # VitisAI EP made three variant implementations that have respective pros and cons (and of course we can combine them) ## Serialize and cache the list of compute capabilities and the original ONNX model itself ## In `ComputeCapability()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key ## In `Compile()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key # EP context model creation - Precondition Session option configuration `kOrtSessionOptionEpContextEnable` (aka "ep.context_enable") is enabled. - Approach 1 - Steps 1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext"). 2. EP implements/overrides `IExecutionProvider::GetEpContextNodes()` method. 3. ONNXRT core creates an EP context model and saves/dumps it. - `CreateEpContextModel()` in the file "graph_partitioner.cc" - In `get_ep_context_node()`, `Node::Name()` is used to check whether a node is an EP context node. This limits that EP model creation can only happen in `IExecutionProvider::Compile()`. - The workaround is (1) not implementing `IExecutionProvider::GetEpContextNodes()` and (2) dumping the EP context model by EP itself. 4. Optionally, EP can also dump the EP context model it created by iteself. 
- Examples - `QNNExecutionProvider` - `VitisAIExecutionProvider` - Approach 2 - Steps 1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext"). 2. EP does NOT implement `IExecutionProvider::GetEpContextNodes()` at all. 3. EP dumps the EP context model it created. - Examples - `TensorrtExecutionProvider` - UPDATES - TRT EP is switching to leveraging `IExecutionProvider::GetEpContextNodes()` - `OpenVINOExecutionProvider` (?) # What to cache in EP context nodes - Non Compilation based EPs - Examples - `VitisAIExecutionProvider` - Characteristics - Heavy lifting work happens in `IExecutionProvider::GetCapability()`. - Preconditions - `IExecutionProvider::GetCapability()` is only called once by ONNXRT. - Cache content - Serialization of a list of `ComputeCapability` - Not EP-specific - Serialized using `onnx::FunctionProto` - EP-specific cache - Compilation based EPs - Examples - `QNNExecutionProvider` - `TensorrtExecutionProvider` - `MIGraphXExecutionProvider` - `OpenVINOExecutionProvider` - Cache content - EP-specific cache # Requirements - Offline / AOT compilation of ONNX models with EP context cache - Compile somewhere, run everywhere - Pseudo code with brief explanation ``` GenerateCache(original_onnx_file, cache_onnx_file) model_buffer = load(original_onnx_file) --> Load the original ONNX model file model_buffer = decrypt(model_buffer) session_options = { kOrtSessionOptionEpContextEnable: true, kOrtSessionOptionEpContextFilePath: temp_file } --> Set necessary configs Ort::CreateSessionFromArray(model_buffer, session_options) --> The new ONNX model with EP context is created and dumped into the user specified file "temp_file" temp_buffer = encrypt(temp_file) write(temp_buffer, cache_onnx_file) --> Write the encypted context of "temp_file" into the "cache_onnx_file" file InitializeInferenceSession(cache_onnx_file) model_buffer = load(cache_onnx_file) --> Load the ONNX model with EP context from the file generated in the 
previous step model_buffer = decrypt(model_buffer) session_options = { } Ort::CreateSessionFromArray(model_buffer, session_options) --> Create and initalize an session with the EP context model ``` - Python code with comments - EP context model creation ```python import onnxruntime as onnxrt # Session options for creating an ONNX model with EP context cache. sess_opts = onnxrt.SessionOptions() # Verbose. sess_opts.log_severity_level = 0 # This is REQUIRED. sess_opts.add_session_config_entry("ep.context_enable", "1") # This is OPTIONAL. # Either an absolute path (preferred for now) or a relative path (WIP) is okay. # sess_opts.add_session_config_entry("ep.context_file_path", "/some/path/to/original_model_ctx.onnx") # This is OPTIONAL. sess_opts.add_session_config_entry("ep.context_embed_mode", "1") orig_model_location = "/some/path/to/original_model.onnx" sess = onnxrt.InferenceSession(orig_model_location, sess_opts, providers=["VitisAIExecutionProvider"], provider_options=[]) ``` - Inference run with an EP context model ```python import onnxruntime as onnxrt # Session options for creating an ONNX model with EP context cache. sess_opts = onnxrt.SessionOptions() # Default EP context model path. # ep_ctx_model_location = "/some/path/to/origina_model.onnx_ctx.onnx" # User configured EP context model path. ep_ctx_model_location = "/some/path/to/origina_model_ctx.onnx" sess = onnxrt.InferenceSession(ep_ctx_model_location, sess_opts, providers=["VitisAIExecutionProvider"], provider_options=[]) model_inputs = {} run_opts = onnxrt.RunOptions() # Verbose. run_opts.log_severity_level = 1 sess.run(None, model_inputs, run_opts) ``` --------- Co-authored-by: Glen Cao <glen@Glens-MacBook-Air.local>	2024-07-12 21:22:58 -07:00
Xu Xing	92a8407b39	[js/webgpu] Remove unnecessary initialization of var (#21312 ) This var has already been initialized to 0 in tint, so there is no need for an extra loop to do it again: ``` float tint_symbol_52[1][4] = (float[1][4])0; { for(int tint_symbol_53 = 0; (tint_symbol_53 < 1); tint_symbol_53 = (tint_symbol_53 + 1)) { { for(int tint_symbol_54 = 0; (tint_symbol_54 < 4); tint_symbol_54 = (tint_symbol_54 + 1)) { tint_symbol_52[min(uint(tint_symbol_53), 0u)][min(uint(tint_symbol_54), 3u)] = 0.0f; } } } } ```	2024-07-12 12:34:34 -07:00
Yi Zhang	f2ebd1cd6b	[Fix] Exception in iosDynamicFramework Post-Merge workflow (#21262 ) ### Description the exception was caused by `3dd6fcc089` Why I add skip_macos_test because there's new an exception in https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1425579&view=logs&j=c90c5af3-67d5-5936-5a62-71c93ebfca65&t=01038f35-8e78-5801-1aa1-d9647bb65858 ``` 2024-07-05T14:41:09.3864740Z mkdir -p /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest/Contents/Frameworks 2024-07-05T14:41:09.3933430Z mkdir: /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest: Operation not permitted 2024-07-05T14:41:09.3996760Z /var/folders/0f/b0mzpg5d31z074x3z5lzkdxc0000gn/T/tmp97ycvwq5/apple_package_test/Pods/Target Support Files/Pods-macos_package_testUITests/Pods-macos_package_testUITests-frameworks.sh: line 7: realpath: command not found 2024-07-05T14:41:09.4003170Z :18: error: Unexpected failure 2024-07-05T14:41:11.1323470Z error: Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest (in target 'macos_package_testUITests' from project 'apple_package_test') 2024-07-05T14:41:11.1325620Z 2024-07-05T14:41:11.8731110Z 2024-07-05T14:41:11.8733040Z Test session results, code coverage, and logs: 2024-07-05T14:41:11.8734820Z /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Logs/Test/Test-macos_package_test-2024.07.05_14-40-38-+0000.xcresult 2024-07-05T14:41:11.8735530Z 2024-07-05T14:41:11.8906210Z Testing failed: 2024-07-05T14:41:11.8911060Z Sandbox: mkdir(72212) deny(1) file-write-create 
/Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest 2024-07-05T14:41:11.8912570Z Unexpected failure 2024-07-05T14:41:11.8913690Z Testing cancelled because the build failed. 2024-07-05T14:41:11.8914380Z 2024-07-05T14:41:11.8914970Z TEST FAILED 2024-07-05T14:41:11.8915480Z 2024-07-05T14:41:11.8915780Z 2024-07-05T14:41:11.8916750Z The following build commands failed: 2024-07-05T14:41:11.8919280Z PhaseScriptExecution [CP]\ Embed\ Pods\ Frameworks /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Intermediates.noindex/apple_package_test.build/Debug/macos_package_testUITests.build/Script-059136A7770CA5376C30F2FD.sh (in target 'macos_package_testUITests' from project 'apple_package_test') 2024-07-05T14:41:11.8922180Z (1 failure) ``` And I find macos test is skipped in `9ef28f092f/tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml (L119-L127)` as well. Maybe it is an known issue.	2024-07-12 09:24:12 -07:00
Ted Themistokleous	4ac4cd2668	Migraphx ep windows build (#21284 ) ### Description Repeat of #21084 with removal of policy CMP0144 to suppress warnings which uses CMake 3.27.0. ### Motivation and Context Already approved PR: https://github.com/microsoft/onnxruntime/pull/21084 Removed the added policy from CMake 3.27.0.	2024-07-11 21:21:38 -07:00
mingyueliuh	42b7cedb06	[VitisAI] custom op support multiple outputs (#21280 ) ### Description The implementation inside the EP requires registering some custom ops that are only used in the model compilation phase. Currently only a single output is supported. ### Motivation and Context Requirements now call for multiple outputs, so the shape inference of the EP custom op needs to be extended to support multiple outputs --------- Co-authored-by: liumingyue <mingyue@xilinx.com> Co-authored-by: mingyue <mingyue@amd.com>	2024-07-11 16:04:18 -07:00
Qingnan Duan	80b56feb41	Implement FlashAttention for CPU (#20805 ) ### Description Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and [FlashAttention-2](https://arxiv.org/pdf/2307.08691) for MultiHeadAttention on CPU. ### Motivation and Context Accelerate the execution of MultiHeadAttention. Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on my Windows machine. May need further optimizations. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Qingnan Duan <qiduan@microsoft.com>	2024-07-11 14:19:59 -07:00
Edward Chen	33e7c7f6ec	Enable Android CI build stages to run in parallel. (#21314 ) Enable Android CI build stages to run in parallel to possibly reduce total build time.	2024-07-11 10:09:09 -07:00
Yi Zhang	41ea47be1e	Move QNN nuget package stages out of the big Nuget packaging pipeline. (#21306 ) ### Description 1. remove QNN stages from the big packaging pipeline 2. Add publish nightly package in the current [QNN Nuget pipeline](https://dev.azure.com/aiinfra/Lotus/_builddefinitionId=1234]) ### Motivation and Context Reduce the complexity of the big Nuget packaging pipelines. --------- Co-authored-by: Yi Zhang <your@email.com>	2024-07-11 09:07:23 -07:00
pengwa	88336ffa92	Fix typos - 1st Wave (#21278 ) ### Description The reviewdog [Optional Lint] actions report many typos (example: https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367); this PR fixes some of them. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-07-11 13:35:08 +08:00
pengwa	0a1178add9	Fix lint C++ actions (#21303 ) ### Description `83e0c6b96e` is the last commit on which the Lint C++ actions pass. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/96bf005e-5815-46d0-ac17-c6094200957c) `4a7eaff1d9` is the first commit on which they fail. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/72a9271e-7b4b-40f8-83a5-f28b82c5e726) Reviewdog/action-cpplint@master has changed since that day: https://github.com/reviewdog/action-cpplint/pull/42/files makes action-cpplint start using reviewdog release https://github.com/reviewdog/reviewdog/releases/tag/v0.19.0. Optional Lint also failed with many typos, which is probably related to the same cause. Let's fix that in separate PRs.	2024-07-11 09:46:41 +08:00
Changming Sun	fe6ef404b5	Enable LTO for Android build (#21243 ) ### Description Enable LTO for Android build, which can reduce binary size by 6%.	2024-07-10 18:44:17 -07:00
Sheil Kumar	28af544278	[DirectML] Broadcast NC-dims for Tensors A&B in DynamicQuantizeMatMul (#21298 ) ### Description [DirectML] Broadcast NC-dims for Tensors A&B in DynamicQuantizeMatMul The DynamicQuantizeMatMul allows input tensors in NCHW format, and DirectML requires that input tensors share the same batch and channel dimensions. Tensors A and B should be broadcast (if possible) to the corresponding output NC dims. ### Motivation and Context Certain models which use DynamicQuantizeMatMul hit a crash when the NC dims are intended to be broadcast. --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2024-07-10 17:35:47 -07:00
Edward Chen	20cd3394fc	[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation (#21193 ) Update AArch64 SQNBitGemm CompInt8 kernels to process matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B. Also moved some code around as it was getting big for a single file.	2024-07-10 15:39:26 -07:00
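The 2x2 tiling idea in the SQNBitGemm commit above can be sketched in plain Python. This is only an illustration of the data-reuse pattern, not the MLAS kernel: it ignores quantization and NEON intrinsics, and `matmul_tiled_2x2` is a hypothetical name.

```python
def matmul_tiled_2x2(a, b):
    """Multiply a (M x K) by b (K x N), given as lists of lists, in 2x2 output tiles."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            # Edge tiles may be narrower than 2 when M or N is odd.
            rows = range(i, min(i + 2, m))
            cols = range(j, min(j + 2, n))
            for p in range(k):
                # Read up to two values of A and two of B once...
                a_vals = [a[r][p] for r in rows]
                b_vals = [b[p][col] for col in cols]
                # ...and reuse them for up to four output elements.
                for ri, r in enumerate(rows):
                    for ci, col in enumerate(cols):
                        c[r][col] += a_vals[ri] * b_vals[ci]
    return c
```

Without tiling, each output element would read its own row of A and column of B; here one pass over two rows and two columns feeds four outputs.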
Changming Sun	8749fa381e	Update absl (#21300 ) ### Description Our macOS pipelines are failing because of a build error in absl. However, the bug fix we need is not available in the latest ABSL release. Here is the issue: https://github.com/abseil/abseil-cpp/pull/1536 And here is the fix: `779a3565ac` GTest uses ABSL, but this ABSL target also depends on GTest, so it is a circular dependency. We should be able to avoid that by not building tests for ABSL. However, the version we are using has a problem with that: it has a CMake target that still depends on GTest even when testing is disabled. It's strange that we suddenly hit this problem and that it only happens on macOS.	2024-07-10 11:14:15 -07:00
Adrian Lizarraga	5753f8da8c	[QNN EP] Initial INT4 support (#21171 ) ### Description - Adds support for int4 quantized weights (per-tensor and per-channel) on QNN EP - Adds test script that creates an INT4 qdq model with a Conv - Adds unit tests demonstrating accuracy issues. ### Motivation and Context This is the next step in being able to run models that use 4-bit quantized weights on QNN EP.	2024-07-10 10:03:53 -07:00
Pavan Goyal	1b82d835d8	[Fix] InterOpNumThreads Session Option for ONNX ReactNative Package (#21263 ) ### Description This PR resolves a bug related to setting the interOpNumThreads session option when creating an ORTSession. Currently, when the interOpNumThreads option is passed from React Native, the native module incorrectly sets intraOpNumThreads instead of interOpNumThreads. ### Motivation and Context Because of this bug, users of the ONNX React Native package may believe they are setting interOpNumThreads correctly when they are not, so this change is required. Refer to the code snippet below for details <img width="634" alt="Screenshot 2024-07-05 at 9 28 58 PM" src="https://github.com/microsoft/onnxruntime/assets/88655321/70a8f216-553a-4f4c-9481-e6871f0e37e6">	2024-07-10 07:00:18 -07:00
Ștefan Talpalaru	1b19045afa	[build] allow MPI on Unix when NCCL is disabled (#21175 ) ### Description CMake logic fixed to allow enabling MPI while NCCL is disabled. ### Motivation and Context MPI is also used on the CPU backend, not only with CUDA, so it makes sense to decouple it properly from NCCL (which is for dealing with multiple Nvidia GPUs).	2024-07-09 21:21:40 -07:00
Hann Wang	d28c26a919	[ROCm] fix: obtain AMD GPU memory info through rocm_smi library (#21190 ) ### Description Previously, ROCMExecutionProvider used `hipMemGetInfo` to obtain the sizes of total memory and available memory. However, this API has been broken since ROCm 5.7. In this PR, we use the `rocm_smi` library instead of `hipMemGetInfo`. ### Motivation and Context The `hipMemGetInfo` API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider leads to the following error: ``` HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total); ``` MIOpen has a brute-force fix for this (`911e671895/src/hip/handlehip.cpp (L72)`). Instead of hard-coding available memory to 16GB, I suppose we could obtain memory info through the `rocm_smi` library as in this PR.	2024-07-09 20:35:26 -07:00
Chen Feiyue	fffd430091	[VSINPU]Code improvement && Slice/Dropout OP support (#21217 ) ### Description - Refactor code to meet the line length limit and guard against missing warnings - Add slice/dropout op support - Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a separate file - Modify APIs that take an onnxruntime::Path parameter, because this type was replaced by std::filesystem::path in #20920	2024-07-09 20:14:46 -07:00
Maximilian Müller	cc0de0d526	[Build] Propagate build option for CUDA minimal to TRT (#20695 ) ### Description Extend the CUDA minimal option to the TRT provider, as with TRT 10 linking to cuDNN is no longer required. Besides that, with the new engine dump feature it is also possible to embed an engine into an ONNX model and not ship a builder lib. In addition, this has roughly the same deserialization time/session setup time as using TRT standalone. ### Motivation and Context ``` exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable\|1 trt_timing_cache_enable\|1 trt_dump_ep_context_model\|1 trt_weightless_engine_enable\|1' model.onnx exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable\|1 trt_timing_cache_enable\|1 trt_dump_ep_context_model\|1 trt_weightless_engine_enable\|1' model_ctx.onnx ```	2024-07-09 14:40:04 -07:00
Edward Chen	307b34a820	[NNAPI EP] Track skipped initializer usage (#21286 ) Track skipped initializer usage in NNAPI EP to account for usage by other nodes.	2024-07-09 13:43:22 -07:00
Xiang Zhang	1ab162fbca	Fix ETW Sink Initialize unproperly locking (#21226 ) ### Description The ETW trace logger is falsely treated as registered because `initialized_` is set to true before the registration is done, which causes a crash in the Lenovo camera application. [Bug 42610244](https://microsoft.visualstudio.com/OS/_workitems/edit/42610244): [Watson Failure] caused by SVCHOSTGROUP_Camera_INVALID_POINTER_READ_c0000005_onnxruntime.dll!onnxruntime::logging::Logger::Log	2024-07-09 10:55:41 -07:00
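The ordering problem described in the ETW commit above, where a flag is marked initialized before registration completes, can be illustrated with a generic sketch. This is not the onnxruntime ETW sink: `LazySink` and its `register` callback are hypothetical, and the sketch only shows the general pattern of setting the flag last, under a lock.

```python
import threading


class LazySink:
    """Hypothetical sink that registers a provider exactly once, on first use."""

    def __init__(self, register):
        self._register = register      # callable that performs the registration
        self._lock = threading.Lock()
        self._initialized = False
        self.handle = None

    def ensure_initialized(self):
        if self._initialized:          # fast path once setup has finished
            return
        with self._lock:
            if self._initialized:      # another thread finished while we waited
                return
            self.handle = self._register()
            # Set the flag only after registration is done, so no caller can
            # observe "initialized" while the handle is still missing.
            self._initialized = True
```

If the flag were set before `self._register()` returned, a second thread could take the fast path and use `handle` while it is still `None`.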
Jian Chen	d1c19e79ea	Update OpenVino CI Ubuntu to 22.04 (#21127 ) ### Description [Update OpenVino CI Ubuntu to 22.04](`312fab5b3f`) ### Motivation and Context Ubuntu 22.04 is needed for linux C++20	2024-07-09 09:56:44 -07:00
Wanming Lin	eeb8fc0931	[WebNN EP] Release WebNN MLGraphBuilder after Compile to free memory (#21200 ) This would help release the constants bound by the MLGraphBuilder.	2024-07-09 08:49:58 -07:00
Changming Sun	2c53b4a534	Remove core/common/gsl.h (#20894 ) ### Description It might be easier if we just directly include the original gsl headers. "core/common/gsl.h" is an indirection that doesn't provide extra help.	2024-07-08 18:09:39 -07:00
Enrico Galli	4c3c809bdb	[js/webnn] Enable user-supplied MLContext (#20600 ) ### Description This PR enables the API added in #20816 as well as moving context creation to JS. ### Motivation and Context In order to enable I/O Binding with the upcoming [MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API in the WebNN specification, we need to share the same `MLContext` across multiple sessions. This is because `MLBuffer`s are restricted to the `MLContext` where they were created. This PR enables developers to use the same `MLContext` across multiple sessions.	2024-07-08 10:19:39 -07:00
Wanming Lin	cd516a1677	[WebNN EP] Remove constraint for conv ops on CPU backend (#21237 ) Currently the WebNN TFLite backend allows the filter of conv2d/convTranspose2d to be an input. Remove the constraint and apply the necessary transpose/reshape operations to the filter input.	2024-07-08 10:14:43 -07:00
zz002	4a7eaff1d9	[vitisai] Fix build failure introduced by #20920 (#21247 ) ### Description Fix build failure introduced by #20920	2024-07-08 05:44:30 -07:00
Jing Fang	83e0c6b96e	Add MatMulNBits shape infer to SymbolicShapeInference (#21246 ) ### Description Support MatMulNBits shape infer in SymbolicShapeInference MatMulNBits's B input is rank-2, so implicit merge does not apply. ### Motivation and Context [Issue with performing shape inference using symbolic_shape_infer.py with Phi-3 ONNX Models · Issue #21194 · microsoft/onnxruntime (github.com)](https://github.com/microsoft/onnxruntime/issues/21194)	2024-07-05 16:24:57 -07:00
KnightYao	9ef28f092f	[Fix Bug] Fp8Fp8 Run Error (#20911 ) Fix the run error in fp8fp8 when input A is e5m2 and input B is e4m3.	2024-07-05 17:11:59 +02:00
pengwa	3f6b7430d6	Use cuda memset async (#21216 )	2024-07-05 17:27:45 +08:00
Baiju Meswani	0bbd061a54	Exclude azure ep from gen_def.cc (#21250 ) Addresses python packaging pipeline failure.	2024-07-04 10:50:27 -07:00
Changming Sun	07c429191e	Delete path.h (#21211 ) ### Description Delete path.h and replace all occurrences of onnxruntime::Path with std::filesystem::path. Previously we couldn't use C++17's std::filesystem because it was not supported in iOS 12 (which was released in 2018). We have now dropped support for iOS 12. ### Motivation and Context To simplify code. For example, if an EP wants to use the Path class, it can now use it directly without going through a wrapper. The standard implementation also handles various path types better. (We didn't give much consideration to UNC paths, "/" as a path separator on Windows, etc.)	2024-07-04 15:54:13 +08:00

11997 commits