onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-27 03:11:28 +00:00

Author	SHA1	Message	Date
Ye Wang	aaf32fb1b1	phi2 conversion/optimization script (#19338 ) ### Description <!-- Describe your changes. --> This PR adds onnx conversion script for dynamo exported phi2, optimization script, and inference example script A readme file is added as documentation. https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-02-05 10:15:16 -08:00
Scott McKay	debd1cab10	Add coremltools 7.1 as a dependency (#19389 ) ### Description <!-- Describe your changes. --> Setup usage of coremltools via dependencies instead of copying files. Pull in some changes from https://github.com/microsoft/onnxruntime/pull/19347 in preparation for supporting ML Program and enabling building the ML Model on all platforms to make development and testing of CoreML EP code easier. - Update to coremltools 7.1 - Add patch for changes required for cross platform build of ML Program related code - Generate coreml proto files on all platforms - mainly to test these changes work everywhere, as the proto files will be used on all platforms when #19347 is checked in - rename onnxruntime_coreml_proto target to coreml_proto as it contains purely coreml protobuf code with no ORT related chagnes ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve setup.	2024-02-03 09:42:21 +10:00
He Li	1bdd7d9499	Update oneDNN to v3.0.1 in order to support gcc 13 (#19344 ) ### Description Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug hindering gcc 13. ### Motivation and Context Referring to [oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548). - When building with `--use_dnnl` using gcc 13.x, it will fail due to this upstream issue. - This is fixed in `v3.0.1` [tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this commit](`1d7971ce48`).	2024-02-01 15:39:03 -08:00
Yueqing Zhang	1d6f13fb92	[VitisAI] Refactor the VAIEP to use MSFT's standalone API (#19058 ) ### Description <!-- Describe your changes. --> Refactor the VAIEP to use MSFT's standalone API ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs in order to decouple the EP from onnxruntime.dll and the providers.dll. This will help to simplify customer deployment of applications and use cases that need to share their onnxruntime.dll with other applications. --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com> Co-authored-by: zz002 <zhenze.wang@amd.com>	2024-01-31 21:08:26 -08:00
Yi-Hong Lyu	55b60d8fe0	Turn off Neural Speed to avoid slowdowns (#19265 ) Disable Neural Speed to prevent the operation following MatMulNBits from significantly slowing down.	2024-01-31 13:40:25 -08:00
Phoebe Chen	2b361c04d6	Fix Flatbuffer build issue. (#19296 ) ### Description Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux. This commit addresses the flatbuffer build issue with the following changes: 1. Remove the Werror flag in the flarbuffer patch. 2. Add a compilation option to suppress the 'stringop-overflow' error in the Flatbuffers within the xnnpack provider. ### Motivation and Context https://github.com/google/flatbuffers/issues/8119 https://github.com/microsoft/onnxruntime/pull/19239 Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-31 10:12:43 -08:00
Changming Sun	8dad9d92f4	Move einsum's test data to constexpr variables (#19320 ) ### Description emscripten's C++ compiler has difficulty on compiling einsum_test.cc because the file has too many local variables. So I moved them to constexpr.	2024-01-30 15:59:37 -08:00
Changming Sun	a92802f940	Disable a few tests for wasm build (#19316 )	2024-01-30 08:16:57 -08:00
Tianlei Wu	8b4517218b	Remove USE_CUTLASS flag (#19271 ) ### Description Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for onnxruntime CUDA build), there is no need to have a flag to disable cutlass. Changes: (1) Reverted https://github.com/microsoft/onnxruntime/pull/18761 (2) remove the condition to build cutlass. (3) Fix a few build errors or warnings during testing CUDA 11.4 build. Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later. Flash attention and cutlass fused multihead attention will not be built for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if you want to support latest GPUs. It is better to include it in 1.17.0 (otherwise, the release branch might encounter build failure with CUDA 11.4). Tests: (1) Build with flash attention and efficient attention off: passed (2) Build with CUDA 11.4: passed Example build command used in Ubuntu 20.04: ``` export CUDA_HOME=/usr/local/cuda-11.4 export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/ export CUDACXX=/usr/local/cuda-11.4/bin/nvcc sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \ --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \ --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \ --disable_types float8 ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-25 16:57:58 -08:00
PeixuanZuo	1c92e56dc0	[Cuda] Refactor GroupNorm (#19146 ) Split GroupNorm implementation into multiple files, to make ROCm EP can reuse cuda code. Related PR: https://github.com/microsoft/onnxruntime/pull/19158 --------- Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2024-01-25 22:28:47 +08:00
Phoebe Chen	4477f57ee3	Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238 ) ### Description This pull request introduces the necessary changes to enable RISC-V 64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V architecture has gained popularity as an open standard instruction set architecture, and this contribution aims to extend ONNX Runtime's compatibility to include RISC-V, thereby broadening the reach of ONNX models to a wider range of devices. ### Motivation and Context RISC-V is a free and open-source instruction set architecture (ISA) based on established RISC principles. It is provided under open licenses without fees. Due to its extensibility and freedom in both software and hardware, RISC-V is poised for widespread adoption in the future, especially in applications related to AI, parallel computing, and data centers. ### Example Build Command ``` ./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests ``` ### Documentation Updates Relevant sections of the documentation will be updated to reflect the newly supported RISC-V 64-bit cross-compilation feature. https://github.com/microsoft/onnxruntime/pull/19239 --------- Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-24 16:27:05 -08:00
Changming Sun	bc54ad3f03	Update abseil to a release tag and register neural_speed (#19255 ) ### Description Update abseil to a release tag and register neural_speed to CG. ### Motivation and Context Now we are using a non-relesed version of abseil. Using a tag is better.	2024-01-24 14:37:39 -08:00
Jeff Daily	b2aec41a83	[ROCm] enable hipGraph (#18382 ) This ports the cudaGraph support from the CUDA EP to the ROCM EP's hipGraph.	2024-01-23 11:17:04 +08:00
snadampal	77da2ef278	[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031 ) ### Description This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to implement matrix multiplication with bfloat16 SIMD instructions (bfmmla) and MatMul operator changes to invoke the Sbgemm kernel. To enable Sbgemm kernel, set the following session option: "kOrtSessionOptionsGemmFastMathMode" The PR also adds new test cases for mlas and ort. ### Motivation and Context This is to improve MatMul performance on aarch64 platform. I have run the below benchmarking script (bert , roberta and gpt2 model inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x -1.76x performance improvement compared to sgemm (fp32) kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` And the unit test precision results are matching to sgemm kernel results. `./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync `	2024-01-22 14:43:06 -08:00
Edward Chen	c8ce83967e	Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209 )	2024-01-19 15:30:09 -08:00
luoyu-intel	459c750b03	Update x64 template kernel library for 'sqnbitgemm' (#19016 ) ### Description <!-- Describe your changes. --> 1. Make JBLAS codes an external module of ORT. 2. Move q4 gemm code to contrib_ops. 3. Update template kernel library to v0.1 release. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> We found that the current LLM model performance is far below our expectations. Here is some performance data collected on Mistral-7B model with Xeon-8480: 8 threads \| prompt length=32 past_len=32 \| prompt length=1 past_len=32 -- \| -- \| -- ORT-main \| 1220ms \| 263ms Neural-speed \| 564ms \| 87ms ORT-this PR\|597ms\|120ms Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code, there is a 33ms(87ms vs. 120ms) latency gap between the two frameworks. Through some statistics analysis, the summary latency of `MatMulNBits` is 86.7ms The summary latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So other OPs introduce an extra 30ms latency. The performance of MatMulNBits in this PR meets our expectations. ### Remain Issues 1. For hybrid CPUs, like core 12900K, the ONNXRuntime thread pool uses TaskGranularityFactor to scale its number of threads. This is not expected in our code design. It may slow down the hybrid CPU performance by 30~40%. 2. Prepack uses a single thread which is very slow to init a session. 3. MatMulNBits with zero points will fall through to COMP_FP32 even accuracy_level=4. Our COMP_INT8 IGemmCore with zero points process is not optimized for now. It will be updated in the future. So, for an int4 model with zero points, whether the accuracy_level is 0 or 4 will be no difference.	2024-01-18 13:16:34 -08:00
Maximilian Müller	bc219ed553	[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052 ) Adresses https://github.com/microsoft/onnxruntime/issues/18542. I followed the advice given by @RyanUnderhill [here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925) and went with a minimal CUDA EP for now.	2024-01-17 11:33:34 -08:00
Wanming Lin	07d3aed3aa	[WebNN EP] Fixed build issue with disable_rtti (#19173 ) Previously building webnn ep with --disable_rtti will throw unboundTypeError since unbound type names are illegal with RTTI disabled in Embind API, we can fix it by adding a -DEMSCRIPTEN_HAS_UNBOUND_TYPE_NAMES=0 flag.	2024-01-16 21:35:13 -08:00
Changming Sun	e2e488d6f8	Revert "iOS packaging pipeline stability" (#19135 ) Reverts microsoft/onnxruntime#19097 because it broken Android CI pipeline.	2024-01-16 09:18:35 -08:00
Jeff Bloomfield	8d4369b77e	Update DirectML nuget version to 1.13.1 (#19122 ) ### Description Update DML version to 1.13.1 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-15 19:04:41 -08:00
Edward Chen	e1e45901e2	iOS packaging pipeline stability (#19097 ) - Remove protoc build step which sometimes times out. Download protoc instead. - Use macOS-12 image in the set variables stage. It seems more stable.	2024-01-13 19:27:44 -08:00
Edward Chen	150c4cb8fe	[MLAS AArch64] SQNBitGemm CompInt8 kernel (#18953 ) Implement ARM NEON SQNBitGemm kernel that first block quantizes A to int8 and then does int8 multiplication.	2024-01-12 17:58:08 -08:00
Guenther Schmuelling	96dbac6e4b	update to emsdk-3.1.51 (#18844 )	2024-01-12 16:04:33 -08:00
Preetha Veeramalai	c340bf08f6	Openvino EP code changes for 1.17 update (#19023 ) ### Description Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV 2023.3. ### Context - The API is added to facilitate customers in using published official Microsoft onnxruntime libraries with OVEP libraries. - Add support for OpenVINO 2023.3 official release. - Extend operator coverage - GH fixes --------- Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>	2024-01-12 13:20:51 -08:00
Changming Sun	0e8d4c3d21	Enable Address Sanitizer in CI (#19073 ) ### Description 1. Add two build jobs for enabling Address Sanitizer in CI. One for Windows CPU, One for Linux CPU. 2. Set default compiler flags/linker flags in build.py for normal Windows/Linux/MacOS build. This can help control compiler flags in a more centralized way. 3. All Windows binaries in our official packages will be built with "/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft public symbol server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols). Limitations: 1. On Linux Address Sanitizer ignores RPATH settings in ELF binaries. Therefore once Address Sanitizer is enabled, before running tests we need to manually set LD_LIBRARY_PATH properly otherwise libonnxruntime.so may not be able to find custom ops and shared EPs. 4. On Linux we also need to set LD_PRELOAD before running some tests(if the main executable, like python, is not built with address sanitizer. On Windows we do not need to. 5. On Windows before running python tests we should manually copy address sanitizer DLL to the onnxruntime/capi directory, because python 3.8 and above has enabled "Safe DLL Search Mode" that wouldn't use the information provided by PATH env. 6. On Linux Address Sanitizer found a lot of memory leaks from our python binding code. Therefore right now we cannot enable Address Sanitizer when building ONNX Runtime with python binding. 7. Address Sanitizer itself uses a lot of memory address space and delays memory deallocations, which is easy to cause OOM issues in 32-bit applications. We cannot run all the tests in onnxruntime_test_all in 32-bit mode with Address Sanitizer due to this reason. However, we still can run individual tests in such a way. We just cannot run all of them in one process. ### Motivation and Context To catch memory issues.	2024-01-12 07:24:40 -08:00
PeixuanZuo	5f3113ecd6	[ROCm] Fix hipify error: fast_divmod.h: No such file or directory (#19060 ) Fix error: ``` [ 48%] Built target onnxruntime_optimizer In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5: /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory 11 \| #include "core/providers/rocm/shared_inc/fast_divmod.h" \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ compilation terminated. ``` This error is due to onnxruntime_optimizer missing dependencies on hipify generated files.	2024-01-10 14:49:19 +08:00
Milos Puzovic	37ac9d391c	Enable Arm Compute Library 23.08 (#17672 ) ### Description This PR enables onnxruntime to build with the most recent release of Arm Compute Library ### Motivation and Context The latest version of Arm Compute Library that onnxruntime can build is 20.02 which is more than 3 years old.	2024-01-09 14:10:25 -08:00
Changming Sun	68c29ece23	In a Linux or Android build check if the compiler support bfloat16 and float16 (#18813 ) ### Description Restrict clang version because we have an upcoming change that requires clang version >=16 , which will mainly affect Android build.	2024-01-08 19:46:33 -08:00
Jeff Bloomfield	55a669409a	Merge pull request #18983 from microsoft/WindowsAI Merge WindowsAI to main	2024-01-04 17:21:19 -08:00
Wei-Sheng Chin	658e30eb33	Remove DORT since it's in PyTorch main now (#18996 ) Main code are removed and tests are modified to use DORT directly from PyTorch.	2024-01-04 12:59:47 -08:00
Xavier Dupré	889b1ef2d1	Fix schema type constraint for custom operators (#17497 ) ### Description onnxruntime may raise an error "type inference failed" but when a custom operator sets IsHomogeneous to false in its schema. This change make sure that TypeInferenceFunction and schema type constraints are aligned to prevent that from happening. --------- Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2024-01-04 20:27:46 +01:00
Yulong Wang	b18abaaa2c	[js/web] wait for threadpool initialization (#18952 ) ### Description a replacement of #18683. try to resolve #18689. By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the threadpool to initialize before the webassembly instance is available.	2024-01-04 08:06:55 -08:00
Sheil Kumar	107d7492b9	[DirectML EP] Add DML EP registration for Col2Im (#17786 ) ### Description [DirectML EP] Add DML EP registration for Col2Im operator ### Motivation and Context Add Col2Im support for opset 18. This operator is implemented as the DirectML Fold operator. --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-01-03 16:13:14 -08:00
Jeff Bloomfield	c3d96a7b35	Update DML version to 1.13.0 (#18978 ) Update DML nuget version to 1.13.0	2024-01-03 16:09:55 -08:00
Sheil Kumar	dbb8680bdc	Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll (#18913 ) Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-12-26 12:33:42 -08:00
pengwa	37f743680a	Fix build when flash attention and memory efficient attention are disabled (#18761 ) ### Fix build when flash attention and memory efficient attention are disabled On a customer env with lower version of CUDA < 11.6. Both flash attention and memory efficient attention is turned OFF according to `e8f33b54ba/cmake/CMakeLists.txt (L701)`. So `e8f33b54ba/cmake/external/cutlass.cmake (L1)` condition check return false. No cutlass lib is built. ``` Turn off flash attention since CUDA compiler version < 11.6 ``` While, the kernels in https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/moe/ft_moe are depending on cutass for its build, so we get error like this: ``` [ 77%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. fatal : Could not open input file /tmp/tmpxft_00044da3_00000000-11_moe_gemm_kernels_fp16_fp16.compute_60.cpp1.ii make[2]: * [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:6290: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o] Error 1 make[2]: * Waiting for unfinished jobs.... make[1]: * [CMakeFiles/Makefile2:2210: CMakeFiles/onnxruntime_providers_cuda.dir/all] Error 2 make: * [Makefile:166: all] Error 2 Traceback (most recent call last): File "/tmp/onnxruntime/tools/ci_build/build.py", line 2746, in <module> sys.exit(main()) File "/tmp/onnxruntime/tools/ci_build/build.py", line 2639, in main build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target) File "/tmp/onnxruntime/tools/ci_build/build.py", line 1527, in build_targets run_subprocess(cmd_args, env=env) File "/tmp/onnxruntime/tools/ci_build/build.py", line 824, in run_subprocess return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env) File "/tmp/onnxruntime/tools/python/util/run.py", line 49, in run completed_process = subprocess.run( File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, ``` ### Motivation and Context To summarize, there are two cases we will have build failure for Linux CUDA build: 1. User use cuda version < 11.6 2. User disabled Flash attention and memory efficient attention explictly with onnxruntime_USE_FLASH_ATTENTION and onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION	2023-12-26 08:57:58 +08:00
luoyu-intel	5f00bc9931	Integrate high-performance x64 gemm library to MLAS (#17669 ) ### Description Improve MLAS to support high-performance x64 INT4 kernels ### Motivation and Context 1. improve LLM inference performance on Intel CPUs. 2. support more 4bit quantization types: nf4, fp4 3. support dynamic block size: block size aligned with kernel's tiling size(e.g. 4 for VNNI kernel), per channel on N dimension 4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16 5. support MatMulNBits' data format ### Tasks - [x] support block_size: 32, 128, -1(per channel) - [x] get weight pack size without memory allocation - [x] use ort's thread pool for parallelism - [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8 ### Benchmark Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 47613 \| 47401 \| 12970 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 6347792 \| 6317562 \| 109 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 11814014 \| 11757847 \| 59 Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 50222 \| 50031 \| 13759 Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 2038222 \| 2028743 \| 341 Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 3792832 \| 3774485 \| 191 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 58717 \| 58501 \| 11467 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 1360846 \| 1354598 \| 543 Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 2564232 \| 2551365 \| 266 Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 57929 \| 57694 \| 12047 Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5495330 \| 5465810 \| 126 Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10676240 \| 10617817 \| 66 Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 68305 \| 68047 \| 10026 Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5504862 \| 5476215 \| 126 Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 11758623 \| 11697337 \| 66 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 67713 \| 67451 \| 10298 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5508325 \| 5480237 \| 126 Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10738528 \| 10681656 \| 64 Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 60708 \| 60486 \| 11321 Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5523784 \| 5495736 \| 126 Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10829633 \| 10772161 \| 67 Reference: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time \| 53088 \| 52911 \| 13364 Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time \| 6268981 \| 6230335 \| 110 Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time \| 11701237 \| 11632339 \| 59 Win11+12900K 8 cores: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time \| 215976 \| 211295 \| 2884 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time \| 60960590 \| 60937500 \| 10 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time \| 1.18E+08 \| 1.19E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time \| 470377 \| 453059 \| 1414 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time \| 1.54E+08 \| 1.53E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time \| 3.18E+08 \| 3.13E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time \| 569072 \| 559398 \| 1229 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time \| 1.54E+08 \| 1.52E+08 \| 4 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time \| 3.22E+08 \| 3.28E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time \| 1486055 \| 1473325 \| 403 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time \| 4.14E+08 \| 4.14E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time \| 8.88E+08 \| 8.59E+08 \| 1 --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mengni.wang@intel.com>	2023-12-19 09:36:31 -08:00
Changming Sun	f52668cc68	Disable mlas unit test in ARM64EC build (#18747 ) ### Description Disable mlas unit test in ARM64EC build because the program has some link errors. We will fix the errors later. This PR only impacts Windows ARM64EC build. It has no impact on the existing build pipelines.	2023-12-15 09:17:47 -08:00
Changming Sun	d795fc636c	FIX: Our cmake script didn't check googletest's hash (#18826 )	2023-12-15 08:48:15 -08:00
Changming Sun	cbad4fe49b	Update absl and googletest (#18827 ) ### Description Update absl and googletest to their latest version to include some cmake changes: 1. A googletest's cmake change that will allow using external absl and re2. 2. Nullability enhancements that will allow our clang-based static analysis detecting many kinds of null pointer errors. ### Motivation and Context To fix a C4744 link warning in our Windows pipelines. ``` LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\usage.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<int>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] ```	2023-12-14 16:15:07 -08:00
Yueqing Zhang	b42d4b8ea6	[VitisAI] 1. api compatbile 2. dynamic load onnx (#18470 ) ### Description <!-- Describe your changes. --> 1. Add a backward-compatible API for compiling model. 2. Run-time load vitisai-ep.dll ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2023-12-14 14:43:41 -08:00
Suryaprakash Shanmugam	0723dcb8b5	OpenVINO Execution Provider with 2023.2 support (#18596 ) - Add support for OpenVINO 2023.2 - num_of_threads provider option is mapped to the CPU device property inference_num_threads of the CPU plugin, so users can control the #threads used for inference by the CPU - Logging in Debug mode now includes the runtime properties set for devices - Fix issue in using external weights through OpenVINO --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-12-13 15:56:43 -08:00
Adrian Lizarraga	81796a3081	[QNN EP Quantization] Add fusion preprocessing to QNN quantization (#18719 ) ### Description - Adds graph fusions to preprocessing step that can be called before creating a QDQ model for QNN EP. - Fuse Erf sequence to Gelu (adapted from [optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)). Required by QNN EP. - Fuse ReduceMean sequence to LayerNormaliation (adapted from [optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)). Not required by QNN EP. - Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP). Required by QNN EP. Example use: ```python3 from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model # Added by this PR: model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True) model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx" # Quantize model ... qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16) quantize(model_to_quantize, "model.qdq.onnx", qnn_config) ``` ### Motivation and Context Allow more models to be quantized for use with QNN EP --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-12-12 08:43:04 -08:00
Changming Sun	c7799d7058	Build fixes for Windows ARM32 desktop build (#18752 ) ### Description Fix a link error: ``` onnxruntime_common.lib(cpuid_info.obj) : error LNK2019: unresolved external symbol __imp_RegGetValueA referenced in function "privat e: void __cdecl onnxruntime::CPUIDInfo::ArmWindowsInit(void)" (?ArmWindowsInit@CPUIDInfo@onnxruntime@@AAAXXZ) [C:\Users\snnn\src\on nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventRegister referenced in function "pub lic: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\on nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventUnregister referenced in function "p ublic: virtual __cdecl onnxruntime::WindowsTelemetry::~WindowsTelemetry(void)" (??1WindowsTelemetry@onnxruntime@@UAA@XZ) [C:\Users\y ilyu\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventSetInformation referenced in functio n "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\ src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventWriteTransfer referenced in function _tlgWriteTransfer_EventWriteTransfer [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\RelWithDebInfo\onnx_test_runner.exe : fatal error LNK1120: 5 unresolved ex ternals [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] ```	2023-12-08 12:45:06 -08:00
Changming Sun	bf33919afb	Update absl and gtest to fix an ARM64EC build error (#18735 ) ### Description Update absl and gtest to fix an ARM64EC build error ### Motivation and Context We need to get an important fix into ORT. The fix is: `8028a87c96`	2023-12-07 15:55:17 -08:00
junchao-loongson	4abec9749e	[mlas] add loongarch lsx and lasx optimize code (#17937 ) ### Description Hello we(@lixing-star) are the developers of loongson team. We add 128 (lsx), 256 (lasx) vector optimization code for the loongarch architecture [100% tests passed, 0 tests failed out of 7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true) ### Development Environments1 ``` CPU: Loongson-3C5000L uname -a: Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux ``` ### LonngArch Documents - [LoongArch Reference Manual - Volume 1: Basic Architecture: This manual describes the basic part of the LoongArch architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html) - [LoongArch ELF psABI: This manual describes the LoongArch ELF psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html) - [more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)	2023-12-07 11:15:59 -08:00
moyo1997	9479ba525b	Build onnxruntime.dll as arm64x (#18633 ) Build onnxruntime.dll as arm64x Added a .cmake file to generate a link repro of the onnxruntime.dll during arm64 build. This provides us a directory containing all the arm64 objs, def file and libs to link to when it is time to building arm64x onnxruntime.dll during the arm64ec build by passing the /machine:arm64x flag to the linker along with the arm64 artifacts. If other dlls wanted to be built as x, setting the ARM64X_TARGETS variable in the toplevel cmakelists.txt to include these other targets is all that will be needed. Added build_arm64x.bat as a wrapper for the multiple (rm64, then arm64ec) cmake calls needed to build as arm64x. AB#22533	2023-12-06 16:49:00 -08:00
Ye Wang	c012e41f93	MoE with Expert Slicing (#18565 ) ### Description <!-- Describe your changes. --> Registered Sharded MoE op under contrib_op/cuda/collective with expert slicing. The broadcast process happens just before adding second bias(if has) and permutation undoing. Tensor slicing is planned but not included in this PR. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-05 16:56:38 -08:00
Adrian Lizarraga	e066fca777	[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465 ) ### Description #### 1. Adds `TensorQuantOverrides` extra option Allows specifying a dictionary of tensor-level quantization overrides: ``` TensorQuantOverrides = dictionary : Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains a dictionary for each channel in the tensor. Each dictionary contains optional overrides with the following keys and values. 'quant_type' = QuantType : The tensor's quantization data type. 'scale' = Float : The scale value to use. Must also specify `zero_point` if set. 'zero_point' = Int : The zero-point value to use. Must also specify `scale` is set. 'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also set `scale` or `zero_point`. 'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also set `scale` or `zero_point`. 'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. 'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. ``` - All of the options are optional. - Some combinations are invalid. - Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale` are also specified. Example for per-tensor quantization overrides: ```Python3 extra_options = { "TensorQuantOverrides": { "SIG_OUT": [{"scale": 1.0, "zero_point": 127}], "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], }, } ``` Example for per-channel quantization overrides (Conv weight and bias): ```Python3 extra_options = { "TensorQuantOverrides": { "WGT": [ { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.0, "rmax": 2.5, "reduce_range": True, }, { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.2, "rmax": 2.55, "reduce_range": False, }, ], "BIAS": [ {"zero_point": 0, "scale": 0.000621}, {"zero_point": 0, "scale": 0.23}, ], }, } ``` #### 2. Adds utilities to get the default QDQ configs for QNN EP Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method that inspects the model and returns suitable quantization configurations. Example usage: ```python3 from quantization import quantize, QuantType from quantization.execution_providers.qnn import get_qnn_qdq_config qnn_config = get_qnn_qdq_config(input_model_path, data_reader, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8) quantize(input_model_path, output_model_path, qnn_config) ``` ### Motivation and Context Make it possible to create more QDQ models that run on QNN EP. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-12-04 17:54:58 -08:00
snadampal	05a9c95764	[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847 ) Add ACL as the DNNL runtime option for aarch64 platforms. Update makefile and the python wheel build script. ### Description <!-- Describe your changes. --> Add ACL as the DNNL runtime option for aarch64 platforms. Update makefile and the python wheel build script. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This is to enable the optimized ACL gemm kernels for dnnl execution provider on aarch64 platform.	2023-12-01 09:16:44 -08:00
George Wu	5c67a00d8e	Revert "remove full protobuf requirement for tensorrt ep" (#18626 ) Reverts microsoft/onnxruntime#18413 there's a timing issue here. we eventually want to get this change merged in but we need to update OSS onnx-tensorrt first.	2023-11-29 22:27:51 -08:00
Edward Chen	14a343441d	Fix Objective-C static analysis build (#18606 ) - Patch abseil to fix a compile error about not finding `cxxabi.h`. - Fix some static analysis warnings.	2023-11-28 17:14:20 -08:00
Rachel Guo	288b80d363	Add MacOS build to ORT C Pod (#18550 ) ### Description <!-- Describe your changes. --> As title. 1. Add macos build as an optionally enabled arch for pod and changes to exsiting build_ios_framework/assemble_c_pod scripts. 2. Enable macos build arch in ios packaging pipeline (currently for variants other than Mobile) and check the output artifacts are correct. 3. Write MacOS Test Target scheme in the test app and integrate into ios packaging CI testing pipeline. Currently the changes only apply to onnxruntime-c pod. as the original request was from ORT SPM which consumes the onnxruntime-c pod only as the binary target. TODO: could look into adding macos platform to objc pod as well. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enable macos platform support in cocoapods. and also potentially produce binary target for enabling macos platform in SPM as well. Replace https://github.com/microsoft/onnxruntime/pull/18334 --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-28 10:11:53 -08:00
Chen Fu	05046e5452	Adding unit test for sm80 prepack (#18514 ) ### Description Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware ### Motivation and Context Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated quantized weight rearrangement to speedup loading data to tensor-core. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitate iterating matrix elements and tiles that balance memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weight, scales and offsets (aka. prepacking) 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on CPU with single threaded code. We run this on CPU in order to minimize GPU memory fragmentation. On the other hand, hopefully we get around to parallelize this part of the code. Should be straight forward with the unit tests in place.	2023-11-28 10:01:09 -08:00
Sheil Kumar	0b7048e7d6	Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384 ) Update winml to use #cores - #soc cores by Default as the number of intraopthreads --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-28 09:26:48 -08:00
cloudhan	6f3c1f9dc9	[ROCm] Update ck for GemmFloat8 (#18487 )	2023-11-23 12:06:19 +08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently memory optimizer runs graph transformations and print recompute opportunities in INFO level, while ORT backend has many many INFO level logs making users hard to find those information. So we are looking for a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can create an ORT session created, where allocation plan is done, the analysis will consider buffer reuse as well. This can void giving some recomputation subgraphs that are reusing other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull requests made following refactoring: 1. Print the log in ORTModule Python script, along with ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. We are analyzing memory optimization opportunities considering ORT memory planning. This is done by firstly creating the execution graph without enabling MemoryOptimizer, then we call `execution_agent.get_serialized_ortmodule_memory_stat` which internally will consider the session memory allocation planner when analyzing memory optimization opportunity. As a direct result, the memory optimization opportunities can show those stashed activations that are reusing other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies for their own implementation. This will make introducing new strategies (for example compression and decompression ) easier. New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
Dmitri Smirnov	cc542024ce	Create edges with arg positons correctly accounting for non-existing args (#18462 ) ### Description Truncate traling non-existing arguments. Make sure we do not skip on the non-existing arguments in the middle, because shape inferece relies on their proper position. This also affects the argument position in the Edges that must be properly rebuilt each time If node branch is inlined. Make sure that when we rename Defs in subgraphs, new renamed defs are created in those subgraphs instead of pointing to outer scope defs. Add unit test. ### Motivation and Context This is a follow up for https://github.com/microsoft/onnxruntime/pull/18105 Currently, the non-trailing arguments are simply ignored and the edges are created with potentially incorrect positions.	2023-11-20 14:49:09 -08:00
Akshay Sonawane	97cc40d75a	Add fusion patterns for conformer-transducer model (#18461 ) ### Description Add conformer-transducer model type to optimizer. This PR adds pattern matches for attention shown below: Unfused attention: ![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956) Fused attention: ![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)	2023-11-18 23:39:04 -08:00
Ashwini Khade	02333293de	Removed all the deprecated python training code and related tests and utils (#18333 ) ### Description Motivation for this PR is code cleanup. 1. Remove all deprecated python code related to orttrainer, old checkpoint, related tests and utils 2. Cleanup orttraining_pybind_state.cc to remove all deprecated bindings.	2023-11-17 18:19:21 -08:00
George Wu	d73073d491	remove full protobuf requirement for tensorrt ep (#18413 ) tensorrt can work with protobuf lite.	2023-11-16 20:44:27 -08:00
Yulong Wang	6f9f653ada	[wasm] increase test max memory from 2G to 4G (#18459 ) ### Description increase max memory from 2G to 4G for onnxruntime_test_all in WebAssembly build.	2023-11-15 17:51:04 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
Ye Wang	f9af94009b	onboard MoE (#18279 ) ### Description <!-- Describe your changes. --> 1. Introduce MoE CUDA op to ORT based on FT implementation. 2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove patch file for cutlass 3.0.0. 3. Sharded MoE implementation will come with another PR limitation: __CUDA_ARCH__ >= 700 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-14 16:48:51 -08:00
PeixuanZuo	a62a500ae1	[ROCm] Update CK version (#17628 ) update ck version	2023-11-13 15:43:38 -08:00
Scott McKay	8d298f6f78	Fix xnnpack compile error on arm32 (#18291 ) ### Description <!-- Describe your changes. --> Use different march flag to workaround what appears to be a clang issue. See https://github.com/tensorflow/tensorflow/issues/59970 for links to various relevant pieces of info/discussions. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-12 08:59:20 +10:00
Scott McKay	64c91d790b	Fix ability to use patch on Windows CI machines (#18356 ) ### Description <!-- Describe your changes. --> Add 32-bit patch binary and infra to fallback to it. The Azure devops Windows CIs are missing patch.exe from their git install for some reason so the default `find_package(Patch)` fails as that is where it expects to find it. Remove Eigen patch. Underlying issue was fixed in source 3 years ago by `c6c84ed961` and the patch command is invalid (args are for git apply not patch). ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Make usage of patch consistent across all CIs Fix https://github.com/microsoft/onnxruntime/issues/15248	2023-11-11 07:32:14 +10:00
Bart Verhagen	87744e55fa	fix reference to Microsoft.GSL::GSL in CMake build scripts when enabling cuda (#17843 ) ### Description Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the GSL package that is found on the system is used. However, when cuda is enabled, it is downloaded and patched. Most CMake scripts rely on the first case and forget about the second. This patch makes the second case behave like the first case. ### Motivation and Context This is an issue that occurs 'in the wild'. For example, I had to patch this to be able to enable the CUDA provider for the onnxruntime conan package (see https://github.com/conan-io/conan-center-index/pull/20392).	2023-11-10 10:46:45 -08:00
Changming Sun	812532592e	Add a build validation for Linux ARM64 cross-compile (#18200 ) ### Description 1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch issues listed in #18195 . 2. Revert eigen's commit id back to what we had before. ### Motivation and Context To catch cross-compile issues. Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639	2023-11-08 13:03:18 -08:00
Dmitri Smirnov	a37e6a503b	Update Abseil raw_flat_hash visualization (#18329 ) ### Description <!-- Describe your changes. --> Fix the broken pieces due to the latest Abseil update. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? Make the debugging bearable.	2023-11-08 11:19:45 -08:00
Wei-Sheng Chin	fb6737e893	Distributed Squeeze and Distributed Unsqueeze (#18269 ) Implementat DistributedSqueeze & DistributedUnsqueeze for llama 2.	2023-11-06 20:11:35 -08:00
Yi Zhang	b7b8b5b2ce	Fix Eigen-3.4.0 URL and hash (#18290 ) ### Description Add CI changes for #18287 Install onnx explicitly to pass windows GPU+dml stage. ### Motivation and Context 'eigen-3.4' was refering to a branch, not to a tag. There is now an Eigen 3.4.1 on that branch, and thus the hash has changed. See https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416	2023-11-06 09:19:51 -08:00
Chi Lo	dfafcb58aa	[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274 ) https://github.com/microsoft/onnxruntime/pull/17468 The above PR didn't fully fix the issue for some environments. This PR fixes this.	2023-11-03 20:04:10 -07:00
Scott McKay	4f2096be38	Update XNNPACK to latest version (#18038 ) ### Description <!-- Describe your changes. --> Update XNNPACK to latest version - adds fp16 kernels and various other improvements - requires pthreadpool update as well Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API - 'setup' is split into 'reshape' and 'setup' - some ops use a workspace buffer - copied workspace allocation from XNNPACK unit test code - some suffixes changed Added wrapper for XNNPACK caches to base XNNPACK EP kernel - simplifies usage - XNNPACK split out the code and weights caches, but the code cache isn't currently usable via the public API - we could use the internal types if we think it's required for performance reasons. non-trivial though as we'd need to propagate ifdef values from the XNNPACK build up to the ORT build. - using XNNPACK internals would also mean we would not be able to support using a pre-build XNNPACK package - not an issue currently Fixed opset registration for internal NHWC domain - was not being tied to the ONNX version, so nodes inserted by layout transformation had the incorrect opset - a number of other places needed updating once this issue was fixed Remove support for NCHW Resize from XNNPACK EP so it's NHWC only - we only supported NCHW for fp32, - doing so adds complexity in multiple places (XNNPACK EP kernel implementation, layout transformation and transpose optimization) - unclear if that complexity provides any benefit. can add back if required by production scenario ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> We're looking at enabling fp16 support for CoreML and NNAPI. If we do that we need a good fallback story if the CPU EP will be used. The XNNPACK fp16 kernels will hopefully provide that. NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That can be done as required in separate EPs and should be relatively simple to do.	2023-11-03 09:04:28 -07:00
Scott McKay	016b75260b	Pre-link when creating static library for apple framework (#18241 ) ### Description <!-- Describe your changes. --> Pre-link with `ld -r` to apply symbol visibility when the static library is created to replicate XCode's Single Object Pre-link. Current builds set the visibility flags but that doesn't get applied until the static library is linked into something else, which can be too late. Pre-linking fixes this. The pre-link uses the .o files from the ORT static libraries and the .a files from external libraries. This combination limits the symbols included from the .a files to things required by the ORT .o files. In order to minimize changes elsewhere in the build we extract the .o files from the ORT static libraries using `ar -x`. Re-ordered the pieces use to build the Apple framework to make it a little more readable. Fixed a couple of misc issues with missing symbols from the minimal build that show up when pre-linking is applied. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Will hopefully address #17722	2023-11-03 23:38:29 +10:00
aciddelgado	178f7caaeb	GQA Memory Efficient Kernel (#17920 ) Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator. ### Motivation and Context Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.	2023-11-01 20:04:22 -07:00
Wei-Sheng Chin	9e8ad39847	Distributed Reduction (#18206 ) This PR implements distributed reduciton for llama 2. This version doesn't consider any cases requring re-sharding because we haven't seen any use cases. Intutive examples: - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and device_mesh=[0,1] Algorithm: When the reduced axes are not sharded, each device can call reduction directly. The output sharding spec will be identical to input sharding spec. We currently throw when input and output sharding specs are different. Review guideline: - Check 97b8d2f for new op's schema and how new op is registered. - Read tests in 2450f93 to get faimilar with the behavior of these ops. - Check the implementation details in 753d9af.	2023-11-01 08:49:33 -07:00
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
liqun Fu	20f2dd8b6b	use onnx rel-1.15.0, update cgman, cmake/external and requirement hash (#18177 )	2023-10-31 14:58:21 -07:00
Wei-Sheng Chin	24f9c1afe3	Distributed Expand (#18126 ) This PR implements DistributedExpand for llama 2. Representative Examples of DistributedExpand: - [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[8, 2] -> output tensor (shape=[8, 2], spec=S[0]R, device_mesh=[0,1])` - [sharding expanded axis is invalid since it must have dim=1 and axis with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[2, 8] -> output tensor (shape=[2, 8], spec=S[0]R, device_mesh=[0,1])` From those examples, we observe a few important behaviors. - The output sharding spec is always the same to the input sharding spec. - Expanding always happen on axis with dimension=1. Otherwise, it will violate the broadcasting rule. - No communication is needed since all computation can happen locally. Let's consider the first example again. If you put the first half tensor (shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on device 1, then `Expand` it with target shape [4, 2] , these two local tensors (shape: [4, 2]) are exactly the same as the one described by output sharding spec. Algorithm: - Compute logical (i.e., unsharded) shapes of input and output. - Compute sharded output shape from logical output. - Call Expand to broadcast local input to sharded output shape. How to review? - Start with [changes in onnxruntime_test_distributed.py](`ea33392f37`). Those tests are good examples for using this op. - [Read expand.h/expand.cc](`e4c49987f5`). Theose changes are for exposing functionalities in Expand to DistributedExpand. - Read distributed_expand.h/distributed_expand.cc. It follows the algorithm described above. The commit `68ac301bba` first sketches the definition of DistributedExpand. The next commit `0eb9330c3b` adds real implementation.	2023-10-28 00:44:02 -07:00
Xavier Dupré	b5f242e978	GemmFloat8 as a contrib ops (#16051 ) ### Description Add support for Gemm with float 8 as a contrib op. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-10-27 14:33:55 +02:00
Wei-Sheng Chin	9c32310673	Distributed Reshape Implementation (#18068 ) This DistributedReshape aims at supporting all sharding patterns encountered in llama 2. All patterns found are tested in `TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR implements algorithms to compute the categories below. - All inputs and outputs are replica, so it's computed like a normal Reshape. - Two-axis fusion (if any of the inputs and outputs are sharded). This category convers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`. - Two-axis decomposition (if any of the inputs and outputs are sharded). This category convers, e.g., `[batch x seq, hidden] -> [batch, seq, hidden]`. Review guideline: - Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR #18025. - First, read onnxruntime_test_distributed.py to get familiar with the input/output of DistributedReshape. - Second, check the new APIs in reshape.h/reshape.cc to expose CUDA Reshape kernel to DistributedReshape. - For DistributedReshape, check its `ComputeInternal` for the 3 categories mentioned above.	2023-10-26 22:33:42 -07:00
Vincent Wang	b7408f7389	[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959 ) This PR is to support efficient attention and flash attention in ORTModule, including: - Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable. - Integrate Triton Flash attention, which requires triton==2.0.0.dev20221202. Need A100 or H100. ORTMODULE_USE_FLASH_ATTENTION=1 to enable. - A python transformer tool to match sub-graph by config and write transformer quickly. Current transformers supports attention mask for both efficient attn and flash attn, and dropout for efficient attn only. To support more training scenarios (such as causal mask in GPT2), more transformers need to be added. The feature is guarded by system environment variables, it won't effect any current behavior if not enabled. Since it requires specific PyTorch/Triton versions, related tests is not added for now.	2023-10-27 10:29:27 +08:00
Chi Lo	455a9ce614	[TensorRT EP] Use latest onnx-tensorrt parser (#18067 ) Use latest onnx-tensorrt to fix compile error. Please see the issue https://github.com/microsoft/onnxruntime/issues/18029	2023-10-26 13:55:12 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082 ) ### Description <!-- Describe your changes. --> The mmla kernels require additional ISA flags and are currently supported only on Linux ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> more context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay , @chenfucn , @snnn	2023-10-25 11:34:57 -07:00
snadampal	780ee186d7	[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160 ) ### Description <!-- Describe your changes. --> This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This covers (i) symmetric quantization (zero point is Zero) (ii) asymmetric quantization (zero point is non zero) (iii) per channel as well as per tensor quantization (iv) Signed weights (U8S8 Gemm) (v) Unsigned weights (U8U8 Gemm) and (vi) Signed activations and weights (S8S8 Gemm) scenarios I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM` support MMLA QGEMM kernels are enabled for all the devices that support I8MM instructions. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This is to improve INT8 quantized MatMul performance on aarch64 platform. I have run the below benchmarking script (bert , roberta and gpt2 model inference) on AWS Graviton3 based c7g.4xl instance and observed up to 1.33x performance improvement compared to the optimized UDOT qgemm kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` I have also run the unit tests, and made sure all are passing ``` ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync ```	2023-10-24 07:49:04 +10:00
liqun Fu	020824ed50	Update ONNX to 1.15.0rc1 (#17914 )	2023-10-20 15:08:25 -07:00
Hariharan Seshadri	9356986730	Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972 ) ### Description This PR: (1) Fixes AMD builds after #17200 broke them (Need to remember to run AMD builds while trying to merge external CUDA PRs next time) (2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time spent in building a few more files and running a few more tests will not be much. Test Linux GPU CI run : https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770 ### Motivation and Context Keep the NHWC CUDA ops tested (https://github.com/microsoft/onnxruntime/pull/17200) and guard against regressions	2023-10-17 09:23:52 -07:00
Maximilian Müller	7c17e33c07	Make CUDA a NHWC EP (#17200 ) ### Description CUDA inference speed heavily relies on Tensor Cores. To have tensor cores achieve the optimal throughput they require the data layout to be NHWC rather than NCHW. ### Motivation and Context Especially for convolutional networks this is very important. I will illustrate this using a very simple network: ``` import torch import torch.nn as nn class Net1(nn.Module): def __init__(self): super(Net1, self).__init__() # 1 input image channel, 6 output channels, 5x5 square convolution # kernel self.m = nn.ModuleList([ nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1), nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1), nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1), nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False), nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False), ]) def forward(self, x): for module in self.m: x = module(x) return x if __name__ == "__main__": dtype = torch.half device = "cuda" dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device) model = Net1().to(dtype=dtype, device=device) input_names = ["input1"] output_names = ["output1"] torch.onnx.export(model, dummy_input, "test.onnx", input_names=input_names, output_names=output_names) ``` I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges. Current master launches below kernels: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b) If I add the introduced `-l` flag we see below kernels: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008) Notice the missing NCHW<>NHWC kernels per operation. The layout optimizer introduced a transpose op as first and last op of the whole network. The `op_generic_tensor_kernel` shows the bias used which should also be optimized out next. Measured across some very basic models: \| CUDA EP \| NCHW [ms] \| NHWC [ms] \| Speedup \| \|:------------------------\|--------------------------------------:\|-----------------------------------------:\|------------------:\| \| \| -e cuda -t 5 -q \| -e cuda -t 5 -q -l \| \| \| resnet101-v2-7_bs8_fp16 \| 18.33 \| 13.07 \| 1.4 \| \| resnet101-v2-7_bs8 \| 21.8 \| 12.06 \| 1.81 \| \| test \| 102.07 \| 73.62 \| 1.39 \| Average speedup: 1.53 ## Outlook Next the mission will be to first write a templated unit test to check for correctness of NHWC vs NCHW ops. After that we have to transition more ops to measure perf improvements on a broader range of models. Currently this is not easily possible as we can do not support all ops in the NHWC domain. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2023-10-16 10:16:37 -07:00
Chi Lo	8abaa7b753	[TensorRT EP] Fix cmake install (#17923 ) We removed tensorrt_provider_factory.h in the [PR](https://github.com/microsoft/onnxruntime/pull/17617). Need to remove the copy of this file when cmake install.	2023-10-16 09:16:24 -07:00
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description <!-- Describe your changes. --> Add a contrib op MatMulNBits and related toolchain to support quantization on weight. This PR only adds support for 4bits. It: - add schema for contrib op MatMulNBits which can support 1-7 bits quantization on weight. - a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for 4bits MatMulNBits and related benchmark tool - tool to quantization model with 4bits. Next: - add general and more efficient kernels for 4bits MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00
Jeff Daily	07317316cc	CUDA EP vs ROCM EP hipify audit (#17776 ) Migrate most CUDA EP improvements and changes to ROCM EP. The process involves using hipify against all CUDA EP files (i.e. do not exclude any files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them against the ROCM EP files that are under source control and pull in most changes. These changes include functional as well as formatting and makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff somewhat less obvious due to formatting changes. - hipify audit of onnxruntime/core/providers/rocm, enable ops - Loop - Scan - hipify audit of onnxruntime/contrib_ops/rocm - fix contrib ops search implementation - enable more contrib ops - Affine - ComplexMul - ConvTransposeWithDynamicPads - Crop - DynamicSlice - FFT [Rfft, Irfft] - GreedySearch - ImageScaler - ParametricSoftplus - ScaledTanh - ThresholdRelu --------- Co-authored-by: cloudhan <cloudhan@outlook.com>	2023-10-13 10:13:53 +08:00
Tang, Cheng	ca8cab29cd	distributed slice (#17761 ) ### Description Support DistributedSlice kernel in Cuda EP. mainly support following cases: 1. input data is sharded or replica for all axes (including slice axes) 2. slice axes is sharded across different devices. starts / ends / steps sharded across different devices are not supported yet. --------- Co-authored-by: Wei-Sheng Chin <wschin@outlook.com> Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-12 14:28:00 -07:00
Maximilian Müller	74a8acf405	Set default value for NVCC threads (#17866 ) Without doing this CMake gives a miscellaneous error on windows when checking if NVCC is functional. It will be missing a number after `--threads`. Currently it is only possible to configure through the python build scripts and not CMake only configure - which is what I am usually doing through CLion.	2023-10-11 22:46:40 -07:00
Numfor Tiapo	b8f373b0ae	Add API for NPU Device Selection in the DML EP (#17612 ) Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-11 14:53:00 -07:00
pengwa	0e2782438a	Support inplace update for PythonOp/Grad (#17687 ) ### Support inplace update for PythonOp/Grad This PR is based on another PR https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it easier to review. With PR: PR https://github.com/microsoft/onnxruntime/pull/17685, By default all PythonOp inputs/outputs are assumed to not be inplaced, if during run, we found some inplace update happens (by checking output data address with all inputs data address), we add clone before set it as PythonOp/Grad's outputs. In this case, results are correct, but implicit copies overheads are introduced. This PR allow users to define output input reuse map, to let ORT know how to do the reuse map, avoid such unnecessary copies.	2023-10-10 21:36:45 -07:00
Changming Sun	05ac9f6f2a	Split onnxruntime_providers.cmake to multiple (#17853 ) ### Description Split onnxruntime_providers.cmake to multiple files, for easier editing. No other change was made in this PR.	2023-10-09 20:33:44 -07:00
Baiju Meswani	9c716f4557	Add noexcep_operators to onnxruntime internal libraries (#17850 )	2023-10-09 16:29:41 -07:00
cloudhan	c2bd5b70b2	Fix enable_training and use_migraphx (#17827 )	2023-10-08 11:43:27 +08:00
MistEO	faf9a0f6c7	Fix runtime installation error (#17828 )	2023-10-07 11:50:02 -07:00
JiCheng	3878011ce2	Remove MPI dependency (#17624 ) ### Description <!-- Describe your changes. --> Support launch multi-GPU without MPI ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-06 15:33:18 +08:00
George Wu	b306b02a86	[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806 ) -update InstanceNormU8 with fixed input. With this input, it fails consistently using QNN 2.15.1 -update QNN lib paths (target is deprecated) and additionally copy V73 skel file	2023-10-05 22:17:15 -07:00
Justin Chu	be7541ef4a	[Linter] Bump ruff and remove pylint (#17797 ) Bump ruff version and remove pylint from the linter list. Fix any new error detected by ruff. ### Motivation and Context Ruff covers many of the pylint rules. Since pylint is not enabled in this repo and runs slow, we remove it from the linters	2023-10-05 21:07:33 -07:00
Wei-Sheng Chin	faef9c32fa	ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695 ) This PR introduces - New data structure to represent kernel-level (aka node-level or op-level) tensor sharding informaiton. I consider it as the fundamentaion of ONNX distribtued inference. - Building blocks for distribtued kernels implementation especially stateless implementation for communication ops. - Implementation of DistributedMatMul and its tests. Code structure: - sharding.h/.cc: Function to shard and reshard tensors (calling into NCCL). - sharding_spec.h/.cc: Representation of how a tensor is sharded. - distributed_matmul.h/.cc: Implementation of tensor parallel MatMul. Inputs and outputs are sharded across devices. - onnxruntime_test_distributed.py: distributed operator tests. Example of specifying sharding information ```python @onnxscript.script() def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT: # Run MatMul by sharding x along column axis and w along row axis on # 2 GPUs. return MICROSOFT_OPSET.DistributedMatMul( tensor_x, tensor_w, device_mesh_shape=[2], device_mesh_elements=[0, 1], input_shard_specs=["RS[0]", "S[0]R"], output_shard_specs=["RR"], ) onnx_model = matmul_rs_sr_rr.to_model_proto( input_types=[FLOAT[2, "s"], FLOAT["s", 2]], output_types=[FLOAT[2, 2]], ) ``` In this example, the device mesh can be visualized as 1-D tensor, `[0, 1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the 0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is sharded across `[0, 1]` as well. C++ classes to represent tensor sharding (copied from sharding_spec.h): ```cpp class DeviceMesh { public: // [Device Mesh and Tensor Sharding for Tensor Parallel] // Device mesh is a tensor of device indices. // A tensor can then be partitioned along specific mesh axes. // // Assume we have 4 GPUs indexed by 0, 1, 2, and 3. // Let's consider some examples. // 1. 1D device mesh [0, 1, 2, 3]. In this case, // device_mesh_shape is [4] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor along its axis 1, the // corresponding sharding spec is a string "RS[0]". // 2. 2D device mesh [[0, 1], [2, 3]]. In this case, // device_mesh_shape is [2, 2] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor's // rows along mesh axis 1 and // columns along mesh axis 0, the // corresponding sharding spec is a string "S[1]S[0]". // If that 2-D tensor's value is np.array([[5, 6], [7, 8]]), // GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization the sharding // proccess. // - Start with a 2-D device mesh [[0, 1], [2, 3]] and // a 2-D tensor [[5, 6], [7, 8]] // - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]] // - Split GPU mesh along axis 1 and tensor along // axis 0 for "S[1]" in "S[1]S[0]" // - GPU: [[0], [2]], Tensor: [[5, 6]] // GPU: [[1], [3]], Tensor: [[7, 8]] // - Split GPU mesh along axis 0 and tensor along // axis 1 for "S[0]" in "S[1]S[0]" // - GPU: [[0]], Tensor: [[5]] // - GPU: [[2]], Tensor: [[6]] // - GPU: [[1]], Tensor: [[7]] // - GPU: [[3]], Tensor: [[8]] // Actual shape of device mesh represented by `device_mesh_elements`. std::vector<int64_t> device_mesh_shape; // Flattened device mesh. std::vector<int64_t> device_mesh_elements; }; class AxisPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // This class is the in-memory representation of // 1. if a tensor is sharded or not (aka replica), and // 2. which tensor axis is shard by which device mesh axis. // Let's consider sharding 2-D tensor along column axis on // device mesh [0, 1] as an example. // The required sharding spec RS[0] can be represented by // - AxisPartitionSpec(Condition::Replica, -1) // - AxisPartitionSpec(Condition::Shard, 0) public: // Status of a tensor axis. // A tensor axis can be either sharded or replicated // along a device mesh axis. enum class Condition { Replica, Shard }; // This field tells if a tensor axis is sharded or not. Condition cond; // If a tensor axis is sharded, this field tells which device // mesh axis to distribute the shards along. // If a tensor axis is not sharded, this field is ignored. int device_mesh_axis; // A helper to construct a replica spec for a tensor axis. static AxisPartitionSpec CreateReplica() { return AxisPartitionSpec(Condition::Replica, -1); } // A helper to construct a sharding spec for a tensor axis. // This tensor axis is sharded along `device_mesh_axis` in device mesh. static AxisPartitionSpec CreateShard(int device_mesh_axis) { return AxisPartitionSpec(Condition::Shard, device_mesh_axis); } }; class TensorPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // TensorPartitionSpec holds a collection of AxisPartitionSpec and an // associated DeviceMesh. It is responsible for determining how a tensor // should be partitioned across a device mesh. // // Example 1: RS[0] // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects. // - The first object is a Replica, denoting that the first axis of the tensor is // not sharded but is instead replicated. // - The second object is a Shard along the 0-th axis of the device mesh. It denotes // that the second axis of the tensor is sharded along the first axis of the // device mesh. // // Example 2: S[0]RR // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects. // - The first object is a Shard along the 0-th axis of the device mesh, indicating // that the first axis of the tensor is sharded along the first axis of the // device mesh. // - The second and third objects are Replicas, indicating that the second and third // axes of the tensor are not sharded but are instead replicated. public: // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor, // axis_specs[0] is for row axis and axis_specs[1] is for // column axis. axis_specs[i].device_mesh_axis = j means that // tensor axis i is sharded along device mesh axis j. std::vector<AxisPartitionSpec> axis_specs; // device_mesh: DeviceMesh for sharding the associated tensor. // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment. DeviceMesh device_mesh; }; ```	2023-10-05 14:22:25 -07:00
Edward Chen	1bc115719c	Unify handling of public headers in onnxruntime.cmake. (#17779 ) The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.	2023-10-04 08:55:08 -07:00
Changming Sun	14d349e290	Enable backtrace in unit tests (#17655 ) ### Description Google test can be built either with absl/re2 or not. This PR enables the build option so that google test framework can print out a nice stacktrace when something went wrong. It helps locate test errors in CI build pipelines. Also, Google test will remove the build option and make it always ON. So sooner or later we must make this change.	2023-09-29 12:32:56 -07:00
MistEO	870b0bc305	Fix typo of cmake (#17715 ) This caused a cmake configuration error.	2023-09-27 11:48:46 -07:00
Mustafa Ateş Uzun	13b0f8a6ce	fix: supported typo (#17216 )	2023-09-27 10:45:27 -07:00
liqun Fu	2be4dc6d04	ONNX 1.15 integration (#17125 ) ### Description this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released ### Motivation and Context Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-09-26 14:44:48 -07:00
Jian Chen	0141e27ca1	Enabling c++ 20 in MacOS build (#16187 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-09-26 11:27:02 -07:00
Tianlei Wu	730fab3050	Refactor Attention cuda kernel (#17578 ) * Break QkvToContext into small functions. Each fused and unfused kernel will have separated function. * Move DecoderAttention kernel to separated file * Move KV cache related kernel to attention_kv_cache.cu ### Motivation and Context To make the code easier to maintain.	2023-09-19 09:49:21 -07:00
Tianlei Wu	adb0be45d3	Refactoring of attention cuda kernel: move prepare qkv and concat_past_to_present (#17559 ) To avoid a huge cu file and make code more readable: - Move PrepareQKV to separate cu file (attention_prepare_qkv.cu) - Move ConcatPastToPresent to attention_concat.cu - Add default value for AttentionData - Add a data structure QkvData to track Q, K and V pointers and track QKV format.	2023-09-15 10:57:29 -07:00
Changming Sun	5af6279440	Fix Android build (#17540 ) ### Description The new cpuinfo library doesn't use clog on Android. Newer XNNPack versions have removed the dependency on clog, but the one we use still has it. So I cherry-pick the XNNPack to our patch file.	2023-09-14 07:36:01 -07:00
Changming Sun	24a3c740c0	Revert "[ROCm][MIGraphX] for googletest dep, set OVERRIDE_FIND_PACKAGE (#16715 )" (#17523 ) This reverts commit `bb136f86c8`, then re-implement it in a different way. I reverted the original change, then added a version constraint to the find_package args. If you still found it picks up wrong gtest version after this change, you may disable `find_package` by setting 'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest gtest version is 1.14.0. If at a later time Google releases a new version of gtest and that one is incompatible with the ONNX Runtime source code you get today and your dev environment already installed the new version and you do not want to create a new clean build environment that is without the package, you can add `--cmake_extra_defines FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your build command to solve the problem.	2023-09-12 22:39:31 -07:00
Chi Lo	b827ab0efc	[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468 ) If building ORT TRT with `--use_tensorrt_oss_parse` (meaning ORT wil include [oss onnx-tensorrt parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82) and build it from source) ,the cmake CUDA_INCLUDE_DIR variable is needed. if not, you will encounter following [ build error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda): CMake Error: The following variables are used in this project, but they are set to NOTFOUND. Please set them or make sure they are set and tested correctly in the CMake files: /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR Note: Not quite sure why in the past when CI still tested with oss parser won't hit this issue. probably the CUDA_INCLUDE_DIR was defined somewhere back then.	2023-09-08 20:34:57 -07:00
Caroline Zhu	dcc93909b4	Add training WASM generation to Web CI pipeline (#17319 ) ### Description [Successful pipeline run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results) Added flag to build the training artifacts & updated the pull-wasm-artifacts script to pull the training artifacts as well. Bundled into this PR are minor formatting fixes + naming fixes. ### Motivation and Context [This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended the WASM API wrapper to build training WASM artifacts as well. The ORT training WASM artifacts are required to support ORT training web bindings.	2023-09-08 15:49:47 -07:00
Changming Sun	bc84f52633	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470 ) ### Description Update C/C++ dependencies abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint to newer versions per request of @ mayeut. He created the following PRs to update the deps: https://github.com/microsoft/onnxruntime/pull/15432 https://github.com/microsoft/onnxruntime/pull/15434 https://github.com/microsoft/onnxruntime/pull/15435 https://github.com/microsoft/onnxruntime/pull/15436 https://github.com/microsoft/onnxruntime/pull/15437 However, our build system needs to fetch the dependencies from an internal mirror that only Microsoft employees have write access to. So I closed his PRs and created this one. This PR also updates abseil to a newer version. This is to prepare for upgrading re2.	2023-09-08 13:35:04 -07:00
Yulong Wang	110a2d0b73	[build][wasm] add js_internal_api.js to link dependency (#17407 ) ### Description add js_internal_api.js to link dependency. Now changes to js_internal_api.js will correctly trigger re-link of ort-wasm.wasm	2023-09-05 20:40:40 -07:00
Changming Sun	c6b0d185b4	Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856 ) ### Description 1. Update docker files and their build instructions. ARM64 and x86_64 can use the same docker file. 2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8 AB#18990	2023-09-05 18:12:10 -07:00
Lennart Hannink	e3bb2a0cdd	Fix git working dir for ORT_BUILD_INFO (fixes #17197 ) (#17198 ) ### Description Git commands producing `git-commid-id` and `git-branch` are always run in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`) ### Motivation and Context Please refer to corresponding issue [#17197](https://github.com/microsoft/onnxruntime/issues/17197).	2023-09-05 09:20:49 -07:00
cloudhan	6ea3908db4	Add ck's streamk and splitk gemm impl (#17280 )	2023-09-04 11:49:07 +08:00
aciddelgado	44101e8771	Flash Attention v2 MHA (#17227 ) ### Description Integrate Flash Attention V2 to PackedMultiHeadAttention, MultiHeadAttention and Attention operators. Flash Attention v2 source code is from https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src. We did some change to remove dependency on Torch, then removed backward and bfloat16 related code. Add benchmark script (see benchmark_mha.sh) to compare different attention kernels for MultiHeadAttention operator. Current limitations for Flash Attention in PackedMultiHeadAttention, MultiHeadAttention and Attention operators: * Relative Position Bias is not supported * Different hidden size for Q and V is not supported * Only float16 is supported * Padding/attention mask is not supported * For MultiHeadAttention, when there is past or present input, bias shall be provided to activate flash attention * For Attention, past or present inputs will deactivate flash attention * Causal is not supported Some limitations (like attention mask and causal) might be removed later. Currently, Flash Attention v2 only works in Linux. For Windows, we will enable later with Cutlass 3.2. Two environment variables can be used for testing purpose: (1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default value is 0 (enable). Set it to "1" to disable it. (2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is "513", which means that we only enable flash attention when sequence length is larger than 512 for packed QKV format. Set it to "0" if you want to use flash attention v2 whenever possible. ### Speedup The following result is from Standard_ND96amsr_A100_v4 VM (A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per second for MultiHeadAttention operator. There are 3 input formats: * `Q,K,V` means separated inputs query, key and value of BxSxNH * `Q,KV` means packed KV, where key is 5D: BxSxNx2xH * `QKV` means packed QKV, where query is 5D: BxSxNx3xH Note that flash attention cannot use packed QKV format, so extra Transpose is needed. We found that TensorRT kernel is faster for sequence length <= 512 for packed QKV. The reason might be no transpose is needed for TensorRT kernel in this format. We also notice that, TensorRT kernel is faster for stable diffusion 512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while flash attention v2 is faster for 1024x1024 image (see seq_len=16384, heads=8, head_dim=40 below). input format \| batch size \| sequence length \| heads \| head dim \| flash_v2 (TFLOPs/s) \| TensorRT (TFLOPs/s) \| Memory Efficient Attention (TFLOPs/s) -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Q,K,V \| 32 \| 512 \| 64 \| 32 \| 78.1 \| 60.0 \| 39.3 Q,K,V \| 32 \| 512 \| 128 \| 16 \| 46.8 \| 44.1 \| 21.7 Q,K,V \| 16 \| 1024 \| 64 \| 32 \| 99.0 \| 72.8 \| 44.3 Q,K,V \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 49.2 \| 23.4 Q,K,V \| 8 \| 2048 \| 64 \| 32 \| 113.8 \| 81.2 \| 47.8 Q,K,V \| 8 \| 2048 \| 128 \| 16 \| 59.7 \| 51.9 \| 24.7 Q,K,V \| 4 \| 4096 \| 64 \| 32 \| 122.5 \| 85.6 \| 49.7 Q,K,V \| 4 \| 4096 \| 128 \| 16 \| 62.5 \| 53.3 \| 25.3 Q,K,V \| 2 \| 8192 \| 64 \| 32 \| 127.4 \| 87.5 \| 50.7 Q,K,V \| 2 \| 8192 \| 128 \| 16 \| 64.0 \| 54.2 \| 25.6 Q,K,V \| 1 \| 16384 \| 64 \| 32 \| 129.5 \| 91.0 \| 51.2 Q,K,V \| 1 \| 16384 \| 128 \| 16 \| 64.7 \| 54.5 \| 25.8 Q,K,V \| 1 \| 4096 \| 8 \| 40 \| 51.0 \| 43.6 \| 36.8 Q,K,V \| 1 \| 4096 \| 8 \| 80 \| 97.7 \| 77.0 \| 55.5 Q,K,V \| 1 \| 4096 \| 8 \| 160 \| 120.0 \| 39.7 \| 57.8 Q,K,V \| 4 \| 4096 \| 8 \| 40 \| 89.0 \| 84.4 \| 49.2 Q,K,V \| 4 \| 4096 \| 8 \| 80 \| 133.0 \| 92.2 \| 63.2 Q,K,V \| 4 \| 4096 \| 8 \| 160 \| 164.8 \| 42.7 \| 63.8 Q,K,V \| 1 \| 16384 \| 8 \| 40 \| 96.9 \| 91.3 \| 52.1 Q,K,V \| 1 \| 16384 \| 8 \| 80 \| 142.9 \| 101.5 \| 65.6 Q,K,V \| 1 \| 16384 \| 8 \| 160 \| 177.4 \| 44.2 \| 65.7 Q,K,V \| 128 \| 128 \| 12 \| 64 \| 29.0 \| 26.9 \| 25.7 Q,K,V \| 64 \| 128 \| 12 \| 64 \| 23.1 \| 10.8 \| 21.3 Q,K,V \| 128 \| 384 \| 12 \| 64 \| 83.5 \| 60.8 \| 55.7 Q,K,V \| 64 \| 384 \| 12 \| 64 \| 72.6 \| 40.5 \| 52.8 Q,K,V \| 128 \| 512 \| 12 \| 64 \| 98.9 \| 77.9 \| 62.1 Q,K,V \| 64 \| 512 \| 12 \| 64 \| 94.7 \| 75.6 \| 60.4 Q,KV \| 32 \| 512 \| 64 \| 32 \| 85.9 \| 41.1 \| 41.1 Q,KV \| 32 \| 512 \| 128 \| 16 \| 47.1 \| 21.6 \| 21.6 Q,KV \| 16 \| 1024 \| 64 \| 32 \| 104.4 \| 45.8 \| 45.8 Q,KV \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 23.6 \| 23.6 Q,KV \| 8 \| 2048 \| 64 \| 32 \| 116.8 \| 48.5 \| 48.5 Q,KV \| 8 \| 2048 \| 128 \| 16 \| 59.8 \| 24.7 \| 24.7 Q,KV \| 4 \| 4096 \| 64 \| 32 \| 124.2 \| 50.1 \| 50.1 Q,KV \| 4 \| 4096 \| 128 \| 16 \| 62.6 \| 25.3 \| 25.3 Q,KV \| 2 \| 8192 \| 64 \| 32 \| 128.5 \| 50.8 \| 50.9 Q,KV \| 2 \| 8192 \| 128 \| 16 \| 64.1 \| 25.6 \| 25.6 Q,KV \| 1 \| 16384 \| 64 \| 32 \| 129.4 \| 51.2 \| 51.2 Q,KV \| 1 \| 16384 \| 128 \| 16 \| 64.8 \| 25.8 \| 25.8 Q,KV \| 1 \| 4096 \| 8 \| 40 \| 67.5 \| 37.7 \| 37.5 Q,KV \| 1 \| 4096 \| 8 \| 80 \| 101.3 \| 56.7 \| 56.6 Q,KV \| 1 \| 4096 \| 8 \| 160 \| 124.0 \| 58.6 \| 58.6 Q,KV \| 4 \| 4096 \| 8 \| 40 \| 90.8 \| 49.8 \| 49.8 Q,KV \| 4 \| 4096 \| 8 \| 80 \| 135.6 \| 63.8 \| 63.8 Q,KV \| 4 \| 4096 \| 8 \| 160 \| 166.3 \| 64.5 \| 64.5 Q,KV \| 1 \| 16384 \| 8 \| 40 \| 97.5 \| 52.3 \| 52.3 Q,KV \| 1 \| 16384 \| 8 \| 80 \| 143.5 \| 65.9 \| 65.8 Q,KV \| 1 \| 16384 \| 8 \| 160 \| 178.4 \| 65.9 \| 65.8 Q,KV \| 128 \| 128 \| 12 \| 64 \| 26.8 \| 48.1 \| 30.9 Q,KV \| 64 \| 128 \| 12 \| 64 \| 28.0 \| 38.9 \| 25.0 Q,KV \| 128 \| 384 \| 12 \| 64 \| 97.7 \| 61.1 \| 61.0 Q,KV \| 64 \| 384 \| 12 \| 64 \| 89.5 \| 57.8 \| 57.9 Q,KV \| 128 \| 512 \| 12 \| 64 \| 111.9 \| 66.7 \| 66.9 Q,KV \| 64 \| 512 \| 12 \| 64 \| 107.2 \| 64.9 \| 64.8 QKV \| 32 \| 512 \| 64 \| 32 \| 77.2 \| 84.7 \| 39.3 QKV \| 32 \| 512 \| 128 \| 16 \| 43.4 \| 53.1 \| 20.9 QKV \| 16 \| 1024 \| 64 \| 32 \| 98.8 \| 87.4 \| 44.6 QKV \| 16 \| 1024 \| 128 \| 16 \| 52.0 \| 54.1 \| 23.2 QKV \| 8 \| 2048 \| 64 \| 32 \| 113.1 \| 89.0 \| 47.9 QKV \| 8 \| 2048 \| 128 \| 16 \| 58.2 \| 54.6 \| 24.5 QKV \| 4 \| 4096 \| 64 \| 32 \| 120.6 \| 89.7 \| 49.7 QKV \| 4 \| 4096 \| 128 \| 16 \| 61.7 \| 54.6 \| 25.2 QKV \| 2 \| 8192 \| 64 \| 32 \| 125.9 \| 89.5 \| 50.7 QKV \| 2 \| 8192 \| 128 \| 16 \| 63.6 \| 54.8 \| 25.5 QKV \| 1 \| 16384 \| 64 \| 32 \| 128.5 \| 92.0 \| 51.2 QKV \| 1 \| 16384 \| 128 \| 16 \| 64.6 \| 54.8 \| 25.7 QKV \| 1 \| 4096 \| 8 \| 40 \| 60.2 \| 69.8 \| 38.1 QKV \| 1 \| 4096 \| 8 \| 80 \| 101.6 \| 75.2 \| 56.7 QKV \| 1 \| 4096 \| 8 \| 160 \| 130.2 \| 41.2 \| 58.4 QKV \| 4 \| 4096 \| 8 \| 40 \| 90.6 \| 91.0 \| 49.5 QKV \| 4 \| 4096 \| 8 \| 80 \| 133.6 \| 98.1 \| 62.8 QKV \| 4 \| 4096 \| 8 \| 160 \| 165.3 \| 43.7 \| 63.9 QKV \| 1 \| 16384 \| 8 \| 40 \| 97.2 \| 92.8 \| 52.1 QKV \| 1 \| 16384 \| 8 \| 80 \| 143.0 \| 103.1 \| 65.6 QKV \| 1 \| 16384 \| 8 \| 160 \| 177.6 \| 44.5 \| 65.7 QKV \| 128 \| 128 \| 12 \| 64 \| 31.1 \| 65.9 \| 27.6 QKV \| 64 \| 128 \| 12 \| 64 \| 26.1 \| 49.8 \| 23.5 QKV \| 128 \| 384 \| 12 \| 64 \| 84.6 \| 88.5 \| 56.1 QKV \| 64 \| 384 \| 12 \| 64 \| 79.1 \| 80.3 \| 53.5 QKV \| 128 \| 512 \| 12 \| 64 \| 97.3 \| 114.2 \| 62.2 QKV \| 64 \| 512 \| 12 \| 64 \| 95.9 \| 110.7 \| 60.6 QKV \| 4 \| 2048 \| 32 \| 128 \| 125.26 \| 44.72 \| 78.15 QKV \| 4 \| 4096 \| 32 \| 128 \| 141.62 \| 46.29 \| 85.84 QKV \| 8 \| 2048 \| 32 \| 128 \| 127.40 \| 45.49 \| 78.75 QKV \| 8 \| 4096 \| 32 \| 128 \| 144.24 \| 46.60 \| 86.95 ### Known Issues NVCC uses huge memory while compiling flash attention CUDA kernel. Linux build with CUDA might fail when machine has limited memory while number of CPUs is large. Walkaround is to use a build machine with larger memory, or use argument like `--nvcc_threads 1` to limit nvcc threads in build. ### Motivation and Context Increases speed and efficiency of MHA or Packed MHA. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-08-31 13:52:21 -07:00
Wanming Lin	3a53836836	[WebNN EP] Fix compilation with newer flatbuffers (#17367 )	2023-08-31 10:22:15 -07:00
Artem Shilkin	6e60dba726	Fix compilation with newer flatbuffers (#17164 ) In flatbuffers@v23.5.9 was broken forward declaration for FlatBufferBuilder. Trying to compile onnxruntime falls with the following error: ``` flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder') typedef FlatBufferBuilderImpl<false> FlatBufferBuilder; ^ onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here class FlatBufferBuilder; ``` This PR removes these declarations and puts includes instead	2023-08-29 10:28:26 -07:00
Caroline	228db24317	Add training API functions to WASM API (#16521 ) ### Description * Created `wasm/training_api` source and header files & modified WebAssembly CMake to include training flags * The `wasm/training_api` files use an `OrtTrainingManager` handle which is a struct of an OrtCheckpointState and an OrtTrainingSession, rather than creating a CheckpointState handle & a separate TrainingSession handle. * This is so that the TypeScript side only has to manage one handle that will be passed between TrainingSession & CheckpointState representations, rather than the TypeScript side managing separate CheckpointStateHandle and TrainingSessionHandle. ### Motivation and Context WASM API needs to be updated with ORT training API function calls so that ORT training web bindings can be added for on-device training. --------- Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: carzh <carolinezhu@microsoft.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-08-28 11:05:02 -07:00
Arthur Islamov	c262879214	Added DML and CUDA provider support in onnxruntime-node (#16050 ) ### Description I've added changes to support CUDA and DML (only on Windows, on other platforms it will throw an error) ### Motivation and Context It fixes this feature request https://github.com/microsoft/onnxruntime/issues/14127 which is tracked here https://github.com/microsoft/onnxruntime/issues/14529 I was working on StableDiffusion implementation for node.js and it is very slow on CPU, so GPU support is essential. Here is a working demo with a patched and precompiled version https://github.com/dakenf/stable-diffusion-nodejs ---------	2023-08-25 16:57:06 -07:00
Yulong Wang	79c4ed9a45	[js/webgpu] support error pop and kernel name (#17260 ) ### Description This PR contains changes to support error pop and kernel name. - Add a function `JsepGetNodeName` to allow reading kernel name from JS to C++ - When in debug mode ( `env.debug = true;` ) or in profiling mode ( `env.webgpu.profilingMode = 'default';` ), kernel name will be read from ORT; otherwise use the kernel pointer ( a number ) as kernel name to save calls from JS to C++. - When in debug mode, WebGPU validation errors will be recorded and if any error occurs, `inferenceSession.run()` will fail (Promise get rejected). Behavior when not in debug mode is not changed. This is because recording errors are not zero-overhead, and GPU validation errors should occur consistently in and not in debug mode. - Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to: - allow implementation of the features mentioned above. - pass session ID to backend.	2023-08-25 08:08:15 -07:00
mindest	735cc8e6c8	[ROCm] enable If op for ROCm EP. (#17279 ) ### Description Enable If op for ROCm EP.	2023-08-25 17:49:49 +08:00
Baiju Meswani	fca81cc5d5	ConvTransposeGrad CUDA Kernel (#17201 )	2023-08-24 09:08:06 -07:00
cloudhan	87bef1f3f2	Move composable_kernel to deps.txt (#17245 )	2023-08-23 17:39:16 -07:00
kunal-vaishnavi	edac3ef150	Add LLaMA scripts (#17020 ) ### Description This PR adds the following scripts for LLaMA: - LLaMA conversion (support for TorchScript and Dynamo exporters) - LLaMA parity - LLaMA benchmark - LLaMA quantization - LLaMA integration with [Hugging Face Optimum](https://github.com/huggingface/optimum) ### Motivation and Context This PR adds scripts for using LLaMA. There is a [follow-up PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding scripts for Whisper.	2023-08-22 18:05:11 -07:00
Edward Chen	bd8a488f4b	Enable verbose logging in unit test program with environment variable. (#17133 ) Enable verbose logging in unit test program with environment variable. E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.	2023-08-22 12:13:52 -07:00
cloudhan	4e6cec4d09	Update ck and enable test (#16383 ) Apply the fix in https://github.com/ROCmSoftwarePlatform/composable_kernel/issues/728 Introduce more kernel instances and allow the introduction of streamk and splitk.	2023-08-22 11:08:55 +08:00
Sheil Kumar	cbaa008391	Bump DirectML version from 1.12.0 to 1.12.1 (#17225 ) Bump DirectML version from 1.12.0 to 1.12.1 Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-08-20 09:55:38 -07:00
Changming Sun	3cec88bd12	FIX: memory leak checker is incompatible with std::stacktrace (#17209 ) ### Description When I worked on PR #17173, I didn't notice that onnxruntime\core\platform\windows\debug_alloc.cc also needs to call dbghelp functions like SymInitialize. So, if we use vc runtime's stacktrace functionality, vc runtime will initialize/uninitialize the dbghelp library independently and vc runtime's stacktrace helper DLLs get unloaded before our memory leak checker starts get work. Then we call SymSetOptions, it crashes. More details: In VC runtime the C++23 stacktrace functions are implemented on top of dbgeng.dll. In C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp, you can see it has: ``` dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32); ``` The dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize and SymCleanup. dbgeng.dll gets unloaded before our memory leak check starts to run. In theory we should be able to call SymInitialize again if the previous user who called SymInitialize has also called SymCleanup. However, users can use SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to register callback functions to dbghelp.dll. These callback functions need to be alive when SymSetOptions(and some other dbghelp APIs) get called. ### Motivation and Context	2023-08-18 17:10:33 -07:00
Changming Sun	ee09a5ff35	Add DISABLE_CUSPARSE_DEPRECATED flag to CUDA build (#17207 ) This is to suppress a warning and make Windows CUDA 12.2 build work.	2023-08-18 10:25:49 -07:00
Chi Lo	2fb148dd88	Temporarily enforce "Debug build" TRT EP with trt oss parser on Windows (#17059 ) This PR handles two changes: 1. There is an issue when running "Debug build" TRT EP with "Release build" TRT builtin parser on Windows. Enforce use oss parser for Debug build. Note: args.config in build.py is an array, for example ["Debug", "Release"...]. The code will be much mess if we made the change there. 2. Update to use latest commit of oss parser. Please see the https://github.com/microsoft/onnxruntime/issues/16273	2023-08-17 12:17:25 -07:00
Changming Sun	5249b7ab7c	Re-implement stacktrace (#17173 ) ### Description Re-implement stacktrace. The new implementation doesn't directly use Windows API, hence can avoid problems regarding to initialize/uninitialize the dbghelp library. ### Motivation and Context	2023-08-16 16:07:49 -07:00
Dmitri Smirnov	f45eef399e	Fix visualization issues with Attribute/Tensor protos (#17188 ) ### Description Protobuf Natvis	2023-08-16 13:56:51 -07:00
RandySheriffH	3dd2c1b4d7	EP context for custom op (#16454 ) Implement infrastructures to allow EP resources surfaced to custom ops. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-08-16 13:03:40 -07:00
Maximilian Müller	7b9d1f18c7	NVTX windows include and link fixes (#16831 ) ### Description For windows headers are not duplicated to the normal cuda include. For linux they are: ``` (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 \| grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include \| grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h ``` Is the preference via those added defines or should the include just be changed to be `nvtx3/` ? Also there is no library linking needed on Windows and the library is not even present.	2023-08-16 11:53:58 -07:00
Changming Sun	8e203efc69	Cleanup cmake file (#17154 ) ### Description 1. Clean up cmake files. Remove some unused code 2. Remove the "Semmle" task from tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is deprecated and replaced by CodeQL.	2023-08-15 10:51:33 -07:00
Matthieu Darbois	5e971bc51a	Rework WIL dependency retrieval/usage (#17130 ) ### Description 1. `onnxruntime_fetchcontent_makeavailable` works around unconditional install commands so that can be used instead of `FetchContent_Populate` 2. This dependency is Windows specific, mark it as such. ### Motivation and Context 1. This simplifies `cmake/external/wil.cmake` not to do anything specific wether WIL was fetched or found 2. Given it's specific to Windows, it might not be available on other OS in specific air-gapped environment such as [conan-center-index](https://github.com/conan-io/conan-center-index). This allows downstream builds not to require specific patches for something not required by the build in the first place.	2023-08-15 09:11:46 -07:00
Wenbing Li	d052c8a45c	Remove the extensions submodule (#17097 ) ### Description Remove the onnxruntime-extensions submodule since it now was used via cmake FetchContent ### Motivation and Context The submodule relies on an outdated version of the extensions, and the build instructions should be updated to eliminate any confusion.	2023-08-14 10:16:33 -07:00
Yulong Wang	5704e71b89	update onnx.patch to apply wasm build break fix (#17104 ) ### Description This PR fixes build break for WebAssembly introduced in `6986981482` (`435ad2b1d8`). This change updates onnx.patch in onnxruntime repo. the corresponding PR in onnx repo is: https://github.com/onnx/onnx/pull/5495. It may takes a while for the next onnx version bump.	2023-08-11 15:00:39 -07:00
Changming Sun	4728f20f9a	Fix CI build (#17118 ) ### Description Some pipelines are failing. It is because PR #16325 set ONNX version to `rel-1.14.1` . It is a branch name, not a commit or tag name. It means whenever the branch got a new commit, we will auto pick it and use it.	2023-08-11 10:56:38 -07:00
Yulong Wang	9cd4e5af68	[wasm] upgrade emsdk to 3.1.44 (#17069 ) ### Description This change upgrade emsdk to 3.1.44. Because backend is upgraded to LLVM 16, so need to fix a lot of build failures caused by "-Wshorten-64-to-32". most of the build failures comes from generated `onnx.pb.h`, and this can be fixed by including "core/graph/onnx_protobuf.h", which detects and ignore shorten-64-to-32 warnings.	2023-08-10 16:08:36 -07:00
Bowen Bao	6986981482	Bump ONNX version (#16325 ) ### Description Bump ONNX version to https://github.com/onnx/onnx/tree/rel-1.14.1 to include a fix for segfault when shape inferencing nested onnx functions. ### Motivation and Context Resolves #16170	2023-08-10 11:27:28 -07:00
Jeff Daily	dbbfc249f7	[ROCm] update header and binary search paths used by cmake (#17083 ) This is in preparation for planned ROCm 6.0 changes that are not backward compatible. However, the adjustments made by this PR to the current onnxruntime cmake files will work with ROCm 5.x and 6.x.	2023-08-10 11:05:21 +08:00
Changming Sun	7d340256f1	Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062 ) ### Description 1. Add "--windows_sdk_version" argument to build.py 2. Fix Windows Static Analysis build pipeline. It is failing because it picks up a different Windows SDK version after a build machine image update. If we can explicitly specify Windows SDK version, we can avoid such things happening again. 3. Remove --enable_training from Windows Static Analysis build pipeline because PR #16993 makes it incompatible with "no_rtti". AB#18315	2023-08-09 14:01:16 -07:00
sfatimar	2c5d4dce77	Openvino ep ort 5.1 (#17042 ) OpenVINO EP ORT 5.1 Branch Changes for the new API to take in OpenVINO Provider Options and compatibility with OV 2023.1 ### Motivation and Context The change is required for the new API to take in OpenVINO Provider Options and make it seamless. --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-08-09 11:50:10 -07:00
Dmitri Smirnov	07dfe34714	Fix FunctionProto visualization (#17063 ) ### Description Title ### Motivation and Context Need to debug function protos	2023-08-09 11:05:52 -07:00
Baiju Meswani	249917a093	Add mac and windows python packages for onnxruntime-training (#16993 )	2023-08-07 20:32:55 -07:00
Chen Fu	3c10f027de	4b quantization for weights of LLMs (#16833 ) ### Description Blockwise 4b quantization for LLMs. 1. Introduce 4b block-wise quantization for linear layer weights. 2. Implements matrix multiplication kernel for fp32 x int4 3. Implements special operator MatMulFpQ4 4. Implements quantization tool, that convert MatMul operator to MatMulFpQ4, when the right hand side is 2D const tensor. ### Motivation and Context Compress and accelerate LLMs \|Benchmark \| Time(ns)\| \|-------------\|----------\| \|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8\| 218054\| \|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8\| 35830155\| \|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8\| 73479790\| \|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8\| 270152\| \|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8\| 35826721\| \|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8\| 73021200\| \|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8\| 213832\| \|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8\| 36749874\| \|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8\| 72618120\| \|Benchmark \| Time(ns)\| \|-------------\|----------\| \|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8\| 522610\| \|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8\| 39237689\| \|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8\| 75983467\| --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-08-07 12:23:55 -07:00
Dmitri Smirnov	d5e4bdbe7d	Fix protobuf TaggedStringPtr display (#17008 ) ### Description <!-- Describe your changes. --> Adjust nativs to display tagged strings. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Hard to debug without seeing names.	2023-08-04 17:51:01 -07:00
pengwa	a6887f171f	Refactor schema extraction and output unflattening (#16894 ) ### Motivation and Context When we handle PyTorch models' inputs in different places (ORTModule or others), it's common for us to flatten a structured data into a 1-D tensor list (required by lib for example torch.onnx.export, torch.autograd.Function.forward or ORT inference session), then do subsequent work, then unflatten back to original hierarchy as returned values. DeepStage3 hooks support work also need such a lib to do similar things, so I was proposing to extract this pair of APIs in training/utils/, which can be more used more generally. Also a comprehensive set of test data are used for testing unflatten/flatten in unit tests. Let me know if you have any other suggestions. ### Refactor schema extraction and output unflattening Move `_extract_schema` and `unflatten_user_output` in `orttraining/orttraining/python/training/ortmodule/_io.py` . to `extract_data_and_schema` and `unflatten_data_using_schema` in `orttraining/orttraining/python/training/utils/torch_io_helper.py` as shared libs, which can be used later by other features (deepspeed stage 3 hook rewrite). While there are still a few duplicated logic handling flatten with different task by recursively loop the data struct, will change them step by step in case of heavy review efforts.	2023-08-04 13:58:21 +08:00
Jeff Daily	1629a6fa75	[ROCm] add gfx1100 and gfx1101 to CMAKE_HIP_ARCHITECTURES (#16972 ) ### Description Support additional AMD GPU architectures. ### Motivation and Context AMD announced expanding support for additional GPUs. https://community.amd.com/t5/rocm/new-rocm-5-6-release-brings-enhancements-and-optimizations-for/ba-p/614745 This PR is how we will deliver that expanded support to onnxruntime.	2023-08-04 08:38:42 +08:00
Michael Klimenko	07e6648e12	Enable Intel oneAPI DPC++/C++ compiler build (#16587 ) Last week I fixed error #16484 found when trying to build onnxruntime with the icpx compiler. Another thing I found out is that icpx uses -ffast-math flag by default. You can check it by running the compiler with -v flag like following: ```bash # Setup the environment . /opt/intel/oneapi/setvars.sh # Compile any file to see all the implicit flags icpx -v main.cpp ``` This leads to a bunch of warnings during the build like: ```bash In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/cpu/tensor/upsample_op_test.cc:5: In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/provider_test_utils.h:6: In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/checkers.h:10: In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/core/util/math_cpuonly.h:68: In file included from /mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/Core:172: /mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/src/Core/MathFunctions.h:1019:12: warning: comparison with NaN always evaluates to false in fast floating point modes [-Wtautological-constant-compare] return isnan EIGEN_NOT_A_MACRO (x); ^~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` And some tests are failing as well, usually with infinities involved. To list a few: ```bash # ... 1: [ FAILED ] IsInfTest.test_isinf_float 1: [ FAILED ] IsInfTest.test_isinf_double 1: [ FAILED ] IsInfTest.test_isinf_positive_float 1: [ FAILED ] IsInfTest.test_isinf_positive_double 1: [ FAILED ] IsInfTest.test_isinf_negative_float 1: [ FAILED ] IsInfTest.test_isinf_negative_double 1: [ FAILED ] IsNaNOpTest.IsNaNFloat 1: [ FAILED ] IsNaNOpTest.IsNaNDouble # ... ``` This PR adds a quick global check for the IntelLLVM compiler, as in the way its name is reported by CMake and then, depending on the compiler driver, sets either MSVC-like or GCC-like switch to disable fast-maths. Probably a bit cleaner solution would be to use ```target_compile_options(${TARGET} PRIVATE MEOW)``` instead of a global-wide ```set(CMAKE_CXX_FLAGS MEOW)```, but then we'd be required to add it to all the individual targets and execution providers and this will lead to a lot of code duplication.	2023-08-02 12:50:35 -07:00
Chi Lo	f4faceab28	Ignore deprecated declarations warning for TRT EP build (#16948 ) In additions to `onnxruntime_test_all`, `onnxruntime_shared_lib_test` and `onnxruntime_customopregistration_test` should also add "-Wno-deprecated-declarations" flag to ignore compiler warning	2023-08-02 09:51:58 -07:00
Changming Sun	73ddba964f	Update the MacOS/Linux build scripts that build/install protobuf from source (#16906 ) ### Description 1. As a follow-up of #16761, this PR allows build ORT on iOS/Android without the need to explicitly specify a protoc path. #16761 is for WASM. This one is for iOS/Android 2. Update the MacOS/Linux build scripts that build/install protobuf from source. Make them be more flexible. Add the support for RedHatEnterprise(ubi), which will needed for upgrading the base image from centos:7 to ubi:8. 3. Update tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile : the docker file's base image has preinstalled protobuf in /usr/local, we should uninstall them to avoid conflicts.	2023-07-31 10:51:48 -07:00
Dmitri Smirnov	50764362ac	Update protobuf Natvis visualization (#16911 ) ### Description Protobuf library update broke debug visualization. ### Motivation and Context Hard to debug	2023-07-31 09:35:21 -07:00
satyajandhyala	dd24d52737	[JS/Web] Added Gelu contrib operator support to JSEP (#16909 ) ### Description Added Gelu operator to JSEP ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-07-31 09:18:58 -07:00
Tianlei Wu	742edec5e8	[CUDA] Add PackedMultiHeadAttention operator (#16779 ) ### Description Add new operator for MultiHeadAttention with inputs removed padding. This only supports packed QKV format.	2023-07-28 16:35:38 -07:00
Changming Sun	161a9d1d6d	Add some safety check for conv op (#16839 ) ### Description Add some safety check for conv op. It is to validate if the attributes coming from a conv op are in a valid range. (shouldn't be too large or too small).	2023-07-27 16:37:55 -07:00
Yi Zhang	2e214d6e27	Workaround to upgrade VS2022 for Windows ARM build (#16826 ) ### Description ### Motivation and Context It should be reverted when VS2022 is upgraded to 17.7 or above. ### Vefication https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=331401&view=logs&j=7517abfd-115a-5c61-78a0-7ba3c9e3a88d	2023-07-25 08:35:52 +08:00
Arthur Islamov	210d29b40e	Allow --build_wasm on a mac system (#16761 ) ### Description Changes allow downloading prebuilt protoc compiler when building WebAssebly version on mac systems. Otherwise it tries to build a js/wasm version of protoc and throws an error while executing it: "protoc.js permission denied" ### Motivation and Context I need to switch between my main working computer and a PC to make changes to WebAssebly build. Would like not to do that anymore.	2023-07-21 14:21:37 -07:00
Jeff Daily	bb136f86c8	[ROCm][MIGraphX] for googletest dep, set OVERRIDE_FIND_PACKAGE (#16715 ) Otherwise, an unsupported version of gtest/gmock will be found at /opt/conda/include for ROCm builds. Though this issue was initially found for ROCm builds, the issue is generic. onnxruntime requires a specific version of googletest and should not rely on locating googletest using find_package. The ROCm error was: ``` In file included from /opt/conda/include/gmock/gmock-spec-builders.h:75, from /opt/conda/include/gmock/gmock-generated-function-mockers.h:47, from /opt/conda/include/gmock/gmock-function-mocker.h:39, from /opt/conda/include/gmock/gmock.h:61, from /stage/onnxruntime/onnxruntime/test/util/test_utils.cc:17: /opt/conda/include/gmock/gmock-matchers.h: In instantiation of ‘bool testing::internal::PointwiseMatcher<TupleMatcher, RhsContainer>::Impl<LhsContainer>:: MatchAndExplain(LhsContainer, testing::MatchResultListener*) const [with LhsContainer = const gsl::span<const float>&; TupleMatcher = testing::internal:: FloatingEq2Matcher<float>; RhsContainer = gsl::span<const float>]’: /opt/conda/include/gmock/gmock-matchers.h:2303:10: required from here /opt/conda/include/gmock/gmock-matchers.h:2312:48: error: no type named ‘const_iterator’ in ‘testing::internal::PointwiseMatcher<testing::internal:: FloatingEq2Matcher<float>, gsl::span<const float> >::Impl<const gsl::span<const float>&>::LhsStlContainer’ {aka ‘class gsl::span<const float>’} ```	2023-07-21 00:57:38 +08:00
Edward Chen	f236768d5c	[ios] Enable `--use_extensions` with custom built iOS pod (#16711 ) - Fix link errors by including the needed onnxruntime-extensions libraries in the static framework. - Add Objective-C API to register custom ops from embedded onnxruntime-extensions. Caveat: Not all onnxruntime-extensions build options are working yet. E.g., building with the onnxruntime-extensions OpenCV dependency does not work.	2023-07-14 15:37:16 -07:00
Dmitri Smirnov	853c4ff0a5	[C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506 ) ### Description Introduce `Float16/BFloat16` support for C# and C++ APIs. User should be able to perform conversions from `float` to/from `Float16/BFloat16`, compare values and tests for `NaN, Inifnity, and whether the number is denormalized.` ### Motivation and Context User filed issues such as: https://github.com/microsoft/onnxruntime/issues/14303	2023-07-14 10:46:52 -07:00
Dipanjan Sengupta	a461608409	Amx flag removal (#16527 ) ### Description 1. Replacing AMX intrinsics with machine code macros in QGEMM kernel. 2. Removing AMX build flags for GCC in cmake file. 3. Fixing the link time optimization (LTO) issue introduced with asm .include of an assembly file. I have moved the AMX instruction macro definitions from QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue. Note that I am also pushing the macros defined in QgemmU8S8KernelAmxCommon.S for future reference. A special thanks to @laxmansole who helped in the development of the instruction macro definitions for AMX intrinsics and fixing the LTO issue. ### Motivation and Context The additional AMX flag in cmake adds an extra layer of dependency on GCC version to use the feature.These changes should allow the usage of the AMX feature with just the CPU ID check.	2023-07-13 11:19:49 -07:00
Vincent Wang	c07a3b869c	Triton Codegen for ORTModule (#15831 ) Fuse connected elementwise and reduce Ops to TritonOp and codegen triton code to run the kernel. This PR is co-edited by @wejoncy and @er3x3	2023-07-13 18:17:58 +08:00
mindest	b7fd5af48b	[ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657 ) ### Description - Update existing rocBLAS get_solutions API using `*_get_solutions_by_type` (supported from ROCm5.6); remove the original nested TunableOp logic. - Update kernel_explorer.	2023-07-13 11:20:26 +08:00
cloudhan	3866614519	Avoid cmake repeatly printing DISABLE_FLOAT8_TYPES=ON (#16656 )	2023-07-13 09:29:20 +08:00
Scott McKay	ce68a4c06a	Fix Linux build failure when onnxruntime_DISABLE_ABSEIL=ON (#16373 ) ### Description <!-- Describe your changes. --> Add ort_value.h to session_options.h so OrtValue is defined. Update a unit test binary to add required include paths. Adding ort_value.h pulls in more data type headers. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #16193	2023-07-12 11:23:18 +10:00
mindest	347c963d5c	[ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196 ) ### Description - Refactor existing Triton TunableOp-related code (based on work in #15862) - Add GroupNorm Triton implementation	2023-07-11 13:55:30 +08:00
cloudhan	5fee3f4302	Remove the special min cmake for rocm (#16570 ) #15807 fixed the building error for rocm with cmake 3.26. The specialized relaxation of the cmake version is not needed anymore.	2023-07-10 13:19:48 +08:00
Edward Chen	6be7b03e53	Enable `-Wshorten-64-to-32` warning if available. (#16524 ) - Fix some warnings from Xcode build (`-Wshorten-64-to-32`). - Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet. - Some clean up in build.py including setting CMake generator more consistently.	2023-07-07 08:11:44 -07:00
Scott McKay	697dd12f6e	Re-organize the transpose optimization and layout transformation files. (#16246 ) ### Description <!-- Describe your changes. --> Split out the more basic changes from #15552 for easier review. Re-organize to clarify the structure - Separate out generic base functionality from ORT specific components - pass in handlers for internal ORT ops to Optimize - Split out layout transformation from transpose optimization - Separate out level 1 transpose optimizer - Cleanup some naming to try and clarify things like an optimizer vs. general optimization code Most of the changes are from this movement of code. Two implementation changes: - the extended handlers are queried first in GetHandler - allows the extended handlers to override the default behaviour for an ONNX operator - simplify the Optimize function to remove OptimizerMode. - `can_modify_node` is used instead of `mode` and `ignore_assigned_nodes` and a long description of the current usage is added. I don't _think_ that changes the current behavior and hopefully clarifies what happens and when, and makes the base transpose optimizer implementation more generic. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Create a cleaner separation to support adding EP specific logic next to cleanly handle where an EP has additional layout sensitive behaviour required (e.g. it's Resize implementation only handles one layout).	2023-07-07 08:24:47 +10:00
cao lei	0c5f492493	remove AllocatorMgr class (#16509 ) ### Description Remove AllocatorManager class ### Motivation and Context After the refactor PR #15833 is in, AllocatorManager class is not referenced anymore.	2023-06-28 15:43:19 -07:00
Vrajang Parikh	960e320dff	Objective C Training API: TrainingSession (#16374 ) ### Description - Implement Objective-C binding for `ORTTrainingSession` - Add `ORTUtils` utility class to handle conversion between C++ and Objective-C types - Add test case for saving checkpoint - Add unit test cases for `ORTTrainingSession` ### Motivation and Context This PR is part of implementing Objective-C bindings for training API. It implements objective-c binding for training session. The objective-C API closely resembles the C++ API. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-06-28 09:13:56 -07:00
Baiju Meswani	cbfbe210a8	Fix bug that accidentally disabled training op tests (#16488 )	2023-06-27 18:39:54 -07:00
Yifan Li	e2c214d81f	[TensorRT EP] TRT 8.6 minor version update (#16475 ) ### Description * Minor version update: TRT 8.6.0.12->8.6.1.6 * CI pipeline ymls/dockerfiles are updated * cgmanifest.json/deps.txt/download-deps.yml are updated; Win trt binaries uploaded to [win img 307029](https://aiinfra.visualstudio.com/AI%20Infra%20Management/_build/results?buildId=307029&view=results) * Re-enable unit tests which were failed in 8.6.0 and re-gained support in 8.6.1	2023-06-26 10:44:27 -07:00
Scott McKay	48eff09664	Fix file list for test of build with IO debug (#16474 ) ### Description <!-- Describe your changes. --> Update file list to adjust for recent changes to test infra. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-06-26 16:36:22 +10:00
Chen Fu	5c125b4366	Cfu revertamx (#16455 ) ### Description This is to revert two PRs that aim at reducing AMX toolchain requirements. Unfortunately we still have some pipeline issues. https://github.com/microsoft/onnxruntime/pull/16390 https://github.com/microsoft/onnxruntime/pull/16086 ### Motivation and Context Looks like gcc link time optimization does not work very well with inline assembly in the above PRs.	2023-06-23 09:20:23 -07:00
Baiju Meswani	10ba1e270c	Minimal Build for On-Device Training (#16326 ) 🛠️ __Changes in this pull request:__ This pull request introduces two significant changes to the project: - Changing on device training checkpoint format: The current implementation stores the on device training checkpoint as a sequence of tensors in multiple files inside a checkpoint folder, which can be inefficient in terms of storage and performance. In this PR, I have modified the checkpoint format to utilize the flatbuffer table to save the checkpoint to a single file, providing a more compact and efficient representation. The changes around this are twofold: - Add the checkpoint flatbuffer schema that will generate the necessary checkpoint source files. - Update the checkpoint saving and loading functionality to use the new format. - Adding support for onnxruntime minimal build: To support scenarios where binary size is a constraint, I made changes to ensure that the training build can work well with the minimal build. 🔍 __Open Issues:__ - In order to extract the optimizer type, the existing implementation re-loaded the onnx optimizer model and parsed it. This is no longer possible, since the model format can either be onnx or ort. One idea is to do the same for ort format optimizer model. This needs some investigation. - Changes to the offline tooling to generate ort format training artifacts. - End-to-end training example showcasing the use of the minimal training build. - Add support for export model for inferencing in a minimal build.	2023-06-22 12:27:23 -07:00
RandySheriffH	6e29e185f3	Clean AzureEP logics (#16367 ) Moving out AzureEP invokers out of core runtime. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-06-21 09:38:52 -07:00
Chi Lo	4e3cff60fd	CUDA graph support for TRT EP (#16081 ) CUDA EP already supports [CUDA graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs), also we observed some models can benefit from using CUDA graph with `trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP. The implementation is based on https://github.com/microsoft/onnxruntime/pull/9978 with the same [constraints](https://github.com/microsoft/onnxruntime/pull/9978) as below: - Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported. - Usage of CUDA Graphs is limited to models where-in all the model ops (graph nodes) can be partitioned to the TRT EP. - The input/output types of models need to be tensors. - Shapes of inputs/outputs cannot change across inference calls. - IObinding is required.	2023-06-21 09:36:45 -07:00
Yuhong Guo	48e6186b1a	Move tests from core/providers/cuda/test/* to test/providers/cuda/ and refactor CUDA UT (#16161 ) ### Description <!-- Describe your changes. --> 1. Add a new test lib `onnxruntime_providers_cuda_ut` which is similar to `onnxruntime_providers_cuda` but `onnxruntime_providers_cuda_ut` is only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all CUDA UTs through this ut lib without affecting production lib `onnxruntime_providers_cuda`. 2. Move all test cases from `core/providers/cuda/test/` to `test/providers/cuda/`. These test cases are built into lib `onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all --gtest_filter="CUDA_EP_Unittest"`. Since the lib is only for test, we can use gtest macros in the test cases. Previous implementation do not support using gtest lib in the CUDA UT cases. 3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a bit. A new function `onnxruntime_add_object_library` is to build a object target. The 2 libs `onnxruntime_providers_cuda_ut` & `onnxruntime_providers_cuda` share most of the code, so the object files can be used in both libs, which helps reduce build time. Another function `config_cuda_provider_shared_module` is used to configure all 3 similar targets(onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut). 4. Refactored the test to call `testing::InitGoogleTest` & `RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`. After this change, we can see all the cases running in `CUDA_EP_Unittest.All`: ![image](https://github.com/microsoft/onnxruntime/assets/19584326/8ff80df6-060b-4ef0-90b7-657e68d3db87) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> After https://github.com/microsoft/onnxruntime/pull/13016, there are still test files in test/providers/cuda/ that are not moved to core/providers/cuda/test/ and the test cases are disabled. This PR helps to clean the unfinished TODOs. Even through onnxruntime_shared_lib_test covers some test for CUDA provider. onnxruntime_shared_lib_test works like a coarse grain end-to-end test for CUDA provider. If CUDA unittest can run cases for a single component, this wound be helpful for CUDA developers. --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-06-20 14:54:55 -07:00
Prateek Chokse	12dffef768	added support for cmake "find_package" (#8919 ) Description: Adds support for cmake find_package. Motivation and Context As mentioned in issue #7150 onnxruntime doesn't have support for CMake find_package, this PR adds that and also adds the CMake package version file. Now anyone can link onnxruntime like this: ```cmake find_package(onnxruntime) add_executable(test Source.cpp) target_link_libraries(test PRIVATE onnxruntime::onnxruntime) ``` this also simplifies #3124	2023-06-19 22:20:31 -07:00
Dipanjan Sengupta	35fa6af428	Fix for the build break in AMX feature on Mac OS. (#16390 ) ### Description Fixing the build break issue in Apple pipeline due to AMX flag removal.	2023-06-16 21:00:41 -07:00
Scott McKay	8fdfd20191	Separate out operator vs model testing. (#16228 ) ### Description <!-- Describe your changes. --> Split up OpTester to separate out operator vs model testing. This led to a lot of other cleanups/refactoring. - create BaseTester class and derived OpTester/ModelTester classes to limit APIs to what is applicable for each test type - e.g. adding an attribute isn't relevant to a model test - cleanup structure - don't expose member variables either directly or via public methods returning them - split out checkers so they can be easily re-used - refactor so there's one public Check method for comparing two OrtValue instances containing any data type - refactor the GradientOpTester usage - it required a lot of OpTester internals to be exposed and no other tests needed this - it also returned Status through various parts which prevented the usage of the google test macros which provide better output. change to return void and use the macros. - fix some other minor issues - update some cmake files so all the source files are included - remove some low value helpers (FetchTensor and GetShapeVector) - remove some outdated code to allow unreleased opset versions from when onnx opset 15 wasn't released - move files from test/util/include/test to test/util/include - doesn't seem to be any reason for the additional subdirectory given they're not files use to test the code in test/util - files were moved with no changes ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Cleanup test infrastructure. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-06-17 12:58:57 +10:00
saurabh	a6ce7b339f	Enable model subgraph execution in OVEP and setting the OpenVINO dll's to the path from the OpenVINO pypi packge in OVEP and fix OVEP windows io buffer sample (#16147 ) ### Description This PR enables execution of subgraphs in OVEP and currently, when OVEP developers install the onnxruntime-openvino package on windows from pypi, they would have to additionally download OpenVINO windows binaries and run the setupvars.bat script which sets the environment PATH to locate the OV dll's. Also this PR fixes issues of OVEP windows io buffer sample. ### Motivation and Context Fix: We want to make the user experience easy for OVEP Python developers on windows platform. This fix, introduces a function add_openvino_libs_to_path at the location tools/python/util/add_openvino_win_libs.py. The above function, can be called by OVEP python users in the application code and that takes care of setting the OpenVINO dll's to the path from the OpenVINO pypi packge (openvino) which was installed. This change also makes sure that add_openvino_libs_to_path() function is added to onnxruntime python package only when it is build for OpenVINO Execution Provider for ONNXRuntime and not for default ORT python package builds. New user experience for Python OVEP developers on windows platform: step 1: pip install onnxruntime-openvino step 2: pip install openvino step 3: <Add these 2 lines in the application code> import onnxruntime.tools.add_openvino_win_libs as utils utils.add_openvino_libs_to_path() --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>	2023-06-16 19:47:09 -07:00
Silvio Traversaro	4915191e63	Fix build of Python wheel on Windows with single-config generator (#16337 ) ### Description Before this PR, the CMake code assumed that when on Windows a multiple-config CMake generator was used, while on non-Windows there was the assumption of a single-config CMake generator. After this PR this information is obtained from the [`GENERATOR_IS_MULTI_CONFIG`](https://cmake.org/cmake/help/latest/prop_gbl/GENERATOR_IS_MULTI_CONFIG.html) global CMake propery. ### Motivation and Context I discovered this problem when building with Ninja generator on Windows, but I guess this should fix problems also on non-Windows platforms when using a multiple-config generator (such as Xcode on macOS or "Ninja Multi-Config" on all platforms). See https://cmake.org/cmake/help/latest/prop_gbl/GENERATOR_IS_MULTI_CONFIG.html for more info.	2023-06-16 09:17:49 -07:00
Dipanjan Sengupta	681a0d084d	Removing AMX build flag (#16086 ) ### Description 1. Replacing AMX intrinsics with machine code macro instructions in QGEMM kernel. 2. Removing AMX build flags for GCC in cmake file. ### Motivation and Context The additional AMX flag in cmake adds an extra layer of dependency on GCC version to use the feature.These changes should allow the usage of the AMX feature with just the CPU ID check.	2023-06-15 11:22:59 -07:00
satyajandhyala	889f80082f	[js/web] Added Reduce operators support (#16122 ) ### Description Added support for ReduceL1, ReduceL2, ReduceMean, ReduceMin, ReduceMax, ReduceSum, ReduceLogSum, ReduceLogSumExp, ReduceProd and ReduceSquareSum. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com> Co-authored-by: guschmue <guschmue@microsoft.com>	2023-06-12 07:46:27 -07:00
Vrajang Parikh	67f4a4fd16	Objective-C binding for ORT training (#16127 ) ### Description Implement Objective-C binding for `ORTCheckPoint`. Additionally, - Modify `onnxruntime_objectivec.cmake` to only include training header and sources when training flag is enabled - Enable objective-c binding for `orttraining-mac-ci-pipeline` ### Motivation and Context This PR is part of implementing Objective-C bindings for training API. It implements objective-c binding for ORTCheckPoint class. The objective-C API closely resembles the C++ API. Note: The test for saving checkpoint is skipped as it requires use of training session. It will be added when the objective-c binding for `ORTTrainingSession` is added.	2023-06-07 14:01:30 -07:00
Edward Chen	1261d0b8ba	Fix some build issues on MacOS with Xcode 14.3. (#15878 ) - Fix flatbuffers flatc warning, unused-but-set-variable. - Address `-Wshorten-64-to-32` warnings (fix in our code, allow in dependencies' code). - Update CI builds to use Xcode 14.3. - Update minimum iOS version to 12.0. - Update Mac hosted agents to MacOS 13 where possible.	2023-06-07 12:07:11 -07:00
Wanming Lin	a8c2f24ae0	[WebNN EP] Merge support for segment anything into main branch (#16208 ) We implemented a number of new ops and data types to support running segment anything model on Chromium WebNN DML backend (POC) in a forked branch https://github.com/honry/onnxruntime/tree/stable-diffusion In this PR, we migrate the changes in the forked branch to main branch, includes: - 22 new ops - New tensor data types: bool, int32, uint32, uint64, int64, float16 (As JavaScript hasn't shipped Float16Array, we use Uint16Array as a workaound) - Handle empty input tensors and duplicated outputs - Fixed some nits	2023-06-07 09:56:37 -07:00
Changming Sun	7686193c40	Fix DNNL build (#16201 )	2023-06-02 09:46:03 +08:00
Xavier Dupré	e726151b5c	Introduce float 8 types (#14731 ) ### Description The PR implements FloatE4M3FN, FloatE5M2, FloatE4MEFNUZ, FloatE5M2FNUZ as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA API to cast float/half to float8 if CUDA>=11.8, a custom implementation if CUDA<11.8. * It implements, Cast, QuantizeLinear, DequantizeLinear for all types on CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA. * It extends the supported types for control flow operator, Shape, Reshape, Identity, If, Loop, Scan, Reshape * It implements Equal(19). * Cast, QuantizeLinear, DequantizeLinear operators now support a parameter `saturate` only valid for float 8 types. It is true by default. In that case, any value out of range is converted into the maximum float 8 value. If false, it is infinite. * QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA (and ROCm by extension), scale = 1D tensor with one scale per channel ### Motivation and Context Supports latest onnx version. Fixes [AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395) --------- Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-30 13:25:58 -07:00
神楽坂帕琪	abd94b65b7	eigen.cmake use url info from deps.txt (#16129 ) ### Description `eigen.cmake` use url info provided by deps.txt instead of using raw url.	2023-05-30 11:07:20 -07:00
Yuhong Guo	04a8f50674	New configuration to limit the arena extension (#15983 ) Add a configuration `max_power_of_two_extend_bytes ` to limit the arena extension size. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> In our real scenario, we observe that if the model is big enough the BfcArena will extend uncontrollable. As showed by the following figures, if a model uses more than 16GB memory, the BfcArena will totally apply for 32GB memory according to the `kNextPowerOfTwo` strategy. With the new strategy, the extension is limited. The default maximum extension size is 1GB. #### Without the new configuration After loading the model, ORT uses 32G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff) #### With the new configuration After loading the model, ORT uses 23G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b) Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-05-25 02:19:07 -07:00
Sumit Agarwal	70d2dc8209	[DML EP] Fix issue with --dml_path build option (#15972 ) ### Description DML_PACKAGE_DIR cmake variable is not getting set properly when dml_path build options is used. ### Motivation and Context - Why is this change required? What problem does it solve? It is required for DML Perf dashboard. <!--- If it fixes an open issue, please link to the issue here. -->	2023-05-24 19:20:40 -05:00
yf711	105f5f0f20	Avoid trt deprecated api warnings shown as errors during ORT-TRT build (#16035 ) ### Description Avoid trt deprecated api warnings shown as errors when building onnxruntime_test_all This issue is only visible when installing trt via binaries, rather than deb/rpm pkg (CI pipelines) The change is similar to existing set_property for onnxruntime_providers_tensorrt `89ea503024/cmake/onnxruntime_providers.cmake (L421)` ### Motivation and Context onnxruntime/test/unittest_main/[test_main.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/unittest_main/test_main.cc#L32) includes nvinfer.h, which includes deprecated trt apis and and generates warnings. When building onnxruntime_test_all, it will show warnings as errors and block the build. ### Doubts Although this issue is visible on trt tar binaries but not on trt deb/rpm pkgs, Their file size&hash are the same (creation time vary), regarding headers/libs installing in different ways. \| tarBin \| pkg \| \| ------------------------------------------------------------ \| ------------------------------------------------------------ \| \| 997284784 Apr 26 15:15 libnvinfer_builder_resource.so.8.6.1 \| 997284784 Apr 26 22:21 libnvinfer_builder_resource.so.8.6.1 \| \| 235369632 Apr 26 15:14 libnvinfer.so.8.6.1 \| 235369632 Apr 26 22:21 libnvinfer.so.8.6.1 \|	2023-05-24 13:19:27 -07:00
PeixuanZuo	2fddc65c8c	[ROCm] add hipblaslt into GemmFastGelu TunableOp (#15945 ) add hipblaslt into GemmFastGelu TunableOp.	2023-05-23 11:07:09 +08:00
RandySheriffH	d35361bf9d	Fix python pipeline for AzureEP without using root (#16023 ) Fix python pipeline for AzureEP without using root, this is for 1.15. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-22 16:38:47 -07:00
Changming Sun	0204594f90	Cleanup WASM cmake code (#15996 ) ### Description Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if (CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code look more nature. For example, ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY) ``` becomes ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten") ```	2023-05-20 18:07:39 -07:00
RandySheriffH	4dfb89b3ad	Implement mutex-free spin lock for task queue (#14834 ) Implemented "lock-free" spinlock to save CPU usage on context switching. The change has been tested on queene service of Ads team, the lock-free version of ort (40 threads) saves CPU usage on gen8 (128 logical processors on 8 numa nodes) windows by nearly half, from 65% to 35%. For 32 cores, the curve is flat: Anubis, 32 vCPU, windows, hugging face models, 95 percentile E2E latency in ms: model \| mutex(ms) \| mutex-free --- \| --- \| --- alvert_base_v2 \| 34.21 \| 34.09 bert_large_uncased \| 116.27\| 117.84 bart_base \| 72.06 \| 71.99 distilgpt2 \| 25.43 \| 25.02 vit_base_patch16_224 \| 37.33 \| 37.76 Anubis, 32 vCPU win, Linux, 1st party models, 95 percentile E2E latency in ms: model \| mutex(ms) \| mutex-free --- \| --- \| --- deepthink_v2 \| 24.35 \| 22.95 bing_feeds \| 36.96 \| 36.48 deep_writes \| 14.46 \| 14.32 keypoints \| 9.34 \| 7.69 model11 \| 1.71 \| 1.66 model12 \| 1.82 \| 1.44 model2 \| 4.21 \| 3.95 model6 \| 1.08 \| 1.05 agiencoder \| 0.99 \| 0.93 geminet_transformer \| 5.32 \| 5.24 --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-19 10:12:10 -07:00
Patrice Vignola	310b22aa0c	[DML EP] Update DirectML version to 1.12.0 (#16011 )	2023-05-18 19:37:12 -07:00
Ashwini Khade	0c815a95b7	android package fix (#15999 ) ### Description This PR adds the training headers to the training android packages. ### Motivation and Context Training headers need to be added as part of the training android packages, however because of the typo in the cmake these headers were not being added. This PR fixes the issue.	2023-05-18 09:21:03 -07:00
Changming Sun	842b1a3472	Revert a change in #15797 : restore the correct version of emsdk (#15995 ) ### Description Revert a change in #15797: restore the correct version of emsdk ### Motivation and Context Without change, when you build it on Windows you will see: ``` 2023-05-17 19:41:30,093 build [INFO] - Activating emsdk... 2023-05-17 19:41:30,093 util.run [INFO] - Running subprocess in 'C:\src\onnxruntime2\cmake\external\emsdk' 'C:\src\onnxruntime2\cmake\external\emsdk\emsdk.bat' activate 3.1.37 error: tool or SDK not found: '3.1.37' ```	2023-05-18 07:41:38 -07:00
kailums	f62f722c70	integrate triton into ort (#15862 ) ### Description In some scenarios, the triton written kernels are more performant than CK or other handwritten kernels, so we implement a framework that onnxruntime can use these triton written kernels. This PR is to integrate triton into ort, so that ort can use kernels that written and compiled by triton. The main change focus on two part: 1. a build part to compile triton written kernel and combine these kernels into libonnxruntime_providers_rocm.so 2. a loader and launcher in c++, for loading and launch triton written kernels. #### Build To compile triton written kernel, add a script `tools/ci_build/compile_triton.py`. This script will dynamic load all kernel files, compile them, and generate `triton_kernel_infos.a` and `triton_kernel_infos.h`. `triton_kernel_infos.a` contains all compiled kernel instructions, this file will be combined into libonnxruntime_providers_rocm.so, using --whole-archive flag. `triton_kernel_infos.h` defines a const array that contains all the metadata for each compiled kernel. These metadata will be used for load and launch. So this header file is included by 'triton_kernel.cu' which defines load and launch functions. Add a build flag in build.py and CMakeList.txt, when building rocm provider, it will call triton_kernel build command, and generate all necessary files. #### C++ Load and Launch On c++ part, we implement load and launch functions in triton_kernel.cu and triton_kernel.h. These two files located in `providers/cuda`, and when compiling rocm, they will be hipified. so this part supports both cuda and rocm. But currently we only call triton kernel in rocm. We also implement a softmax triton op for example. Because there will generate many kernels for different input shape of softmax, we use TunableOp to select the best one. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-05-17 09:35:28 +08:00
cloudhan	dc383ed4ce	Basic CSharp packaging support for ROCm EP (#15535 ) This PR mainly fixes building errors when trying to build nupkg for ROCm EP. It also slighly improve the packaging logic so that devlopers can produce the nupkg on linux natively.	2023-05-16 07:27:38 +08:00
Dmitri Smirnov	896a963492	Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921 ) ### Description This PR partially reverts changes introduced in https://github.com/microsoft/onnxruntime/pull/15643 We make two API return std::string always in UTF-8. We also move the entry points from OrtApiBase to OrtApi to make them versioned. ### Motivation and Context `GetVersionString` always returns x.y.z numbers that are not subject to internationalization. `GetBuildInfoString` can hold international chars, but UTF-8 should be fine to contain those. We prefix them with u8"" in case the compiler default charset is not UTF-8. Furthermore, creating platform dependent APIs is discouraged. `ORTCHAR_T` is platform dependent and was created for paths only. On non-unix platforms would still produce `std::string` that can only contain UTF-8 The API was introduced after the latest release, and can still be adjusted.	2023-05-13 13:45:07 -07:00
RandySheriffH	7c4e8267e7	Implement openAI endpoint invoker for nuget (#15797 ) Implement openAI audio endpoint, and enable nuget packaging. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-11 22:04:02 -07:00
Jian Chen	1a73d61829	Update eigen to 3.4 and remove the eigen from git submodule (#15875 ) ### Description Update eigen to 3.4 and remove the eigen from git submodule ### Motivation and Context We need to have eigen 3.4 for c++20	2023-05-11 11:56:59 -07:00
Changming Sun	7c58d013aa	Remove Ubuntu 18.04 usages (#15781 ) ### Description Remove Ubuntu 18.04 usages because it will be EOL this month. ### Motivation and Context	2023-05-11 11:44:00 -07:00
sdegrande	cf062dbdb1	FlatBuffers fails to compile with gcc13. (#15787 ) When building the FlatBuffers dependencies, gcc13 emits a stringop-overflow warning. All warnings being turned into errors, that fails the compilation of FlatBuffers, and as a consequence also fails the build of onnxruntime. This commit adds the application of a patch to FlatBuffers's CMakeList.txt, to add -Wno-error=stringop-overflow to the CMAKE_CXX_FLAGS.	2023-05-11 11:20:19 -07:00
liqun Fu	ac9ae9f7c5	update onnx release 1.14 for docker files (#15680 ) ### Description this is for ort 1.15 release to work with onnx 1.14 It shall be merged after onnx 1.14 release and before ort 1.15 release. ### Motivation and Context --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-05-10 13:15:56 -07:00
Sumit Agarwal	b473e3f3c6	[DML EP] Update DirectML version to 1.11.0 (#15858 ) ### Description - Update DML version to 1.11.0 - Disable Gemm+Softmax fusion ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-05-09 12:48:15 -07:00
Wanming Lin	00b1e79e04	Support WebNN EP (#15698 ) Description: This PR intends to enable WebNN EP in ONNX Runtime Web. It translates the ONNX nodes by [WebNN API](https://webmachinelearning.github.io/webnn/), which is implemented in C++ and uses Emscripten [Embind API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#). Temporarily using preferred layout NHWC for WebNN graph partitions since the restriction in WebNN XNNPack backend implementation and the ongoing [discussion](https://github.com/webmachinelearning/webnn/issues/324) in WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW' layouts. No WebNN native EP, only for Web. Motivation and Context: Allow ONNXRuntime Web developers to access WebNN API to benefit from hardware acceleration. WebNN API Implementation Status in Chromium: - Tracked in Chromium issue: [#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291) - CPU device: based on XNNPack backend, and had been available on Chrome Canary M112 behind "#enable-experimental-web-platform-features" flag for Windows and Linux platforms. Further implementation for more ops is ongoing. - GPU device: based on DML, implementation is ongoing. Open: - GitHub CI: WebNN currently is only available on Chrome Canary/Dev with XNNPack backend for Linux and Windows. This is an open to reviewers to help identify which GitHub CI should involved the WebNN EP and guide me to enable it. Thanks!	2023-05-08 21:25:10 -07:00
Yulong Wang	0457fd0b40	upgrade emsdk to 3.1.37 (#15817 ) ### Description upgrade emsdk to 3.1.37 WIP branch to debug the mystery memory issue in web assembly multi-thread build.	2023-05-08 16:49:47 -07:00
Guenther Schmuelling	5a43828b3d	update ort extensions to 94142d8391c9791ec71c38336436319a2d4ac7a0 (#15688 ) needed to get tokenizers/decode for whisper --------- Co-authored-by: Shalva Mist <shalvamist@microsoft.com>	2023-05-05 09:48:07 -07:00
cloudhan	412d05a1d2	[ROCm] Update cmake (#15807 ) Followup of #15775	2023-05-04 11:20:56 -07:00
Yulong Wang	33d1372729	[wasm] revert emsdk to v3.1.19 (#15793 ) ### Description latest emsdk generated multi-thread version sometimes crash with unknown reason ( error: memory access out of bounds ). we don't want to break existing ort-web users, so revert emsdk back to 3.1.19 (same to what ort v1.14.0 uses)	2023-05-04 01:15:01 -07:00
Baiju Meswani	ba7b83ff3c	Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776 )	2023-05-03 13:08:35 -07:00
Changming Sun	41c082fdde	Add a Github workflow for Prefast (#15763 )	2023-05-03 11:42:51 -07:00
Changming Sun	328cabb194	Download protoc from Github Release instead of Nuget (#15731 ) ### Description Download protoc from Github Release instead of Nuget to avoid having dependency on nuget.exe on Linux ### Motivation and Context To avoid having dependency on nuget.exe on Linux. Many users' build environment do not have nuget or dotnet.	2023-05-02 12:18:59 -07:00
Changming Sun	5352f6d9b0	Make "--cuda_version" build arg optional (#15758 ) ### Description This change will allow us building CUDA EP without installing CUDA SDK on Windows. ### Motivation and Context Nvidia's CUDA installer comes with a VS extension. In the past, we require installing the extension. It is a little bit inconvenient since: 1. Visual Studio must be installed before CUDA SDK. CUDA's installer will not install the extension if your machine doesn't have Visual Studio. 2. We need to install CUDA SDK on our build machines, instead of just downloading it and using it. After this change, we will not need to install CUDA SDK on our build machines. So it will be easier to add a support for a different CUDA version. Also, fix two PreFast warnings.	2023-05-01 18:00:47 -07:00
Ashwini Khade	0ffae8073b	Creating Nuget and Android packages for Training (#15712 ) ### Description This PR creates Nuget and Android for Training. ### Motivation and Context These packages are intended to be released in ORT 1.15 to enable On-Device Training Scenarios. ## Packaging Story for Learning On The Edge Release ### Nuget Packages: 1. New Native package -> Microsoft.ML.OnnxRuntime.Training (Native package will contain binaries for: win-x86, win-x64, win-arm, win-arm64, linux-x64, linux-arm64, android) 2. C# bindings will be added to existing package -> Microsoft.ML.OnnxRuntime.Managed ### Android Package published to Maven: 1. New package for training (full build) -> onnxruntime-training-android-full-aar ### Python Package published to PyPi: 1. Python bindings and offline tooling will be added to the existing ort training package -> onnxruntime-training	2023-05-01 12:59:56 -07:00
Sumit Agarwal	4c4f688a93	[DML EP] Fix dml_external_project (#15656 ) ### Description While building ORT for DML EP with `dml_EXTERNAL_PROJECT` flag, 2 variables (`DML_SHARED_LIB`, `DML_PACKAGE_DIR`) value is not set properly. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-05-01 12:02:56 -07:00
Chunye Wang@AMD	d35850c142	[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673 ) ### Description Originally VitisAI EP only works with old version of VitisAI release. ### Motivation and Context Update VitisAI EP so that it works together with the current VitisiAI 3.5 and further version of VitisAI. We try our best to make it forward compatible. --------- Co-authored-by: Wang Chunye <chunywan@xilinx.com> Co-authored-by: mingyue <mingyue@amd.com> Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com> Co-authored-by: liumingyue <mingyue@xilinx.com> Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com> Co-authored-by: shoucair <shoucai.ren@amd.com> Co-authored-by: zz002 <zhenze.wang@amd.com> Co-authored-by: BoarQing <yuz75@Pitt.edu> Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-01 08:28:26 -07:00
Changming Sun	65020d433e	Prefast fixes for CUDA EP (#15726 ) ### Description 1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only apply the flags to ORT code. 2. Fix some SDL warnings.	2023-04-29 12:43:12 -07:00
Yuhong Guo	41dcf0d32e	Expose build information in dynamic lib (#15643 ) ### Description <!-- Describe your changes. --> 1. Add Build Info API to onnx. 2. Fix compile error while building onnxruntime_benchmark in MacOs. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> 1. When Onnxruntime lib is serving online, we need a way to detect how this lib is built. This PR helps the developer to get the build information using `strings` such as git branch, git commit id, build type and cmake cxx flags, which is showed as follows. ![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png) ![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png) If the build env has no git, there will be no git related infor: ![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png) 3. Fix the following compile error while building benchmark in MacOs. ![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png) --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-04-28 21:57:31 -07:00
Changming Sun	d3e8d7a70d	Better support for cmake 3.26 and Windows ARM64 (#15704 ) ### Description In #8953 I introduced a change in our onnxruntime_mlas.cmake that it enables "ASM_MASM" cmake language for all Windows build. ```cmake enable_language(ASM_MASM) ``` Before the change, it is only enabled when onnxruntime_target_platform equals to x64. However, cmake 3.26 added a new language: ASM_MARMASM. According to cmake's manual, ASM_MASM is for Microsoft Assembler ASM_MARMASM is for Microsoft ARM Assembler. This one is new in cmake 3.26. We should choose the right one according to ${onnxruntime_target_platform}.	2023-04-27 10:25:45 -07:00
kunal-vaishnavi	cfb8c0e2ca	Add Whisper custom export to wheel (#15685 ) ### Description This PR adds the Whisper custom export scripts to the wheel. ### Motivation and Context This enables access to the custom export scripts in the wheel.	2023-04-26 10:45:52 -07:00
PeixuanZuo	0ecfe83932	[ROCm] add beam search support (#15625 ) add beam search support for ROCm EP.	2023-04-26 17:53:33 +08:00
Yulong Wang	b98317b907	[js/webgpu] following up for JSEP/WebGPU code cleanup (#15666 ) ### Description This PR resolves a part of non-critical comments from code review comments in #14579. - use `USE_JSEP` instead of `USE_JS` in build definition to make it less ambiguous - remove unused util functions from util.ts - fix transpose.h - other misc fixes	2023-04-25 21:20:03 -07:00
sfatimar	ebaafac3f5	Openvino ep ort 5.0 (#15626 ) ### Description The PR adds VPU support to OpenVINO Execution Provider Bug fixes for GPU, CPU. Changes to OpenVINO Backend in Serialized Model API for faster First Inference Latency. Deprecation to HDDL-VADM and MYRIAD, removed code Support OpenVINO 2023.0 Dynamic Shapes Support for iGPU ### Motivation and Context - VPU is an upcoming hardware that can provide AI Acceleration for Client Systems through OpenVINO - If it fixes an open issue, please link to the issue here. --> --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-04-25 20:59:42 -07:00
Changming Sun	9bf08bdb52	Fix iconv link issue (#15592 ) ### Description Fix iconv link issue. The library is used in string_normalizer.cc. ### Motivation and Context Though iconv is part of POSIX standard, some systems may have additional iconv providers, for example GNU iconv, that is not in the standard c runtime library. In these cases we may need to link to additional libraries. However, this change has two caveats: 1. It may silently pull in GNU libraries into libonnxruntime.so, and make the shared library not distributable. 2. The detection of iconv library runs before we add additional include folders to ORT. So the detection may be inaccurate.	2023-04-25 13:28:36 -07:00
Yulong Wang	14cc02c65c	[js/web] WebGPU backend via JSEP (#14579 ) ### Description This change introduced the following new components into ONNX Runtime Web: - JavaScript Execution Provider (JSEP) - Asynchronized inferencing execution powered by Emscripten's Asyncify - WebGPU backend implemented in TypeScript - initial implementation of kernels: - elementwise operators (22) - binary operators (5) - tensor: Shape, Reshape, Transpose, Gemm - nn: Conv, {Global}Maxpool, {Global}AveragePool Code need to be polished. still working on it. ## Q&A What is JSEP? > JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime execution provider that specifically works on Web environment (browsers). JSEP allows JavaScript code to kick in from various places when ONNX Runtime inferences a model. Why JSEP? > JSEP is a hybrid mode EP that contains both C/C++ and TypeScript/JavaScript implementation. There are 2 strong reasons why we introduces JSEP: > 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities as much as possible including graph transformer, optimizers and also the capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to develop and debug much easier in the browser for the kernel implementation. > 2. the requirement of asynchronized execution from JavaScript API (eg. `buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a synchronized context (see "async problem" section below). This is done by using Emscripten's Asyncify. What is WebGPU? > WebGPU is the new GPU API that available in browser. It's one of the only 2 APIs that currently available to access the GPU from browser (the other is WebGL). > WebGPU is designed with more advanced and stronger features comparing to WebGL and is potentially solution that offer the best GPU performance for model inferencing that currently available. What is the async problem and why we have the problem? > The "async problem" is a problem that you cannot call an async function in a synchronous context. Think about the following C++ code: > ```c > // C-style declarations (API) > typedef void (ON_COMPLETE)(PVOID state, DATA data); > void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete); > > // implementation > DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) { > // how to implement? > } > ``` > The answer is, it's impossible to implement this function. Usually we try to find a sync version API, or launch a thread to call the async function and sync-wait on the main thread. Unfortunately, in browser environment, neither is possible. > > WebGPU does not offer any synchronized API for data downloading (GPU to CPU). This is the only operation that MUST be async. As `OrtRun()` will eventually call into DataTransfer for copy data from GPU to CPU, and `OrtRun()` is a synchronized function, this cannot be done in normal way. What is Emscripten? How is the Asyncify feature resolved the problem? > Emscripten is the C/C++ compiler for WebAssembly. It's what we use to compile ORT and generates the WebAssembly artifacts which runs on browsers. > > Asyncify is a [compiler feature](https://emscripten.org/docs/porting/asyncify.html) that allows calling async functions from a synchronized context. In short, it generates code to unwind and rewind call stack to emulate async execution. With this feature, we are able to call the async function inside `OrtRun()` call. ## Design Overview Inter-op JSEP is doing pretty much same thing to just another EP. It exposes an interface for inter-op with JavaScript, which is defined in onnxruntime/wasm/js_internal_api.js: ```js // init JSEP Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) { Module.jsepBackend = backend; Module.jsepAlloc = alloc; Module.jsepFree = free; Module.jsepCopy = copy; Module.jsepCopyAsync = copyAsync; Module.jsepCreateKernel = createKernel; Module.jsepReleaseKernel = releaseKernel; Module.jsepRun = run; }; ``` This simple JavaScript snippet defines all language barrier level functions that requires by JSEP to achieve implementing kernels and data transfers using JavaScript inside ONNX Runtime: - `jsepBackend`: assign the singleton object to webassembly module - `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc() and Free() - `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU) - `jsepCopyAsync`: asynchronized copy ( GPU to CPU) - `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object that maintained in JS to match lifecycle of Kernel in ORT - `jsepRun`: OpKernel::Compute() should call into this The abstraction above allows to tie as little as possible connections and dependencies between C/C++ and TypeScript/JavaScript. Resource Management Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the implementation are left to JavaScript. JavaScript code are responsible to implement the callbacks correctly. For WebGPU, the GPU data is managed by JavaScript using a singleton map (tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton. Shaders are managed using a singletonmap (shader_key => gpu_program), while shader_key is generated by cache_key (OP specific, including attributes) and input shapes. about data transfer `js::DataTransfer::CopyTensor` implemented to call either synchronized or asynchronized copy callback, depending on the destination is GPU or not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function to be called in the synchronized context. run kernel in JS Kernel class constructor calls once `jsepCreateKernel()` with an optional per-kernel specific serialization to pass attributes into JavaScript. `Compute()` are implemented in a way that a metadata serialization is performed in a base class and JavaScript code can access the data using the Emscripten specific builtin macro `EM_ASM_`. disabled features* memory pattern is force disabled, because the WebGPU data is not presented by a general memory model (a buffer can be represented by offset + size). concurrent run support is disabled. WebGPU is stateful and it also has async function call. To support concurrent run will significantly increase the complexity and we don't get any real benefit from it. prefer channels last JSEP prefers channels last and returns `DataLayout::NHWC` in method `GetPreferredLayout()`. This will let the graph transformers to preprocess the graph into a channels last form so that a more optimized WebGPU shader can be used. Testing code It's impossible to test JSEP directly because JSEP itself does not contain any kernel implementation. However, it has the kernel registration which need to work together with the corresponding JavaScript code. There are unit tests that run onnx models from JavaScript API. --------- Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-04-24 15:21:18 -07:00
George Wu	8dd32fed47	[TensorRT EP] avoid excessive library load/unload overhead when running unit tests. (#15639 ) TensorRT will load/unload libraries as builder objects are created and torn down. This will happen for every single unit test, which leads to excessive test execution time due to that overhead. This overhead has steadily increased over the past few TensorRT versions as the library objects get bigger leading to 8 hours to run all the unit tests. Nvidia suggests to keep a placeholder builder object around to avoid this.	2023-04-24 14:43:13 -07:00
George Wu	c2acf69d13	support new include,lib dir structure in upcoming QNN 2.11 (#15605 ) upcoming QNN 2.11 will have a different include/lib directory structure. update cmake files to support the new structure.	2023-04-24 13:10:17 -07:00
Ashwini Khade	ccb2243ee7	Update build option for training in java to enable_training_api (#15638 ) ### Description Updating the build option for enabling training in java builds from ENABLE_TRAINING -> ENABLE_TRAINING_APIS. In the native codebase ENABLE_TRAINING is used for enabling full training and ENABLE_TRAINING_APIS is used for creating the lte builds with training apis. Making the change to sync the naming convention across all the language bindings. It was a bit confusing to see ENABLE_TRAINING when debugging the android build failures for training. Making this change just to improve readability of logs during debugging. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-04-24 11:53:08 -07:00
Tianlei Wu	686fd3c22a	Fix cuda 12.1 windows Build (#15614 ) ### Description Fix CUDA 12.1 Windows build error of cuda namespace ambiguous. Use a new namespace for attention softmax. Tested with VS 2019 and VS 2022 with the following settings: - OS: Microsoft Windows 11 Enterprise (Version 10.0.22621 Build 22621) - CUDA: cuda_12.1.0_531.14_windows - TensorRT: TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0 - CUDNN: 8.8.1.3 for cuda 12 - Visual Studio Enterprise 2019, version 16.11.26 (MSVC v142) or Visual Studio Enterprise 2022 (64-bit), version 17.5.4 - Python: 3.10 - CMake: 3.25.2 VS 2019: ``` build.bat --cmake_generator "Visual Studio 16 2019" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12" ``` VS 2022: ``` build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt_2022 --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12" ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> https://github.com/microsoft/onnxruntime/issues/15242	2023-04-24 10:02:35 -07:00
cloudhan	9e44248bf9	Workaround ROCm global pool (#15481 ) Implement global avg/max pool with reduction	2023-04-23 11:48:43 +08:00
Ye Wang	633dec0b17	refactor some code (#15566 ) ### Description <!-- Describe your changes. --> 1. moved onnxruntime/contrib_ops/cuda/decoder to onnxruntime/contrib_ops/cuda/bert 2. create utils.cuh under /bert for shared implementations in decoder_masked_multihead_attention_impl_utils.h and rotary_embedding_util.h 3. refactored relative_attn_bias_impl.cu by reusing the template specializations in utils.cuh ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-04-21 12:57:08 -07:00
Scott McKay	446c478fbd	Add iOS Swift Package Manager support (#15297 ) ### Description <!-- Describe your changes. --> Add Swift Package Manager (SPM) support for ORT based on #14621 - uses the existing objective-c bindings - some re-organization of the directory structure was required but the contents of the files are unchanged, apart from adjustments due to file movements Add tool for updating ORT native pod used in the SPM package Update CIs to use ORT native pod from build, and build/test using SPM ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> iOS developers are using SPM as much as cocoapods, so adding SPM means both are catered for.	2023-04-20 16:18:35 +10:00
George Nash	f2889b41c1	[AMX] Update assembler check (#15501 ) A recent commit added an assembler check if the ASM dialect was ATT This unfortunately broke the AMX build for systems that don't have the ASM-ATT dialect. This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed checks AMX is supported by the compiler and assembler. ### Description ### Motivation and Context On my build system the recent change to add the ASM-ATT version check disabled AMX code from the build. --------- Signed-off-by: George Nash <george.nash@intel.com>	2023-04-19 14:16:26 -07:00

... 3 4 5 6 7 ...

1776 commits