Commit graph

1563 commits

Author SHA1 Message Date
snadampal
77da2ef278
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031)
### Description
This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to
implement matrix multiplication with bfloat16 SIMD instructions (bfmmla)
and MatMul operator changes to invoke the Sbgemm kernel. To enable
Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.

### Motivation and Context

This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x
-1.76x performance improvement compared to sgemm (fp32) kernel
performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results are matching to sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
2024-01-22 14:43:06 -08:00
Edward Chen
c8ce83967e
Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209) 2024-01-19 15:30:09 -08:00
luoyu-intel
459c750b03
Update x64 template kernel library for 'sqnbitgemm' (#19016)
### Description
<!-- Describe your changes. -->
1. Make JBLAS codes an external module of ORT.
2. Move q4 gemm code to contrib_ops.
3. Update template kernel library to v0.1 release.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We found that the current LLM model performance is far below our
expectations. Here is some performance data collected on Mistral-7B
model with Xeon-8480:
8 threads | prompt length=32 past_len=32 | prompt length=1   past_len=32
-- | -- | --
ORT-main | 1220ms | 263ms
Neural-speed | 564ms | 87ms
ORT-this PR|597ms|120ms

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code,
there is a 33ms(87ms vs. 120ms) latency gap between the two frameworks.
Through some statistics analysis, the summary latency of `MatMulNBits`
is 86.7ms
The summary latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So
other OPs introduce an extra 30ms latency.

The performance of MatMulNBits in this PR meets our expectations.

### Remain Issues
1. For hybrid CPUs, like core 12900K, the ONNXRuntime thread pool uses
TaskGranularityFactor to scale its number of threads. This is not
expected in our code design. It may slow down the hybrid CPU performance
by 30~40%.
2. Prepack uses a single thread which is very slow to init a session.
3. MatMulNBits with zero points will fall through to COMP_FP32 even
accuracy_level=4. Our COMP_INT8 IGemmCore with zero points process is
not optimized for now. It will be updated in the future. So, for an int4
model with zero points, whether the accuracy_level is 0 or 4 will be no
difference.
2024-01-18 13:16:34 -08:00
Maximilian Müller
bc219ed553
[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052)
Adresses https://github.com/microsoft/onnxruntime/issues/18542.
I followed the advice given by @RyanUnderhill
[here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925)
and went with a minimal CUDA EP for now.
2024-01-17 11:33:34 -08:00
Wanming Lin
07d3aed3aa
[WebNN EP] Fixed build issue with disable_rtti (#19173)
Previously building webnn ep with --disable_rtti will throw
unboundTypeError since unbound type names are illegal with RTTI disabled
in Embind API, we can fix it by adding a
-DEMSCRIPTEN_HAS_UNBOUND_TYPE_NAMES=0 flag.
2024-01-16 21:35:13 -08:00
Changming Sun
e2e488d6f8
Revert "iOS packaging pipeline stability" (#19135)
Reverts microsoft/onnxruntime#19097 because it broken Android CI
pipeline.
2024-01-16 09:18:35 -08:00
Jeff Bloomfield
8d4369b77e
Update DirectML nuget version to 1.13.1 (#19122)
### Description
Update DML version to 1.13.1



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-15 19:04:41 -08:00
Edward Chen
e1e45901e2
iOS packaging pipeline stability (#19097)
- Remove protoc build step which sometimes times out. Download protoc instead.
- Use macOS-12 image in the set variables stage. It seems more stable.
2024-01-13 19:27:44 -08:00
Edward Chen
150c4cb8fe
[MLAS AArch64] SQNBitGemm CompInt8 kernel (#18953)
Implement ARM NEON SQNBitGemm kernel that first block quantizes A to int8 and then does int8 multiplication.
2024-01-12 17:58:08 -08:00
Guenther Schmuelling
96dbac6e4b
update to emsdk-3.1.51 (#18844) 2024-01-12 16:04:33 -08:00
Preetha Veeramalai
c340bf08f6
Openvino EP code changes for 1.17 update (#19023)
### Description
Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV
2023.3.


### Context

- The API is added to facilitate customers in using published official
Microsoft onnxruntime libraries with OVEP libraries.
- Add support for OpenVINO 2023.3 official release.
- Extend operator coverage 
- GH fixes

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
2024-01-12 13:20:51 -08:00
Changming Sun
0e8d4c3d21
Enable Address Sanitizer in CI (#19073)
### Description
1. Add two build jobs for enabling Address Sanitizer in CI. One for
Windows CPU, One for Linux CPU.
2. Set default compiler flags/linker flags in build.py for normal
Windows/Linux/MacOS build. This can help control compiler flags in a
more centralized way.
3. All Windows binaries in our official packages will be built with
"/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft
public symbol
server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols).

Limitations:
1. On Linux Address Sanitizer ignores RPATH settings in ELF binaries.
Therefore once Address Sanitizer is enabled, before running tests we
need to manually set LD_LIBRARY_PATH properly otherwise
libonnxruntime.so may not be able to find custom ops and shared EPs.
4. On Linux we also need to set LD_PRELOAD before running some tests(if
the main executable, like python, is not built with address sanitizer.
On Windows we do not need to.
5. On Windows before running python tests we should manually copy
address sanitizer DLL to the onnxruntime/capi directory, because python
3.8 and above has enabled "Safe DLL Search Mode" that wouldn't use the
information provided by PATH env.
6. On Linux Address Sanitizer found a lot of memory leaks from our
python binding code. Therefore right now we cannot enable Address
Sanitizer when building ONNX Runtime with python binding.
7. Address Sanitizer itself uses a lot of memory address space and
delays memory deallocations, which is easy to cause OOM issues in 32-bit
applications. We cannot run all the tests in onnxruntime_test_all in
32-bit mode with Address Sanitizer due to this reason. However, we still
can run individual tests in such a way. We just cannot run all of them
in one process.

### Motivation and Context
To catch memory issues.
2024-01-12 07:24:40 -08:00
PeixuanZuo
5f3113ecd6
[ROCm] Fix hipify error: fast_divmod.h: No such file or directory (#19060)
Fix error:
```
[ 48%] Built target onnxruntime_optimizer

In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5:
/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory
   11 | #include "core/providers/rocm/shared_inc/fast_divmod.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```

This error is due to onnxruntime_optimizer missing dependencies on
hipify generated files.
2024-01-10 14:49:19 +08:00
Milos Puzovic
37ac9d391c
Enable Arm Compute Library 23.08 (#17672)
### Description

This PR enables onnxruntime to build with the most recent release of Arm
Compute Library

### Motivation and Context

The latest version of Arm Compute Library that onnxruntime can build is
20.02 which is more than 3 years old.
2024-01-09 14:10:25 -08:00
Changming Sun
68c29ece23
In a Linux or Android build check if the compiler support bfloat16 and float16 (#18813)
### Description
Restrict clang version because we have an upcoming change that requires
clang version >=16 , which will mainly affect Android build.
2024-01-08 19:46:33 -08:00
Jeff Bloomfield
55a669409a
Merge pull request #18983 from microsoft/WindowsAI
Merge WindowsAI to main
2024-01-04 17:21:19 -08:00
Wei-Sheng Chin
658e30eb33
Remove DORT since it's in PyTorch main now (#18996)
Main code are removed and tests are modified to use DORT directly from
PyTorch.
2024-01-04 12:59:47 -08:00
Xavier Dupré
889b1ef2d1
Fix schema type constraint for custom operators (#17497)
### Description
onnxruntime may raise an error "type inference failed" but when a custom
operator sets IsHomogeneous to false in its schema. This change make
sure that TypeInferenceFunction and schema type constraints are aligned
to prevent that from happening.

---------

Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2024-01-04 20:27:46 +01:00
Yulong Wang
b18abaaa2c
[js/web] wait for threadpool initialization (#18952)
### Description

a replacement of #18683. try to resolve #18689.

By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the
threadpool to initialize before the webassembly instance is available.
2024-01-04 08:06:55 -08:00
Sheil Kumar
107d7492b9 [DirectML EP] Add DML EP registration for Col2Im (#17786)
### Description
[DirectML EP] Add DML EP registration for Col2Im operator

### Motivation and Context
Add Col2Im support for opset 18.
This operator is implemented as the DirectML Fold operator.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
2024-01-03 16:13:14 -08:00
Jeff Bloomfield
c3d96a7b35 Update DML version to 1.13.0 (#18978)
Update DML nuget version to 1.13.0
2024-01-03 16:09:55 -08:00
Sheil Kumar
dbb8680bdc
Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll (#18913)
Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-12-26 12:33:42 -08:00
pengwa
37f743680a
Fix build when flash attention and memory efficient attention are disabled (#18761)
### Fix build when flash attention and memory efficient attention are
disabled

On a customer env with lower version of CUDA < 11.6. Both flash
attention and memory efficient attention is turned OFF according to
e8f33b54ba/cmake/CMakeLists.txt (L701).
So
e8f33b54ba/cmake/external/cutlass.cmake (L1)
condition check return false. No cutlass lib is built.

```
Turn off flash attention since CUDA compiler version < 11.6
```

While, the kernels in
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/moe/ft_moe
are depending on cutass for its build, so we get error like this:

```
[ 77%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
fatal   : Could not open input file /tmp/tmpxft_00044da3_00000000-11_moe_gemm_kernels_fp16_fp16.compute_60.cpp1.ii
make[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:6290: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2210: CMakeFiles/onnxruntime_providers_cuda.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 2746, in <module>
    sys.exit(main())
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 2639, in main
    build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 1527, in build_targets
    run_subprocess(cmd_args, env=env)
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 824, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "/tmp/onnxruntime/tools/python/util/run.py", line 49, in run
    completed_process = subprocess.run(
  File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
```


### Motivation and Context

To summarize, there are two cases we will have build failure for Linux
CUDA build:
1. User use cuda version < 11.6
2. User disabled Flash attention and memory efficient attention
explictly with onnxruntime_USE_FLASH_ATTENTION and
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION
2023-12-26 08:57:58 +08:00
luoyu-intel
5f00bc9931
Integrate high-performance x64 gemm library to MLAS (#17669)
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling
size(e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613
| 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929
| 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 |
13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 |
6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 |
11632339 | 59

Win11+12900K 8 cores:
Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976
| 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time |
60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time |
1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time |
470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time |
1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time |
3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time |
569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time |
1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time |
3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time |
1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time |
4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time |
8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
2023-12-19 09:36:31 -08:00
Changming Sun
f52668cc68
Disable mlas unit test in ARM64EC build (#18747)
### Description
Disable mlas unit test in ARM64EC build because the program has some
link errors. We will fix the errors later.
This PR only impacts Windows ARM64EC build. It has no impact on the
existing build pipelines.
2023-12-15 09:17:47 -08:00
Changming Sun
d795fc636c
FIX: Our cmake script didn't check googletest's hash (#18826) 2023-12-15 08:48:15 -08:00
Changming Sun
cbad4fe49b
Update absl and googletest (#18827)
### Description
Update absl and googletest to their latest version to include some cmake
changes:
1. A googletest's cmake change that will allow using external absl and
re2.
2. Nullability enhancements that will allow our clang-based static
analysis detecting many kinds of null pointer errors.



### Motivation and Context
To fix a C4744 link warning in our Windows pipelines.
```
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\usage.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<int>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
```
2023-12-14 16:15:07 -08:00
Yueqing Zhang
b42d4b8ea6
[VitisAI] 1. api compatbile 2. dynamic load onnx (#18470)
### Description
<!-- Describe your changes. -->

1. Add a backward-compatible API for compiling model.
2. Run-time load vitisai-ep.dll


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2023-12-14 14:43:41 -08:00
Suryaprakash Shanmugam
0723dcb8b5
OpenVINO Execution Provider with 2023.2 support (#18596)
- Add support for OpenVINO 2023.2
- num_of_threads provider option is mapped to the CPU device property
inference_num_threads of the CPU plugin, so users can control the
#threads used for inference by the CPU
- Logging in Debug mode now includes the runtime properties set for
devices
- Fix issue in using external weights through OpenVINO

---------

Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-12-13 15:56:43 -08:00
Adrian Lizarraga
81796a3081
[QNN EP Quantization] Add fusion preprocessing to QNN quantization (#18719)
### Description
- Adds graph fusions to preprocessing step that can be called before
creating a QDQ model for QNN EP.
- Fuse Erf sequence to Gelu (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)).
Required by QNN EP.
- Fuse ReduceMean sequence to LayerNormaliation (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)).
Not required by QNN EP.
- Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP).
Required by QNN EP.

Example use:
```python3
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"

# Quantize model ...
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context
Allow more models to be quantized for use with QNN EP

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2023-12-12 08:43:04 -08:00
Changming Sun
c7799d7058
Build fixes for Windows ARM32 desktop build (#18752)
### Description
Fix a link error:

```
onnxruntime_common.lib(cpuid_info.obj) : error LNK2019: unresolved external symbol __imp_RegGetValueA referenced in function "privat
e: void __cdecl onnxruntime::CPUIDInfo::ArmWindowsInit(void)" (?ArmWindowsInit@CPUIDInfo@onnxruntime@@AAAXXZ) [C:\Users\snnn\src\on
nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventRegister referenced in function "pub
lic: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\on
nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventUnregister referenced in function "p
ublic: virtual __cdecl onnxruntime::WindowsTelemetry::~WindowsTelemetry(void)" (??1WindowsTelemetry@onnxruntime@@UAA@XZ) [C:\Users\y
ilyu\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventSetInformation referenced in functio
n "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\
src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventWriteTransfer referenced in function
_tlgWriteTransfer_EventWriteTransfer [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\RelWithDebInfo\onnx_test_runner.exe : fatal error LNK1120: 5 unresolved ex
ternals [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]

```
2023-12-08 12:45:06 -08:00
Changming Sun
bf33919afb
Update absl and gtest to fix an ARM64EC build error (#18735)
### Description
Update absl and gtest to fix an ARM64EC build error


### Motivation and Context
We need to get an important fix into ORT.
The fix is:

8028a87c96
2023-12-07 15:55:17 -08:00
junchao-loongson
4abec9749e
[mlas] add loongarch lsx and lasx optimize code (#17937)
### Description
Hello we(@lixing-star) are the developers of loongson team.

We add 128 (lsx), 256 (lasx) vector optimization code for the loongarch
architecture


[100% tests passed, 0 tests failed out of
7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true)

### Development Environments1
```
CPU: 
    Loongson-3C5000L
uname -a:  
    Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux

```
### LonngArch Documents
- [LoongArch Reference Manual - Volume 1: Basic Architecture: This
manual describes the basic part of the LoongArch
architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html)
- [LoongArch ELF psABI: This manual describes the LoongArch ELF
psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
-
[more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)
2023-12-07 11:15:59 -08:00
moyo1997
9479ba525b
Build onnxruntime.dll as arm64x (#18633)
Build onnxruntime.dll as arm64x

Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to building
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.

If other dlls wanted to be built as x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.

Added build_arm64x.bat as a wrapper for the multiple (rm64, then
arm64ec) cmake calls needed to build as arm64x.

AB#22533
2023-12-06 16:49:00 -08:00
Ye Wang
c012e41f93
MoE with Expert Slicing (#18565)
### Description
<!-- Describe your changes. -->

Registered Sharded MoE op under contrib_op/cuda/collective with expert
slicing. The broadcast process happens just before adding second bias(if
has) and permutation undoing. Tensor slicing is planned but not included
in this PR.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-05 16:56:38 -08:00
Adrian Lizarraga
e066fca777
[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465)
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
    list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
    per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
          'quant_type' = QuantType : The tensor's quantization data type.
          'scale' =  Float         : The scale value to use. Must also specify `zero_point` if set.
          'zero_point' = Int       : The zero-point value to use. Must also specify `scale` is set.
          'symmetric' = Bool       : If the tensor should use symmetric quantization. Invalid if also
                                     set `scale` or `zero_point`.
          'reduce_range' = Bool    : If the quantization range should be reduced. Invalid if also
                                     set `scale` or `zero_point`.
          'rmax' = Float           : Override the maximum real tensor value in calibration data.
                                     Invalid if also set `scale` or `zero_point`.
          'rmin' = Float           : Override the minimum real tensor value in calibration data.
                                     Invalid if also set `scale` or `zero_point`.
```

- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.

Example for per-tensor quantization overrides:
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
        "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
        "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
    },
}
```

Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "WGT": [
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.0,
                "rmax": 2.5,
                "reduce_range": True,
            },
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.2,
                "rmax": 2.55,
                "reduce_range": False,
            },
        ],
        "BIAS": [
            {"zero_point": 0, "scale": 0.000621},
            {"zero_point": 0, "scale": 0.23},
        ],
    },
}
```

#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.

Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config

qnn_config = get_qnn_qdq_config(input_model_path,
                                data_reader,
                                activation_type=QuantType.QUInt16,
                                weight_type=QuantType.QUInt8)
                                
quantize(input_model_path,
         output_model_path,
         qnn_config)
```

### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2023-12-04 17:54:58 -08:00
snadampal
05a9c95764
[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847)
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.

### Description
<!-- Describe your changes. -->
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
2023-12-01 09:16:44 -08:00
George Wu
5c67a00d8e
Revert "remove full protobuf requirement for tensorrt ep" (#18626)
Reverts microsoft/onnxruntime#18413

there's a timing issue here. we eventually want to get this change
merged in but we need to update OSS onnx-tensorrt first.
2023-11-29 22:27:51 -08:00
Edward Chen
14a343441d
Fix Objective-C static analysis build (#18606)
- Patch abseil to fix a compile error about not finding `cxxabi.h`.
- Fix some static analysis warnings.
2023-11-28 17:14:20 -08:00
Rachel Guo
288b80d363
Add MacOS build to ORT C Pod (#18550)
### Description
<!-- Describe your changes. -->

As title.

1. Add macos build as an optionally enabled arch for pod and changes to
exsiting build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to onnxruntime-c pod. as the original
request was from ORT SPM which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable macos platform support in cocoapods. and also potentially produce
binary target for enabling macos platform in SPM as well.

Replace https://github.com/microsoft/onnxruntime/pull/18334

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-28 10:11:53 -08:00
Chen Fu
05046e5452
Adding unit test for sm80 prepack (#18514)
### Description
Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware


### Motivation and Context
Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs.
This kernel requires sophisticated quantized weight rearrangement to
speedup loading data to tensor-core. To facilitate the addition, this
change includes the following:

1. matrix_layout.h A new layout lib that facilitate iterating matrix
elements and tiles that balance memory safety and performance.
2. prepack_sm80.h Code for rearranging quantized weight, scales and
offsets (aka. prepacking)
3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the
memory safety and correctness of the prepacking code.

Currently the prepacking code runs on CPU with single threaded code. We
run this on CPU in order to minimize GPU memory fragmentation. On the
other hand, hopefully we get around to parallelize this part of the
code. Should be straight forward with the unit tests in place.
2023-11-28 10:01:09 -08:00
Sheil Kumar
0b7048e7d6
Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384)
Update winml to use #cores - #soc cores by Default as the number of
intraopthreads

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-11-28 09:26:48 -08:00
cloudhan
6f3c1f9dc9
[ROCm] Update ck for GemmFloat8 (#18487) 2023-11-23 12:06:19 +08:00
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently memory optimizer runs graph transformations and print
recompute opportunities in INFO level, while ORT backend has many many
INFO level logs making users hard to find those information. So we are
looking for a Python binding API to retrieve the memory optimization
opportunities instead of depending on the MemoryOptimizer's default
logging.
Then we can print ORTModule feature statistics using this information. 
Also, with such an API, we can create an ORT session created, where
allocation plan is done, the analysis will consider buffer reuse as
well. This can void giving some recomputation subgraphs that are reusing
other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull requests made following refactoring:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract optimization strategies for their own implementation. This
will make introducing new strategies (for example compression and
decompression ) easier.

New logging matrix (INFO Level), in WARNING level, the details will NOT
show.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-23 11:39:00 +08:00
Dmitri Smirnov
cc542024ce
Create edges with arg positons correctly accounting for non-existing args (#18462)
### Description
Truncate traling non-existing arguments.
  Make sure we do not skip on the non-existing arguments in the middle,
  because shape inferece relies on their proper position.
This also affects the argument position in the Edges that must be
properly rebuilt
  each time If node branch is inlined.
Make sure that when we rename Defs in subgraphs, new renamed defs are
created in those subgraphs
  instead of pointing to outer scope defs.
  Add unit test.

### Motivation and Context
This is a follow up for
https://github.com/microsoft/onnxruntime/pull/18105
Currently, the non-trailing arguments are simply ignored and the edges
are created
with potentially incorrect positions.
2023-11-20 14:49:09 -08:00
Akshay Sonawane
97cc40d75a
Add fusion patterns for conformer-transducer model (#18461)
### Description
Add conformer-transducer model type to optimizer. This PR adds pattern
matches for attention shown below:
Unfused attention:

![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956)

Fused attention:

![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)
2023-11-18 23:39:04 -08:00
Ashwini Khade
02333293de
Removed all the deprecated python training code and related tests and utils (#18333)
### Description
Motivation for this PR is code cleanup.

1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
2023-11-17 18:19:21 -08:00
George Wu
d73073d491
remove full protobuf requirement for tensorrt ep (#18413)
tensorrt can work with protobuf lite.
2023-11-16 20:44:27 -08:00
Yulong Wang
6f9f653ada
[wasm] increase test max memory from 2G to 4G (#18459)
### Description
increase max memory from 2G to 4G for onnxruntime_test_all in
WebAssembly build.
2023-11-15 17:51:04 -08:00
Edward Chen
0a4d76d98b
MLAS AArch64 quantized int4 Gemm kernel (#18031)
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.

- Connect MatMulNBits contrib op to MLAS function.
2023-11-15 09:31:54 -08:00