onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-08 17:17:15 +00:00

Author	SHA1	Message	Date
Jian Chen	8914fe687b	[js/webgpu] Include Support for neg.int32 (#17374 ) ### Description Include Support for neg.int32	2023-09-06 12:00:16 -07:00
Edward Chen	a3a1237270	Disable xcpretty filtering of xcodebuild output in iOS packaging pipeline. (#17429 )	2023-09-06 09:04:17 -07:00
Yulong Wang	fa868ca9cd	[js/node] release sessions after use in npm test (#17353 ) ### Description Release sessions after use in the npm test.	2023-09-05 23:42:32 -07:00
Yulong Wang	d88406a31b	[js/common] use Map instead of object for backends (#17352 ) ### Description resolved https://github.com/microsoft/onnxruntime/security/code-scanning/1140	2023-09-05 23:14:46 -07:00
Yulong Wang	75710f0006	[js/webgpu] add matmul broadcast tests (#17335 ) ### Description Commit `fffefb1c22` (#16969) optimized matmul and also fixes broadcasting. So #17191 is no longer needed. However, the newly added operator test file from the PR by @dakenf is helpful so pick and add it to enhance the tests.	2023-09-05 20:41:46 -07:00
Yulong Wang	110a2d0b73	[build][wasm] add js_internal_api.js to link dependency (#17407 ) ### Description add js_internal_api.js to link dependency. Now changes to js_internal_api.js will correctly trigger re-link of ort-wasm.wasm	2023-09-05 20:40:40 -07:00
Yulong Wang	2cb75420ac	[js/common] clean up JSDoc (#17408 ) ### Description clean up JSDoc for onnxruntime-common: - replace "@internal" with "@ignore", as JSDoc does not use "@internal". Using "@ignore" keeps the content out of the generated doc.	2023-09-05 20:40:23 -07:00
Vincent Wang	deda5db231	[ORTModule] Add Manual Seed to Fix UT Failure (#17411 ) Add manual seed to fix ORTModule UT failure.	2023-09-06 11:24:55 +08:00
James Baker	16eba537a8	rust bindings: Do not unnecessarily re-run build.rs (#17018 ) ### Description Remove unnecessary cargo:rerun-if-changed declaration. ### Motivation and Context 'cargo:rerun-if-changed' declarations tell Cargo when to re-run the build script. The intention is that if the build script depends on other files, then Cargo knows to re-run if those files change. It stores the output and checks it before each build. The intention is that one emits the declarations for _inputs_ of the build. This rerun-if-changed declaration is a declaration on the _output_ of the build, and stores the absolute path of the output. This is not a useful declaration because the output path is unique to the build script - there is no way for anything else to change it. However, this does generate unnecessary rebuilds in some cases, for example if the dependent repository is moved in the filesystem. This causes me some issues when using https://crane.dev, as due to some implementation details, if a crate being moved triggers a rebuild, by default the build is broken. To summarise: - declaration is redundant - causes issues in niche cases.	2023-09-05 19:42:06 -07:00
Changming Sun	c6b0d185b4	Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856 ) ### Description 1. Update docker files and their build instructions. ARM64 and x86_64 can use the same docker file. 2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8 AB#18990	2023-09-05 18:12:10 -07:00
xhcao	026672e947	[js/webgpu] Support slice int32 (#16968 ) Co-authored-by: Xing Xu <xing.xu@intel.com>	2023-09-05 18:05:47 -07:00
Scott McKay	e1a9f2ed6d	Fix insufficient space error in Android CI (#17423 ) ### Description Remove onnxruntime_test_all from the emulator once tests have finished, as it's 1.2GB and takes up too much space given the 2GB maximum partition size for the emulator. A side issue is that the java build isn't able to strip the binaries in the java apk, which causes that to be 800MB (exceeding the 2GB max). That may require an Android/Gradle fix as I don't think we can hardcode an NDK version into our build files. https://issuetracker.google.com/issues/237187538?pli=1 ### Motivation and Context Fix Android CI build failures for	2023-09-06 10:12:05 +10:00
petermcaughan	fa28359beb	Reduce GPU memory for Whisper models converted to ONNX (#17378 ) ### Description This PR changes the Whisper export scripts to further optimize the process of removing duplicate initializers from two subgraphs. The current Greedy approach is quicker by a large factor, but results in some duplicate initializers not being caught and removed. This not only results in a slightly larger Whisper model, but also a model that uses more GPU memory. The approach in this PR uses data hashes and caches to keep a quick export but no longer rely on a greedy approach. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-09-05 16:24:20 -07:00
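The hash-and-cache idea in the Whisper commit above can be illustrated with a minimal sketch. This is not the PR's code: the function name `dedupe_initializers` and the plain-dict representation of initializers (name mapped to dtype, shape and raw bytes) are assumptions made for illustration only.

```python
import hashlib


def dedupe_initializers(initializers):
    """Return a rename map from each initializer name to a canonical name.

    `initializers` is a hypothetical dict: name -> (dtype, shape, raw_bytes).
    Two initializers are treated as duplicates when dtype, shape and the
    SHA-256 digest of their raw bytes all match.
    """
    canonical_by_key = {}  # cache: (dtype, shape, digest) -> first name seen
    rename = {}
    for name, (dtype, shape, raw) in initializers.items():
        key = (dtype, tuple(shape), hashlib.sha256(raw).hexdigest())
        # setdefault keeps the first name seen for this key as the canonical one
        rename[name] = canonical_by_key.setdefault(key, name)
    return rename


example = {
    "encoder.weight": ("float32", [2], b"\x00\x01\x02\x03\x04\x05\x06\x07"),
    "decoder.weight": ("float32", [2], b"\x00\x01\x02\x03\x04\x05\x06\x07"),
    "decoder.bias": ("float32", [1], b"\x00\x00\x80\x3f"),
}
rename_map = dedupe_initializers(example)
```

Each tensor is hashed once and looked up in a dictionary, so the pass stays linear in the number of initializers while still catching every byte-identical duplicate.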
Dmitri Smirnov	dbcc60bed5	Introduce output type/shape validation (#17301 ) ### Description Validate output types and shapes. Make sure sparse initializers are taken into account. ### Motivation and Context ORT currently does not validate output types or shapes. Further, neither inputs nor outputs take into account sparse initializers that are converted from dense. It is currently possible to pre-allocate a wrong type/shape buffer for output. Cc: @Craigacp	2023-09-05 15:25:12 -07:00
Tianlei Wu	8818a99c93	Set proper nvcc threads to avoid OOM (#17419 ) ### Description There are 8 cu files under [flash attention](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/flash_attention) and 4 cu files under [cutlass fmha](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/cutlass_fmha) that need a lot of memory to compile. Previously, the default number of nvcc threads was the same as the parallel setting, i.e. the number of CPU cores. Standard_NC4as_T4_v3 has 4 CPUs and 28 GB memory, and we launched 16 nvcc threads in total (4 parallel jobs, and 4 nvcc threads per job). Each thread might take 4 GB on average (peak is around 6GB, but threads are not started at the same time). OOM happens since 16 threads might need close to 64 GB in the worst case. When the build machine has 64GB or more memory, OOM is rare. Here we set a proper nvcc --threads based on available memory to avoid OOM. ### Motivation and Context Fix `Python Packaging Pipeline (Training Cuda 11.8)`	2023-09-05 10:59:27 -07:00
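The memory arithmetic in the commit above can be written out as a small sketch. The helper name `pick_nvcc_threads` and the 4 GB-per-thread default are assumptions taken from the numbers quoted in the message; the heuristic actually used by the build script may differ.

```python
def worst_case_memory_gb(parallel_jobs, nvcc_threads, memory_per_thread_gb=4.0):
    """Estimated memory if every nvcc thread of every job is busy at once."""
    return parallel_jobs * nvcc_threads * memory_per_thread_gb


def pick_nvcc_threads(total_memory_gb, parallel_jobs, memory_per_thread_gb=4.0):
    """Hypothetical heuristic: largest thread count per job that fits in memory.

    Always returns at least 1 so the build can still proceed.
    """
    budget = int(total_memory_gb // (parallel_jobs * memory_per_thread_gb))
    return max(1, budget)


# Numbers from the commit message: 4 parallel jobs x 4 nvcc threads x 4 GB each.
old_estimate = worst_case_memory_gb(parallel_jobs=4, nvcc_threads=4)  # 64.0 GB

# Standard_NC4as_T4_v3: 4 CPUs and 28 GB of memory.
threads_small_vm = pick_nvcc_threads(total_memory_gb=28, parallel_jobs=4)  # 1
threads_large_vm = pick_nvcc_threads(total_memory_gb=64, parallel_jobs=4)  # 4
```

With this sketch a 28 GB machine falls back to one nvcc thread per job, while a 64 GB machine keeps the previous four.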
Lennart Hannink	e3bb2a0cdd	Fix git working dir for ORT_BUILD_INFO (fixes #17197 ) (#17198 ) ### Description Git commands producing `git-commit-id` and `git-branch` are always run in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`) ### Motivation and Context Please refer to corresponding issue [#17197](https://github.com/microsoft/onnxruntime/issues/17197).	2023-09-05 09:20:49 -07:00
cloudhan	6ea3908db4	Add ck's streamk and splitk gemm impl (#17280 )	2023-09-04 11:49:07 +08:00
Jiajia Qin	5e747071be	[js/webgpu] Fix bug in conv2dByMatMul path (#17369 ) ### Description For the conv2dByMatMul path, the simulated matmul output shape is the reshape of the original conv2d. So we should pass this information to `createMatmulProgramInfo` so that it can process it correctly.	2023-09-02 00:16:28 -07:00
Tianlei Wu	e745575187	fix assert error in attention fusion script (#17375 ) Add a check of num_heads and hidden_size to avoid assert error (https://github.com/microsoft/onnxruntime/issues/17254)	2023-09-01 08:18:50 -07:00
Tianlei Wu	e23f16adbf	output all parameters in the bert_perf_test tool (#17379 ) Currently, some parameters are missing from the output file. This change outputs all parameters. Example output: Latency(ms) \| Latency_P50 \| Latency_P75 \| Latency_P90 \| Latency_P95 \| Latency_P99 \| Throughput(QPS) \| model \| graph_optimization_level \| intra_op_num_threads \| batch_size \| sequence_length \| test_cases \| test_times \| use_gpu \| use_io_binding \| average_sequence_length \| random_sequence_length -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- 10.91 \| 11.16 \| 11.3 \| 11.7 \| 11.78 \| 11.84 \| 91.66 \| model.onnx \| ENABLE_ALL \| 4 \| 1 \| 512 \| 1 \| 10 \| TRUE \| TRUE \| 64 \| FALSE	2023-09-01 08:17:58 -07:00
Baiju Meswani	8b98ecad70	Change RuntimeError to ImportError (#17380 ) The `onnxruntime-validation` for ORTModule checks for `ImportError`: `44101e8771/onnxruntime/python/onnxruntime_validation.py (L73-L75)` If any other kind of error is raised, it does not silently fail and will raise an exception. This causes a problem when ortmodule is explicitly not made available on win/mac packages since we currently raise a RuntimeError. Resolves issue: https://github.com/microsoft/onnxruntime-training-examples/issues/161	2023-09-01 09:56:40 +08:00
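Why the exception type matters in the commit above can be shown with a minimal sketch. The function `is_optional_module_available` and the two loader callables are hypothetical stand-ins, not code from `onnxruntime_validation.py`; they only mimic a check that tolerates `ImportError` and nothing else.

```python
def is_optional_module_available(loader):
    """Return True if `loader()` succeeds, False if it raises ImportError.

    Any other exception type is deliberately not caught and propagates.
    """
    try:
        loader()
    except ImportError:
        return False
    return True


def loader_raising_import_error():
    # Behaviour after the change: the missing module reports ImportError.
    raise ImportError("ortmodule is not available in this package")


def loader_raising_runtime_error():
    # Behaviour before the change: RuntimeError escapes the check above.
    raise RuntimeError("ortmodule is not available in this package")


available = is_optional_module_available(loader_raising_import_error)  # False

try:
    is_optional_module_available(loader_raising_runtime_error)
    runtime_error_escaped = False
except RuntimeError:
    runtime_error_escaped = True
```

A caller that only expects `ImportError` can skip the optional module quietly in the first case, but is interrupted by an unhandled exception in the second.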
Rachel Guo	16cfcd0590	Fix NNAPI optional input handling checks and unblock Android CI pipeline test failures (#17358 ) ### Description - Fix missing optional input checks originally coming from a github issue for no shape on Resize Op. - Exclude Antialias support for Opset 18 + Resize for NNAPI - Unblock Android CI pipeline tests failure. ### Motivation and Context Bug fixes. Issue: https://github.com/microsoft/onnxruntime/issues/17035 thanks @skottmckay for pointing out the cause. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2023-08-31 16:40:22 -07:00
aciddelgado	44101e8771	Flash Attention v2 MHA (#17227 ) ### Description Integrate Flash Attention V2 to PackedMultiHeadAttention, MultiHeadAttention and Attention operators. Flash Attention v2 source code is from https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src. We made some changes to remove the dependency on Torch, then removed backward and bfloat16 related code. Add benchmark script (see benchmark_mha.sh) to compare different attention kernels for MultiHeadAttention operator. Current limitations for Flash Attention in PackedMultiHeadAttention, MultiHeadAttention and Attention operators: * Relative Position Bias is not supported * Different hidden size for Q and V is not supported * Only float16 is supported * Padding/attention mask is not supported * For MultiHeadAttention, when there is past or present input, bias shall be provided to activate flash attention * For Attention, past or present inputs will deactivate flash attention * Causal is not supported Some limitations (like attention mask and causal) might be removed later. Currently, Flash Attention v2 only works on Linux. For Windows, we will enable it later with Cutlass 3.2. Two environment variables can be used for testing purposes: (1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default value is 0 (enable). Set it to "1" to disable it. (2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is "513", which means that we only enable flash attention when sequence length is larger than 512 for packed QKV format. Set it to "0" if you want to use flash attention v2 whenever possible. ### Speedup The following result is from Standard_ND96amsr_A100_v4 VM (A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per second for MultiHeadAttention operator. 
There are 3 input formats: * `Q,K,V` means separated inputs query, key and value of BxSxNH * `Q,KV` means packed KV, where key is 5D: BxSxNx2xH * `QKV` means packed QKV, where query is 5D: BxSxNx3xH Note that flash attention cannot use packed QKV format, so extra Transpose is needed. We found that TensorRT kernel is faster for sequence length <= 512 for packed QKV. The reason might be no transpose is needed for TensorRT kernel in this format. We also notice that, TensorRT kernel is faster for stable diffusion 512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while flash attention v2 is faster for 1024x1024 image (see seq_len=16384, heads=8, head_dim=40 below). input format \| batch size \| sequence length \| heads \| head dim \| flash_v2 (TFLOPs/s) \| TensorRT (TFLOPs/s) \| Memory Efficient Attention (TFLOPs/s) -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Q,K,V \| 32 \| 512 \| 64 \| 32 \| 78.1 \| 60.0 \| 39.3 Q,K,V \| 32 \| 512 \| 128 \| 16 \| 46.8 \| 44.1 \| 21.7 Q,K,V \| 16 \| 1024 \| 64 \| 32 \| 99.0 \| 72.8 \| 44.3 Q,K,V \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 49.2 \| 23.4 Q,K,V \| 8 \| 2048 \| 64 \| 32 \| 113.8 \| 81.2 \| 47.8 Q,K,V \| 8 \| 2048 \| 128 \| 16 \| 59.7 \| 51.9 \| 24.7 Q,K,V \| 4 \| 4096 \| 64 \| 32 \| 122.5 \| 85.6 \| 49.7 Q,K,V \| 4 \| 4096 \| 128 \| 16 \| 62.5 \| 53.3 \| 25.3 Q,K,V \| 2 \| 8192 \| 64 \| 32 \| 127.4 \| 87.5 \| 50.7 Q,K,V \| 2 \| 8192 \| 128 \| 16 \| 64.0 \| 54.2 \| 25.6 Q,K,V \| 1 \| 16384 \| 64 \| 32 \| 129.5 \| 91.0 \| 51.2 Q,K,V \| 1 \| 16384 \| 128 \| 16 \| 64.7 \| 54.5 \| 25.8 Q,K,V \| 1 \| 4096 \| 8 \| 40 \| 51.0 \| 43.6 \| 36.8 Q,K,V \| 1 \| 4096 \| 8 \| 80 \| 97.7 \| 77.0 \| 55.5 Q,K,V \| 1 \| 4096 \| 8 \| 160 \| 120.0 \| 39.7 \| 57.8 Q,K,V \| 4 \| 4096 \| 8 \| 40 \| 89.0 \| 84.4 \| 49.2 Q,K,V \| 4 \| 4096 \| 8 \| 80 \| 133.0 \| 92.2 \| 63.2 Q,K,V \| 4 \| 4096 \| 8 \| 160 \| 164.8 \| 42.7 \| 63.8 Q,K,V \| 1 \| 16384 \| 8 \| 40 \| 96.9 \| 91.3 \| 52.1 Q,K,V \| 1 \| 16384 \| 8 \| 80 \| 142.9 \| 101.5 \| 65.6 
Q,K,V \| 1 \| 16384 \| 8 \| 160 \| 177.4 \| 44.2 \| 65.7 Q,K,V \| 128 \| 128 \| 12 \| 64 \| 29.0 \| 26.9 \| 25.7 Q,K,V \| 64 \| 128 \| 12 \| 64 \| 23.1 \| 10.8 \| 21.3 Q,K,V \| 128 \| 384 \| 12 \| 64 \| 83.5 \| 60.8 \| 55.7 Q,K,V \| 64 \| 384 \| 12 \| 64 \| 72.6 \| 40.5 \| 52.8 Q,K,V \| 128 \| 512 \| 12 \| 64 \| 98.9 \| 77.9 \| 62.1 Q,K,V \| 64 \| 512 \| 12 \| 64 \| 94.7 \| 75.6 \| 60.4 Q,KV \| 32 \| 512 \| 64 \| 32 \| 85.9 \| 41.1 \| 41.1 Q,KV \| 32 \| 512 \| 128 \| 16 \| 47.1 \| 21.6 \| 21.6 Q,KV \| 16 \| 1024 \| 64 \| 32 \| 104.4 \| 45.8 \| 45.8 Q,KV \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 23.6 \| 23.6 Q,KV \| 8 \| 2048 \| 64 \| 32 \| 116.8 \| 48.5 \| 48.5 Q,KV \| 8 \| 2048 \| 128 \| 16 \| 59.8 \| 24.7 \| 24.7 Q,KV \| 4 \| 4096 \| 64 \| 32 \| 124.2 \| 50.1 \| 50.1 Q,KV \| 4 \| 4096 \| 128 \| 16 \| 62.6 \| 25.3 \| 25.3 Q,KV \| 2 \| 8192 \| 64 \| 32 \| 128.5 \| 50.8 \| 50.9 Q,KV \| 2 \| 8192 \| 128 \| 16 \| 64.1 \| 25.6 \| 25.6 Q,KV \| 1 \| 16384 \| 64 \| 32 \| 129.4 \| 51.2 \| 51.2 Q,KV \| 1 \| 16384 \| 128 \| 16 \| 64.8 \| 25.8 \| 25.8 Q,KV \| 1 \| 4096 \| 8 \| 40 \| 67.5 \| 37.7 \| 37.5 Q,KV \| 1 \| 4096 \| 8 \| 80 \| 101.3 \| 56.7 \| 56.6 Q,KV \| 1 \| 4096 \| 8 \| 160 \| 124.0 \| 58.6 \| 58.6 Q,KV \| 4 \| 4096 \| 8 \| 40 \| 90.8 \| 49.8 \| 49.8 Q,KV \| 4 \| 4096 \| 8 \| 80 \| 135.6 \| 63.8 \| 63.8 Q,KV \| 4 \| 4096 \| 8 \| 160 \| 166.3 \| 64.5 \| 64.5 Q,KV \| 1 \| 16384 \| 8 \| 40 \| 97.5 \| 52.3 \| 52.3 Q,KV \| 1 \| 16384 \| 8 \| 80 \| 143.5 \| 65.9 \| 65.8 Q,KV \| 1 \| 16384 \| 8 \| 160 \| 178.4 \| 65.9 \| 65.8 Q,KV \| 128 \| 128 \| 12 \| 64 \| 26.8 \| 48.1 \| 30.9 Q,KV \| 64 \| 128 \| 12 \| 64 \| 28.0 \| 38.9 \| 25.0 Q,KV \| 128 \| 384 \| 12 \| 64 \| 97.7 \| 61.1 \| 61.0 Q,KV \| 64 \| 384 \| 12 \| 64 \| 89.5 \| 57.8 \| 57.9 Q,KV \| 128 \| 512 \| 12 \| 64 \| 111.9 \| 66.7 \| 66.9 Q,KV \| 64 \| 512 \| 12 \| 64 \| 107.2 \| 64.9 \| 64.8 QKV \| 32 \| 512 \| 64 \| 32 \| 77.2 \| 84.7 \| 39.3 QKV \| 32 \| 512 \| 128 \| 16 \| 43.4 \| 53.1 \| 20.9 QKV \| 16 \| 1024 \| 
64 \| 32 \| 98.8 \| 87.4 \| 44.6 QKV \| 16 \| 1024 \| 128 \| 16 \| 52.0 \| 54.1 \| 23.2 QKV \| 8 \| 2048 \| 64 \| 32 \| 113.1 \| 89.0 \| 47.9 QKV \| 8 \| 2048 \| 128 \| 16 \| 58.2 \| 54.6 \| 24.5 QKV \| 4 \| 4096 \| 64 \| 32 \| 120.6 \| 89.7 \| 49.7 QKV \| 4 \| 4096 \| 128 \| 16 \| 61.7 \| 54.6 \| 25.2 QKV \| 2 \| 8192 \| 64 \| 32 \| 125.9 \| 89.5 \| 50.7 QKV \| 2 \| 8192 \| 128 \| 16 \| 63.6 \| 54.8 \| 25.5 QKV \| 1 \| 16384 \| 64 \| 32 \| 128.5 \| 92.0 \| 51.2 QKV \| 1 \| 16384 \| 128 \| 16 \| 64.6 \| 54.8 \| 25.7 QKV \| 1 \| 4096 \| 8 \| 40 \| 60.2 \| 69.8 \| 38.1 QKV \| 1 \| 4096 \| 8 \| 80 \| 101.6 \| 75.2 \| 56.7 QKV \| 1 \| 4096 \| 8 \| 160 \| 130.2 \| 41.2 \| 58.4 QKV \| 4 \| 4096 \| 8 \| 40 \| 90.6 \| 91.0 \| 49.5 QKV \| 4 \| 4096 \| 8 \| 80 \| 133.6 \| 98.1 \| 62.8 QKV \| 4 \| 4096 \| 8 \| 160 \| 165.3 \| 43.7 \| 63.9 QKV \| 1 \| 16384 \| 8 \| 40 \| 97.2 \| 92.8 \| 52.1 QKV \| 1 \| 16384 \| 8 \| 80 \| 143.0 \| 103.1 \| 65.6 QKV \| 1 \| 16384 \| 8 \| 160 \| 177.6 \| 44.5 \| 65.7 QKV \| 128 \| 128 \| 12 \| 64 \| 31.1 \| 65.9 \| 27.6 QKV \| 64 \| 128 \| 12 \| 64 \| 26.1 \| 49.8 \| 23.5 QKV \| 128 \| 384 \| 12 \| 64 \| 84.6 \| 88.5 \| 56.1 QKV \| 64 \| 384 \| 12 \| 64 \| 79.1 \| 80.3 \| 53.5 QKV \| 128 \| 512 \| 12 \| 64 \| 97.3 \| 114.2 \| 62.2 QKV \| 64 \| 512 \| 12 \| 64 \| 95.9 \| 110.7 \| 60.6 QKV \| 4 \| 2048 \| 32 \| 128 \| 125.26 \| 44.72 \| 78.15 QKV \| 4 \| 4096 \| 32 \| 128 \| 141.62 \| 46.29 \| 85.84 QKV \| 8 \| 2048 \| 32 \| 128 \| 127.40 \| 45.49 \| 78.75 QKV \| 8 \| 4096 \| 32 \| 128 \| 144.24 \| 46.60 \| 86.95 ### Known Issues NVCC uses a huge amount of memory while compiling the flash attention CUDA kernels. A Linux build with CUDA might fail when the machine has limited memory while the number of CPUs is large. The workaround is to use a build machine with more memory, or use an argument like `--nvcc_threads 1` to limit nvcc threads in the build. ### Motivation and Context Increases speed and efficiency of MHA or Packed MHA. 
--------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-08-31 13:52:21 -07:00
Rachel Guo	b54619509f	Refine build script for adding disable selected data types option (#17284 ) ### Description As title. ### Motivation and Context Now we have multiple data types that we want to disable for minimal build and to reduce binary size. It may be worth adding an argument in the build script for specifying that. Also, for fp16 types, it may be too restrictive to disable them for all minimal builds. --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2023-08-31 13:32:55 -07:00
Chi Lo	30a450dcf8	Fix TRT EP's cuda graph feature (#17355 ) When users run inference with cuda graph enabled from multiple threads, only the main thread that created the inference session successfully initializes a cuda graph instance. Other threads that execute the inference run directly hit a segfault because the allocation/initialization for their cuda graph instance is never called. This PR fixes this issue.	2023-08-31 13:32:15 -07:00
Changming Sun	ae90b716ff	Change _mm512_setzero to _mm512_setzero_ps (#17362 ) ### Description _mm512_setzero is just an alias of _mm512_setzero_ps, and it is a wrong one. See: https://gcc.gnu.org/legacy-ml/gcc-patches/2018-05/msg00338.html And https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avx512fintrin.h	2023-08-31 13:11:44 -07:00
Wanming Lin	3a53836836	[WebNN EP] Fix compilation with newer flatbuffers (#17367 )	2023-08-31 10:22:15 -07:00
Jian Chen	e60493525f	[js/webgpu] Adding support for abs with int32 type (#17359 )	2023-08-31 08:13:54 -07:00
Jiajia Qin	352b745deb	[js/webgpu] Add input/output shapes information to profiling (#17342 ) ### Description This PR is to enhance the profiling information. With the PR, the profiling result is like below: ``` [profiling] kernel "[Split] 51288384" input[0]: 1,256,64,64, output[0]: 1,256,64,64, execution time: 37135 ns program-manager.ts:114 [profiling] kernel "[Concat] 52361040" input[0]: 1,256,64,64, output[0]: 1,256,64,64, execution time: 50833 ns program-manager.ts:114 [profiling] kernel "[Transpose] 52375264" input[0]: 1,256,64,64, output[0]: 1,64,64,256, execution time: 99791 ns program-manager.ts:114 [profiling] kernel "[Sub] 51098472" input[0]: , input[1]: 1, output[0]: 1, execution time: 7448 ns program-manager.ts:114 [profiling] kernel "[Mul] 51344440" input[0]: 1, input[1]: 1,256,1,1, output[0]: 1,256,1,1, execution time: 8334 ns ``` Without this PR, the profiling result is like below: ``` [profiling] kernel "52097928\|[Split] 52097928" execution time: 37760 ns program-manager.ts:105 [profiling] kernel "41898328\|[Concat] 41898328" execution time: 51666 ns program-manager.ts:105 [profiling] kernel "41915648\|[Transpose] 41915648" execution time: 95416 ns program-manager.ts:105 [profiling] kernel "49757856\|[Sub] 49757856" execution time: 7969 ns program-manager.ts:105 [profiling] kernel "51680504\|[Mul] 51680504" execution time: 8906 ns ``` With the new information, we can easily know what kind of shape ops have poor performance. Also it can help us to check whether too small shape ops run on gpu.	2023-08-31 08:12:28 -07:00
Changming Sun	bbf28f09f2	Fix a build warning: a constexpr function calls a non-constexpr function (#17363 ) ### Description The warning is: ``` /onnxruntime_src/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1202:41: error: call to non-‘constexpr’ function ‘bool onnx_transpose_optimization::TransposeQuantizeDequantizeAxis(const onnx_transpose_optimization::api::GraphRef&, const std::vector<long int>&, onnx_transpose_optimization::api::NodeRef&)’ return TransposeQuantizeDequantizeAxis(graph, perm, node); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ ``` The function TransposeQuantizeDequantizeAxis is not constexpr.	2023-08-31 06:56:27 -07:00
pengwa	58af36b49a	Fuse ScaledSum and its backward BatchScale (#16517 ) ### Fuse ScaledSum and its backward BatchScale For deberta models, there is a pattern a / scalar_0 + b / scalar_1 + c / scalar_2 We can fuse this into ScaledSum operator, taking 2(or 3) inputs, and 2(or 3) attributes scalar, generating one output. For the backward, the gradient of a, b and c will be computed with BatchScale. ### Benchmark on 8x32GV100 ```bash torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 400 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json --per_device_train_batch_size 10 ``` #### Main Branch ``` Total overhead: 127954ms where export takes 116489ms. epoch = 14.29 train_loss = 4.9803 train_runtime = 0:10:27.29 train_samples = 2223 train_samples_per_second = 51.013 train_steps_per_second = 0.638 throughput per GPU = 14.29* 2223/ (627.29 - 127.954) / 8 (gpu) = 7.952 samples/second ``` #### This PR ``` Total overhead: 128761ms where export takes 118510ms. *** train metrics *** epoch = 14.29 train_loss = 4.6144 train_runtime = 0:10:04.31 train_samples = 2223 train_samples_per_second = 52.953 train_steps_per_second = 0.662 throughput per GPU = 14.29*2223 / (604.31 - 128.761) / 8 = 8.350 samples/second ``` 5.x% performance gains.	2023-08-31 14:55:27 +08:00
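The per-GPU throughput figures quoted in the commit above can be reproduced with a short worked calculation. The numbers come from the two logs; the helper name `throughput_per_gpu` is an assumption for illustration. The formula subtracts the one-time overhead (which includes export) from the total runtime before dividing.

```python
def throughput_per_gpu(epochs, train_samples, runtime_s, overhead_s, num_gpus):
    """Samples per second per GPU, excluding the one-time overhead."""
    return epochs * train_samples / (runtime_s - overhead_s) / num_gpus


# Main branch: train_runtime 0:10:27.29 = 627.29 s, total overhead 127954 ms.
main_branch = throughput_per_gpu(14.29, 2223, 627.29, 127.954, 8)

# This PR: train_runtime 0:10:04.31 = 604.31 s, total overhead 128761 ms.
this_pr = throughput_per_gpu(14.29, 2223, 604.31, 128.761, 8)

# Relative gain of the fused ScaledSum/BatchScale path over the main branch.
gain_percent = (this_pr / main_branch - 1.0) * 100.0

print(f"main: {main_branch:.3f}, PR: {this_pr:.3f}, gain: {gain_percent:.1f}%")
```

The two results round to the 7.952 and 8.350 samples/second reported in the message, which corresponds to a gain of about 5%.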
Tianlei Wu	c11ed065ba	Fix SkipLayerNorm fusion in transformer optimizer (#17320 ) ### Description Fix issues: (1) When the output of Add before LayerNormalization node is a graph output, we shall output it in SkipLayerNormalization, but currently not. (2) When there is Cast before Add bias, the Cast output (instead of input) shall be used as SkipLayerNormalization input. (3) The skip input is not at the second input of fused node. According to op spec, skip shall be the second. It could bring issue when we add skip broadcasting support later. ### Motivation and Context Fusion for Clip model of SDXL failed since the last hidden state is a graph output.	2023-08-30 21:12:18 -07:00
Yi Zhang	507a40e1e9	Add compiler cache in Linux GPU TensorRT CI. (#17348 ) ### Description Add the compiler cache in the Linux GPU TensorRT CI. Saves about 30 minutes on the GPU machine (52 minutes -> 24 minutes). PS. There are only whitespace differences in the dockerfile.	2023-08-31 08:13:26 +08:00
Hector Li	47fe7fe900	Enable QDQ node unit support for Log op (#17354 ) ### Description Enable QDQ node unit support for Log op	2023-08-30 16:23:33 -07:00
cao lei	64f06d0b4a	only Flush once for the same stream in copyInputAcrossDevice() (#17303 ) ### Description In the CopyInputAcrossDevice() function, we assign each feed a stream to copy across devices. Once the copy is done, each stream triggers the Flush() function, which is undesired: the same stream should be flushed only once. ### Motivation and Context This change addresses a perf issue in TLNGv4 inference, which contains a subgraph with many input feeds.	2023-08-30 16:10:26 -07:00
Wanming Lin	70e8c23944	[WebNN EP] Fix bug in interpreting ONNX's pads into WebNN's padding (#17325 ) The ONNX's pads is [beginning_height, beginning_width, ending_height, ending_width], while WebNN's padding is [beginning_height, ending_height, beginning_width, ending_width]. We should permute the ONNX's pads to [0, 2, 1, 3] for WebNN.	2023-08-30 13:15:39 -07:00
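The permutation described in the commit above can be sketched in a few lines. The function name `onnx_pads_to_webnn_padding` is hypothetical and the code is illustrative Python, not the EP's C++ implementation; it covers only the 2-D (four-value) case named in the message.

```python
def onnx_pads_to_webnn_padding(pads):
    """Reorder 2-D ONNX pads into WebNN padding order.

    ONNX:  [beginning_height, beginning_width, ending_height, ending_width]
    WebNN: [beginning_height, ending_height, beginning_width, ending_width]
    """
    if len(pads) != 4:
        raise ValueError("expected four pad values for a 2-D operator")
    # Pick the ONNX entries in index order [0, 2, 1, 3].
    return [pads[i] for i in (0, 2, 1, 3)]


onnx_pads = [1, 2, 3, 4]  # begin_h=1, begin_w=2, end_h=3, end_w=4
webnn_padding = onnx_pads_to_webnn_padding(onnx_pads)  # [1, 3, 2, 4]
```

The permutation swaps only the two middle entries, so applying it twice returns the original order.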
RandySheriffH	6c39641ea2	Fix a memleak in RunAsync python (#17326 ) Release ort value outputs that are created and released from ort::run(...). --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-08-30 12:54:17 -07:00
Jian Chen	081c0692a4	Update to nodejs version from 16 to 18.17.1 (#17351 ) ### Description Update the Node.js version from 16 to 18.17.1 ### Motivation and Context Node.js 16 will reach EOL in September 2023	2023-08-30 12:41:48 -07:00
Nat Kershaw (MSFT)	2da08c477a	Add website publish placeholder (#17318 )	2023-08-30 11:01:54 -07:00
Changming Sun	71da0824f3	Upgrade binskim and fix an error in nuget packaging pipeline (#17340 ) ### Description Upgrade binskim and fix an error in nuget packaging pipeline.	2023-08-30 07:52:06 -07:00
Adrian Lizarraga	21ae86e405	[QNN EP] Fix test zero-point calculation and flaky MatMul test (#17338 ) ### Description - Fix incorrect zero-point calculation in unit tests. Affects int8(signed) QDQ models. - Replace flaky MatMul test that occasionally fails on main branch with a version that uses explicit inputs. ### Motivation and Context Fix bug and improve test accuracy and stability.	2023-08-29 23:16:57 -07:00
Jian Chen	922629aad8	Upgrade CentOS7 to AlmaLinux8 (#16907 ) ### Motivation and Context Get the latest gcc 12 by default --------- Co-authored-by: Changming Sun <chasun@microsoft.com>	2023-08-29 21:05:36 -07:00
Tianlei Wu	c961f67b5e	Handle dtype attribute in float16 conversion script (#17321 ) Some operators have a dtype attribute (search `dtype` in https://github.com/onnx/onnx/blob/main/docs/Operators.md). This change makes sure the dtype attribute is handled correctly in float16 conversion.	2023-08-29 18:41:56 -07:00
Adam Louly	8224891236	add logits option to generate artifacts (#17276 ) ### Description Add the ability to export logits as an output for train and eval graphs in generate_artifacts. It will remain optional.	2023-08-29 16:55:31 -07:00
cloudhan	f3682eee3b	Fix log color, otherwise, the immediate line followed by the colored log will be tainted (#17329 )	2023-08-30 07:46:04 +08:00
Ryan Hill	c438360c1e	Noticed a simple simplification in beam_search_topk (#17275 ) ### Description There was an Init() method that does exactly what the lines I replaced did, so I switched to it. ### Motivation and Context Simpler with no drawbacks.	2023-08-29 15:17:33 -07:00
Yi Zhang	d4a61ac71f	PR triggers generated by code (#17247 ) ### Description 1. Refactor the trigger rules generation. 2. Skip all doc changes in PR pipelines. ### Motivation and Context Make all trigger rules generated by running set-trigger-rules.py to reduce inconsistencies. It's easy to make mistakes when copying and pasting manually. For example: these 2 excludes are different. Why? `4e6cec4d09/tools/ci_build/github/azure-pipelines/linux-ci-pipeline.yml (L16-L18)` `4e6cec4d09/tools/ci_build/github/azure-pipelines/linux-gpu-ci-pipeline.yml (L27-L29)` ### Note All changes in workflow yamls are generated by code. Please review the skip-js.yml, skip-docs.yml and set-trigger-rules.py. @fs-eire, please double check the filter rules in skip-js.yml and the skipped workflows `7023c2edff/tools/ci_build/set-trigger-rules.py (L14-L41)`	2023-08-30 05:57:03 +08:00
AtanasDimitrovQC	fd0917b27b	Propagate noop_with_empty_axes in reduce operators. (#16845 )	2023-08-29 14:15:03 -07:00
kushalpatil07	7b92057376	EvalStep called with wrong inputs onnxruntime_training_cxx_inline.h (#17331 )	2023-08-29 14:14:35 -07:00
Yulong Wang	e5ca3f3dcb	[js/api] introducing IO binding for tensor (#16452 ) ### Description This PR adds a few properties, methods and factories to the Tensor type to support the IO-binding feature. This will allow users to create tensors from GPU/CPU bound data without forcing a transfer of data between CPU and GPU. This change is a way to resolve #15312 ### Change Summary 1. Add properties to `Tensor` type: a. `location`: indicates where the data is located. Valid values are `cpu`, `cpu-pinned`, `texture`, `gpu-buffer`. b. `texture`: sits alongside `data`, a readonly property of `WebGLTexture` type. Available only when `location === 'texture'` c. `gpuBuffer`: sits alongside `data`, a readonly property of `GPUBuffer` type. Available only when `location === 'gpu-buffer'` 2. Add methods to `Tensor` type (usually dealing with inference outputs): - async function `getData()` allows user to download data from GPU to CPU manually. - function `dispose()` allows user to release GPU resources manually. 3. Add factories for creating `Tensor` instances: a. `fromTexture()` to create a WebGL texture bound tensor data b. `fromGpuBuffer()` to create a WebGPUBuffer bound tensor data c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer ### Examples: create tensors from texture and pass to inference session as inputs ```js // when create session, specify we prefer 'image_output:0' to be stored on GPU as texture const session = await InferenceSession.create('./my_model.onnx', { executionProviders: [ 'webgl' ], preferredOutputLocation: { 'image_output:0': 'texture' } }); ... const myImageTexture = getTexture(); // user's function to get a texture const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format. const results = await session.run(myFeeds); const myOutputTexture = results['image_output:0'].texture; ```	2023-08-29 12:58:26 -07:00


9547 commits