onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 17:15:29 +00:00

Author	SHA1	Message	Date
Wanming Lin	41ad83fb00	[WebNN EP] Support rest Reduction ops for TFLite backend (#21135 ) - reduceLogSum, reduceLogSumExp and reduceSumSquare have been landed in https://chromium-review.googlesource.com/c/chromium/src/+/5575815 - reduceL1 and reduceL2 have been landed in https://chromium-review.googlesource.com/c/chromium/src/+/5606091	2024-06-25 18:30:55 -07:00
Wanming Lin	4743803944	[WebNN EP] Support more Normalization ops for TFLite backend (#21151 ) Following Normalization ops have been supported in Chromium for TFLite backend: - batchNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5532745 - layerNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5573326 - instanceNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5532750	2024-06-24 19:04:23 -07:00
Jian Chen	f81c0ec32a	Remove warning suppression from Java Packaging pipeline. (#21010 ) ### Description Remove warning suppression from Java Packaging pipeline. ### Motivation and Context We want the CI step not to produce warnings.	2024-06-24 16:46:21 -07:00
mindest	adaf0e8116	[Fix] USE_NCCL -> ORT_USE_NCCL (#21136 ) ### Description Correct the macro used when NCCL is enabled.	2024-06-24 11:33:17 -07:00
Wanming Lin	3a917e49fb	[WebNN EP] Support 4 more ops for TFLite backend (#21134 ) Recently WebNN TFLite backend supports gelu, expand, softsign, reciprocal.	2024-06-24 09:52:12 -07:00
aciddelgado	ebd0368bb0	Make Flash Attention work on Windows (#21015 ) ### Description Previously, Flash Attention only worked on Linux systems. This PR will make it work and enable it to be built and run on Windows. Limitations of Flash Attention in Windows: Requires CUDA 12. ### Motivation and Context This will significantly increase the performance of Windows-based LLM's with hardware sm>=80. To illustrate the improvement of Flash Attention over Memory Efficient Attention, here are some average benchmark numbers for the GQA operator, run with configurations based on several recent models (Llama, Mixtral, Phi-3). The benchmarks were obtained on RTX4090 GPU using the test script located at (onnxruntime/test/python/transformers/benchmark_gqa_windows.py). * Clarifying Note: These benchmarks are just for the GQA operator, not the entire model. ### Memory Efficient Attention Kernel Benchmarks: \| Model Name \| Max Sequence Length \| Inference Interval (ms) \| Throughput (samples/second) \| \|----------------------------------------\|---------------------\|-------------------------\|-----------------------------\| \| Llama3-8B (Average Prompt) \| 8192 \| 0.19790525 \| 13105.63425 \| \| Llama3-8B (Average Token) \| 8192 \| 0.207775538 \| 12025.10172 \| \| Llama3-70B (Average Prompt) \| 8192 \| 0.216049167 \| 11563.31185 \| \| Llama3-70B (Average Token) \| 8192 \| 0.209730731 \| 12284.38149 \| \| Mixtral-8x22B-v0.1 (Average Prompt) \| 32768 \| 0.371928785 \| 7031.440056 \| \| Mixtral-8x22B-v0.1 (Average Token) \| 32768 \| 0.2996659 \| 7607.947159 \| \| Phi-3-mini-128k (Average Prompt) \| 131072 \| 0.183195867 \| 15542.0852 \| \| Phi-3-mini-128k (Average Token) \| 131072 \| 0.198215688 \| 12874.53494 \| \| Phi-3-small-128k (Average Prompt) \| 65536 \| 2.9884929 \| 2332.584142 \| \| Phi-3-small-128k (Average Token) \| 65536 \| 0.845072406 \| 2877.85822 \| \| Phi-3-medium-128K (Average Prompt) \| 32768 \| 0.324974429 \| 8094.909517 \| \| Phi-3-medium-128K (Average Token) \| 32768 \| 0.263662567 \| 8978.463687 \| ### Flash Attention Kernel Benchmarks: \| Model Name \| Max Sequence Length \| Inference Interval (ms) \| Throughput (samples/second) \| \|--------------------------------------\|---------------------\|-------------------------\|-----------------------------\| \| Llama3-8B (Average Prompt) \| 8192 \| 0.163566292 \| 16213.69057 \| \| Llama3-8B (Average Token) \| 8192 \| 0.161643692 \| 16196.14715 \| \| Llama3-70B (Average Prompt) \| 8192 \| 0.160510375 \| 17448.67753 \| \| Llama3-70B (Average Token) \| 8192 \| 0.169427308 \| 14702.62043 \| \| Mixtral-8x22B-v0.1 (Average Prompt) \| 32768 \| 0.164121964 \| 15618.51301 \| \| Mixtral-8x22B-v0.1 (Average Token) \| 32768 \| 0.1715865 \| 14524.32273 \| \| Phi-3-mini-128k (Average Prompt) \| 131072 \| 0.167527167 \| 14576.725 \| \| Phi-3-mini-128k (Average Token) \| 131072 \| 0.175940594 \| 15762.051 \| \| Phi-3-small-128k (Average Prompt) \| 65536 \| 0.162719733 \| 17824.494 \| \| Phi-3-small-128k (Average Token) \| 65536 \| 0.14977525 \| 16749.19858 \| \| Phi-3-medium-128K (Average Prompt) \| 32768 \| 0.156490786 \| 17679.2513 \| \| Phi-3-medium-128K (Average Token) \| 32768 \| 0.165333833 \| 14932.26079 \| Flash Attention is consistently faster for every configuration we benchmarked, with improvements in our trials ranging from ~20% to ~650%. In addition to these improvements in performance, Flash Attention has better memory usage. For example, Memory Efficient Attention cannot handle a max sequence length higher than 32,768, but Flash Attention can handle max sequence lengths at least as high as 131,072. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2024-06-24 09:43:49 -07:00
zhijiang	269d9b094f	Zhijxu/fix softmax cudnn bf16 (#21045 ) If seq > 2048, ORT falls back to the cuDNN version, but when dtype is bf16 it throws an exception. This PR tries to fix that.	2024-06-24 16:07:39 +08:00
Yi Zhang	5b5ce0bfb0	Add UsePython Task in Nuget Publish workflow (#21144 ) ### Description Otherwise it would fail in `b95982e588/tools/ci_build/github/azure-pipelines/publish-nuget.yml (L78-L81)` ### Motivation and Context The Windows CPU image is migrated to managed image ### Verification Link https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1313	2024-06-24 13:36:13 +08:00
Dmitri Smirnov	b95982e588	Fix 2D detection bug (#21128 ) ### Description Should compare two leading dims for 1.f ### Motivation and Context Vulnerability scanner	2024-06-21 13:58:21 -07:00
Dwayne Robinson	ac21626725	DML EP EinSum make more generic to avoid EP fallback (#21114 ) ### Problem Newer models using more novel equations (e.g. `bhwc,hkc->bhwk` in Segment Anything's encoder or `bqc,bchw->bqhw`) cause fallback from DML to CPU, yielding performance issues. The EP had some pattern matching to map more common equations to existing DML operators, but the number of permutations was prohibitive and could not catch them all. ### Solution So, ditch the static mapping, and instead handle any 1-input or 2-input cases via remapped strides and a mini-graph of elementwise multiplication & sum reduction (as if DML had a `DML_OPERATOR_DOT_PRODUCT` that took `axes`). A subset of mappings still exist for performance (GEMM, pure reduction, transpose...), but they are identified generally rather than via a pattern table. Also... - Diagonals are supported now (e.g. iji->i). - Removes any remaining DML-specific EinSum `GTEST_SKIP` statements. - Handles any cases up to 8 unique labels (DML dimension limit is 8D). - \>= 3 inputs and arbitrary size inputs via ellipsis are not handled, but we have yet to come across a model.	2024-06-21 11:46:16 -07:00
Caroline Zhu	6236707c64	Enable >2GB models + allow model paths to be passed for generate_artifacts API (#20958 ) ### Description Alternative design from #20942 Allow users to pass in a model path for the generate_artifacts API. ### Motivation and Context - ONNX API calls such as the onnx checker + shape inference fail when given a model > 2GB, but work if a path to a model >2GB is passed in.	2024-06-21 09:55:26 -07:00
RuomeiMS	7cf9263ee7	Add changes for strided calibration (#20949 ) Context and motivation: When quantizing large transformer models, we hit OOM issues as the number of calibration samples went up. To resolve this, this PR adds support for reading quantization data in chunks, calculating ranges for intermediate tensors, then accumulating results for the final ranges.	2024-06-21 08:23:23 -07:00
Changming Sun	f5625b8858	Revert "[MIGraphX EP] enable compilation and execution on Windows (21084)" (#21132 ) ### Description This reverts commit `1d7bf56947` because it broke the AMD GPU CI pipeline. Sorry, when I reviewed the PR I forgot to run the AMD GPU CI pipeline. Will revert the PR first, then ask the author to fix the issue.	2024-06-21 01:01:07 -07:00
Yi Zhang	69d522f4e9	[Fix] use cmdline in Final Jar Testing Stage for new managed Windows Image (#21130 ) ### Description No bash command in Managed Windows image. Use CmdLine step instead. ### Verified Link https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=491902&view=logs&j=f1f8e11e-a9fa-53e5-cd29-3ba2c1988550	2024-06-21 12:41:06 +08:00
Jake Mathern	b9eb1dc21e	Update protobuf_cmake.patch to allow extra disablements configurable by projects that build ORT (#20875 ) ### Description Update protobuf_cmake.patch to allow extra disablements. ORT repo already patches protobuf to not disable the warning 4996. ### Motivation and Context To meet SDL requirements, Microsoft repos have to fail the build if there is warning 4996. Binskim also gives errors if warning 4996 is disabled. We can suppress the Binskim issues, but we need a way to disable the warnings for the minimal set of code that has them. Right now, WindowsAI disables 4996 for the entirety of ORT, but it should only be disabled for protobuf.	2024-06-20 16:28:15 -07:00
Ted Themistokleous	1d7bf56947	[MIGraphX EP] enable compilation and execution on Windows (#36 ) (#21084 )	2024-06-20 16:21:11 -07:00
Changming Sun	efcaa835b1	Update generate_nuspec_for_native_nuget.py for training (#21112 ) ### Description Similar to #21096 , but this one is for ORT training nuget package.	2024-06-20 16:13:31 -07:00
Yi-Hong Lyu	00c713088d	Adopt QDQFinalCleanupTransformer for Q->DQs/DQ->Qs cases (#21018 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-06-20 11:21:32 -07:00
Wanming Lin	0c80cd2157	[WebNN EP] Update Prelu restriction for CPU backend (#20878 )	2024-06-20 11:04:01 -07:00
ivberg	55f7f9d7a9	Fix Crash When Enabling and Disabling ETW with Old Callbacks (#21086 ) ### Description Under certain conditions with enabling & disabling ETW continuously, we got a crash report. Allows ETW callbacks to be de-registered upon class destructor. Related to #20537 ### Motivation and Context Fixes crash ### Callstack We see it crash in [0x0] onnxruntime!<lambda_967a738fca8512372f170fcaf2d094d4>::operator()+0x34 0x12941ff570 0x7ffa994f0a04 [0x1] onnxruntime!std::_Func_class<void,_GUID const ,unsigned long,unsigned char,unsigned __int64,unsigned __int64,_EVENT_FILTER_DESCRIPTOR ,void *>::operator()+0x54 0x12941ff7b0 0x7ffa994f0d64 [0x2] onnxruntime!onnxruntime::logging::EtwRegistrationManager::InvokeCallbacks+0xcc 0x12941ff7b0 0x7ffa994f0d64 [0x3] onnxruntime!onnxruntime::logging::EtwRegistrationManager::ORT_TL_EtwEnableCallback+0x94 0x12941ff860 0x7ffa98d19628 and seems to us that the this pointer captured in etwRegistrationManager.RegisterInternalCallback( [&etwRegistrationManager, this]( ... is no longer valid when the callback is called.	2024-06-20 06:45:45 -07:00
Changming Sun	bd3a9ee99d	Add UsePythonVersion (#21109 ) ### Description The machine has multiple python installations and none of them is in PATH. Therefore we should explicitly set python version via this task to avoid having surprises. ### Motivation and Context Similar to #21095	2024-06-19 20:47:21 -07:00
Changming Sun	27f3ac78d4	Delete RoslynAnalyzers (#21104 ) ### Description Delete RoslynAnalyzers. Use CodeQL instead. ### Motivation and Context We already have CodeQL, which is modern and also covers C# code. The RoslynAnalyzers one is not in our pull request pipelines. The "RoslynAnalyzers@2" task is outdated and needs to be upgraded. I will delete it for now since we already have CodeQL.	2024-06-19 20:11:15 -07:00
Chi Lo	e737547862	Add support for INT64 types in TensorRT constant layer calibration (#21101 ) This PR is a duplicate of https://github.com/microsoft/onnxruntime/pull/21041 It was created in case the original one can't be updated in time for the patch release.	2024-06-19 20:36:26 -05:00
Jing Fang	6817b013b9	[MLAS] add q4 quantize and transpose kernel to support MatMulNBits QDQ fuse (#21054 ) ### Description 1. added kernel to quantize matmul B tensor to q4, and store in the same shape as original tensor. scales and zero points are calculated as well. scales and zero points have the same shape. 2. added kernel to transpose q4 B tensor to B tensor in MatMulNBits. Scales and zero points are transposed as well. #### Benchmark <1024 x 4096 input, 64 quant block, 8 threads>: - quantize: 23035923 ns - transpose: 718635 ns <1024 x 4095 input, 64 quant block, 8 threads>: - quantize: 26759319 ns - transpose: 1279064 ns ### Motivation and Context The MatMulNBits tool chain currently only supports converting a MatMul op directly to a MatMulNBits op. The MatMulNBits op is not an ONNX standard op. Therefore, we need the tool chain to support converting MatMul to Q/DQ format, and later, in the transform step, converting DQ + MatMul to MatMulNBits. The tensors stored in DQ are the quantized constants and will be stored in the MatMulNBits.	2024-06-19 17:15:45 -07:00
Jian Chen	8448f31d90	change is_pod to is_trivial (#21071 ) ### Description change is_pod to is_trivial ### Motivation and Context This is commonly needed for both the Linux and Windows C++20 upgrades. is_trivial was introduced back in C++11	2024-06-19 16:23:47 -07:00
Changming Sun	be423747b1	Delete pyop (#21094 ) ### Description Remove the "--enable_language_interop_ops" build flag, because the code is incompatible with the latest numpy, and the build flag is not used anywhere except a macOS CI pipeline. It does not seem to have a ship plan. ### Motivation and Context The build error was: ``` onnxruntime/core/language_interop_ops/pyop/pyop.cc:122:85: error: no member named 'elsize' in '_PyArray_Descr' static_cast<int64_t>(PyArray_DescrFromType(type)->elsize), ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ```	2024-06-19 16:21:33 -07:00
Clément Péron	8ab8e649a7	tools: build: fix typo (#21052 ) ### Description Typo in the python build script	2024-06-19 16:14:58 -07:00
Changming Sun	8b9656717b	Fix a perm issue in Windows Static Analysis pipeline (#21100 ) ### Description Due to a security setting change, we now need to explicitly set the permissions. I forgot to do that when bringing the old change back. ### Motivation and Context Now the pipeline cannot publish scanning results to GitHub	2024-06-19 14:44:39 -07:00
Adrian Lizarraga	3ae5df1d18	[QNN EP] Update QNN SDK to 2.23.0 (#21008 ) ### Description - Updates CI pipelines to use QNN SDK 2.23.0 by default. - QNN SDK adds support for int64 Cast. This allows QNN EP to support ONNX ArgMax/ArgMin/TopK operators that generate an int64 graph output. Example translation of ArgMax: - ONNX: input --> ArgMax --> output (int64) - QNN: input --> ArgMax --> Cast (int32 to int64) --> output (int64) ### Motivation and Context Update onnxruntime to use the latest QNN SDK.	2024-06-19 12:37:42 -07:00
Jian Chen	6a0d64e65c	Component Gov round 7 (#21051 ) ### Description ignoreDirectories does not recursively include sub folders like we thought it would. We need to add additional sub folders. ### Motivation and Context Fix CG : 1. https://aiinfra.visualstudio.com/Lotus/_componentGovernance/218239/alert/11474679?typeId=25427568 2. https://aiinfra.visualstudio.com/Lotus/_componentGovernance/218239/alert/11475140?typeId=25421034&pipelinesTrackingFilter=0	2024-06-19 11:07:02 -07:00
Tianlei Wu	769d379c63	Refactor MultiHeadAttention cpu op (#21055 ) Refactoring of MultiHeadAttention op - [x] Add some checking for cross attention of pass_past_in_kv to make sure there is no kv cache and bias. - [x] Update interface of PackVIntoRotaryQKV so that it can be used by SparseAttention later. - [x] Add test cases ### Motivation and Context To prepare the pull request for SparseAttention cpu op.	2024-06-19 10:23:26 -07:00
Xu Xing	c3076721f3	[js/webgpu] Support conv3d naive (#20706 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-06-19 10:13:50 -07:00
Tianlei Wu	01279d8896	[ROCM] Exclude flash attention from hipify (#21091 ) Exclude flash attention sub-directory from hipify.	2024-06-19 08:59:10 -07:00
Scott McKay	6e742c426e	Update nuget package generation script entries for .net8 MAUI (#21096 ) ### Description <!-- Describe your changes. --> Remove xamarin related entries. Update MAUI entries to net8 Remove macos entries (not required by MAUI) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Updates missed from #21062	2024-06-19 21:10:22 +08:00
Yi Zhang	cc3168bcbb	Add UsePython task in Nuget_Packaging_CPU stage (#21095 ) ### Description supplement of https://github.com/microsoft/onnxruntime/pull/21062 ### Motivation and Context	2024-06-19 21:09:37 +08:00
Peishen Yan	50b49642d5	[WebNN EP] Update triangular_op_builder.cc (#20994 ) As a follow-up of https://github.com/microsoft/onnxruntime/pull/20730	2024-06-19 03:28:34 -07:00
Wanming Lin	40879a2623	[WebNN EP] Enable Cast op for WebNN CPU backend (#20864 ) WebNN TFLite backend supports `cast` op but doesn't support casting to `uint64` data type.	2024-06-19 01:51:19 -07:00
Wanming Lin	35c430a95a	[WebNN EP] Enable several ops for WebNN CPU backend (#20847 ) WebNN CPU implementation has been migrated from XNNPack to TFLite, which supports more ops. As a first step, turn on the subset of `cpu` supported ops that only need the change from `false` to `true`.	2024-06-19 01:45:31 -07:00
Scott McKay	5fc60f36f2	Update to the net8 MAUI targets. Remove Xamarin. (#21062 ) ### Description <!-- Describe your changes. --> Xamarin is EOL so remove support. The MAUI targets are EOL and need updating. https://dotnet.microsoft.com/en-us/platform/support/policy/maui Other cleanups: - netcoreapp3.1 is EOL - the net6 macos target was added in the mistaken belief that it was for MAUI mac support, but that is actually via the mac-catalyst target which we recently added support for. - some CIs that were using the old build setup of splitting pre-net6 targets. The ORT C# bindings csproj was updated last year and the `PreNet6` and `SelectedTargets` properties no longer exist as they were replaced by the simpler `IncludeMobileTargets` property. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Remove EOL components. #21058	2024-06-19 16:20:58 +10:00
Jian Chen	1ad2c0a4b2	fix Window_CI in Github Action (#21070 ) ### Description fix Window_CI in Github Action	2024-06-18 23:14:08 -07:00
cloudhan	ddd4ce3cb7	[ROCm] Update ck to use ck_tile (#21030 )	2024-06-19 14:06:10 +08:00
Yi Zhang	5a0e5237f5	Fix onebranch exception in code signing (#21088 ) ### Description Fix regression caused by https://github.com/microsoft/onnxruntime/pull/20995 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-06-19 12:07:17 +08:00
Yulong Wang	5e81fa8aec	[js] fix vulnerability CVE-2024-4068: upgrade `braces` to 3.0.3 (#21078 ) ### Description Upgrade `braces` to 3.0.3 [CVE-2024-4068](https://github.com/advisories/GHSA-grv7-fg5c-xmjg) ``` # npm audit report braces <3.0.3 Severity: high Uncontrolled resource consumption in braces - https://github.com/advisories/GHSA-grv7-fg5c-xmjg fix available via `npm audit fix` node_modules/braces 1 high severity vulnerability ```	2024-06-18 16:02:08 -07:00
Changming Sun	ffb8e8eb0e	Update build.py: add a comment (#20993 ) ### Description Update build.py: add a comment ### Motivation and Context See the comment.	2024-06-18 13:52:34 -07:00
Yulong Wang	631a2c16be	[js/web] skip default locateFile() when dynamic import is disabled (#21073 ) ### Description skip default `locateFile()` when dynamic import is disabled. This allows the file to work with bundlers to load WebAssembly file correctly if `env.wasm.wasmPaths` is not set.	2024-06-18 12:21:45 -07:00
Changming Sun	b75b2fcdcb	Add MSVC static analyzer back (#21056 ) ### Description Add MSVC static analyzer back. Previously it had a stability issue. It was deleted in #17522 . ### Motivation and Context	2024-06-18 12:10:11 -07:00
Yang Gu	1473d66a00	[js/webgpu] Prefer adapter.info to adapter.requestAdapterInfo (#21065 ) WebGPU is deprecating async adapter.requestAdapterInfo, and replacing it with sync adapter.info. Spec change: https://github.com/gpuweb/gpuweb/pull/4662	2024-06-18 12:02:38 -07:00
Ted Themistokleous	dadd0c451a	[MIGraphX EP] Fix MIGraphX mixed precision run input parameters (#20982 ) See #20643 ### Description Changes the order in which we perform quantization to better support mixed precision, and fixes a bug where input parameters for int8 quantization were not handled correctly. We now perform int8 quantization first on a full-precision input model, then quantize the model to fp16 for the remaining ops that aren't quantized. The former ordering caused us to use a low-precision input, which could insert larger values than intended into the model when int8 quantization was performed. The symptom of this was a failure during the quantization steps. Similarly, input parameters were left uninitialized, resulting in a similar failure during int8 quantization. GPU faults were intermittent but present, as using uninitialized memory created undefined behavior when we started testing more complex models with mixed precision. ### Motivation and Context In some cases we've seen random data and/or invalid values entering compiled ONNX graphs. This is due to input parameters to the MIGraphX graph not being set correctly when mixed precision (int8 + fp16) is used, and to the ordering of quantization steps causing a lower-precision model to be used to perform int8 quantization. In most cases the failure is silent/intermittent. In some cases we've observed GPU faults due to out-of-bounds values being set. This change is required because when an input parameter to the MIGraphX graph is initialized to a large random value and the next operator uses it for indexing, we get undefined behavior and a GPU fault.	2024-06-18 11:18:13 +08:00
Yi Zhang	809cb26ace	Use A100 for LLama2 model test (#21068 ) ### Description ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-06-18 11:04:02 +08:00
Changming Sun	9ef4f1b789	Update pybind11 (#21072 ) ### Description Upgrade pybind11 to the latest as suggested by @gnought in #21063 ### Motivation and Context Recently numpy released a new version, which caused a compatibility issue between the latest numpy version and the latest ONNX Runtime version.	2024-06-17 19:50:57 -07:00
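
The strided calibration commit above (7cf9263ee7) describes reading calibration data in chunks, calculating ranges for intermediate tensors, and accumulating the results into final ranges. The following is a minimal sketch of that idea using plain min/max ranges. It is not the ONNX Runtime implementation: `iter_chunks`, `merge_range`, `accumulate_ranges`, and the sample layout (a dict mapping a tensor name to a non-empty list of floats) are hypothetical names and assumptions made for this example.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample layout: tensor name -> flattened, non-empty list of values.
Sample = Dict[str, List[float]]
Range = Tuple[float, float]


def iter_chunks(samples: Iterable[Sample], chunk_size: int) -> Iterator[List[Sample]]:
    """Yield successive lists of at most chunk_size samples."""
    it = iter(samples)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def merge_range(ranges: Dict[str, Range], name: str, lo: float, hi: float) -> None:
    """Widen the stored (min, max) for name so that it also covers [lo, hi]."""
    if name in ranges:
        old_lo, old_hi = ranges[name]
        lo, hi = min(lo, old_lo), max(hi, old_hi)
    ranges[name] = (lo, hi)


def accumulate_ranges(samples: Iterable[Sample], chunk_size: int) -> Dict[str, Range]:
    """Compute per-tensor (min, max) while holding only one chunk in memory."""
    final_ranges: Dict[str, Range] = {}
    for chunk in iter_chunks(samples, chunk_size):
        # Ranges for this chunk only.
        chunk_ranges: Dict[str, Range] = {}
        for sample in chunk:
            for name, values in sample.items():
                merge_range(chunk_ranges, name, min(values), max(values))
        # Fold the chunk's ranges into the running result; the chunk can then be freed.
        for name, (lo, hi) in chunk_ranges.items():
            merge_range(final_ranges, name, lo, hi)
    return final_ranges
```

Because min and max are associative, merging per-chunk ranges gives the same result as scanning all samples at once, so peak memory depends on `chunk_size` rather than on the total number of calibration samples. Calibration methods that need more than min/max would need a different accumulator.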

11267 commits