onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 04:39:07 +00:00

Author	SHA1	Message	Date
Changming Sun	2cb5781b43	Remove two tests from test_logging_apis.cc (#19100 ) ### Description In some environments the test code has undefined behavior. To prove it, save the following code as test.cpp ```c++ #include <iostream> #include <stdio.h> int main(){ char buf[1024]; int ret = snprintf(buf, sizeof(buf), "%ls","abc"); if(ret <0){ std::cout<< ret<< std::endl; } else{ std::cout<< "OK: ret="<<ret<< std::endl; } return 0; } ``` Then compile it as ``` g++ -DNDEBUG -std=gnu++17 test.cpp -o /tmp/t ``` Or ``` g++ -O2 -DNDEBUG -std=gnu++17 test.cpp -o /tmp/t ``` The first command is without optimization. The second one turns on optimization. Then the outputs are different. When optimization is enabled, the output might be: ``` OK: ret=-1 ``` You cannot explain why it would go to this branch when ret is "-1". It might be a bug of a specific version of GCC. However, at this moment we cannot change the version. It was found in GCC version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC) that is provided by UBI8. RHEL9 doesn't have the problem. snprintf is a builtin function of GCC. So the problem was not related to glibc.	2024-01-12 09:26:28 -08:00
Xavier Dupré	c8399a81fe	Quantization tool: support float 8 with MatMul, support float 16 weights (#18043 ) ### Description Whenever a QuantizeLinear or DequantizeLinear node is added, the type of the weights before quantization must be known so that the scale is created with the expected type. Another option would be to add many CastLike operators, but that would push the burden to the onnxruntime optimizer. The PR tries to avoid changing the signature. To do so, it modifies the scale computation to store the result in a numpy array rather than a python float. The numpy array must be of the same type as the weights to quantize. The PR adds many `assert` statements to check that the type of the scale is not a python type or a float64. These were added to make sure all the code follows the same logic, and they were kept for the first review. DequantizeLinear and QuantizeLinear cannot be tested with onnx==1.15: PR https://github.com/onnx/onnx/pull/5709 is needed to fix shape inference, and PR https://github.com/onnx/onnx/pull/5473 is needed to support QLinearMatMul with float 16. That explains why some tests are disabled with float 16. ### Motivation and Context The current quantization tool assumes every weight is float 32. For large models such as LLAMA, the weights are usually float 16. The quantization tool needs to be able to quantize such weights.	2024-01-12 17:54:55 +01:00
Changming Sun	0e8d4c3d21	Enable Address Sanitizer in CI (#19073 ) ### Description 1. Add two build jobs for enabling Address Sanitizer in CI: one for Windows CPU, one for Linux CPU. 2. Set default compiler flags/linker flags in build.py for normal Windows/Linux/MacOS build. This can help control compiler flags in a more centralized way. 3. All Windows binaries in our official packages will be built with "/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft public symbol server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols). Limitations: 1. On Linux Address Sanitizer ignores RPATH settings in ELF binaries. Therefore once Address Sanitizer is enabled, before running tests we need to manually set LD_LIBRARY_PATH properly, otherwise libonnxruntime.so may not be able to find custom ops and shared EPs. 2. On Linux we also need to set LD_PRELOAD before running some tests (if the main executable, like python, is not built with address sanitizer). On Windows we do not need to. 3. On Windows before running python tests we should manually copy the address sanitizer DLL to the onnxruntime/capi directory, because python 3.8 and above has enabled "Safe DLL Search Mode" that wouldn't use the information provided by the PATH env. 4. On Linux Address Sanitizer found a lot of memory leaks from our python binding code. Therefore right now we cannot enable Address Sanitizer when building ONNX Runtime with python binding. 5. Address Sanitizer itself uses a lot of memory address space and delays memory deallocations, which can easily cause OOM issues in 32-bit applications. For this reason we cannot run all the tests in onnxruntime_test_all in 32-bit mode with Address Sanitizer. However, we can still run individual tests that way. We just cannot run all of them in one process. ### Motivation and Context To catch memory issues.	2024-01-12 07:24:40 -08:00
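The two Linux limitations in the commit above (RPATH being ignored, and LD_PRELOAD being required when the main executable is not sanitizer-built) could be handled by a small launcher helper along the following lines. This is a minimal sketch and not part of the PR: `asan_test_env`, the build directory and the sanitizer runtime path are all hypothetical, and the real location of the runtime library depends on the toolchain.

```python
import os


def asan_test_env(build_dir, asan_runtime=None, base_env=None):
    """Return a copy of the environment prepared for an ASan test run.

    Hypothetical helper; the paths passed in are assumptions of this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    # ASan ignores RPATH, so the build output goes on LD_LIBRARY_PATH explicitly.
    existing = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = build_dir if not existing else build_dir + os.pathsep + existing
    # Only needed when the main executable (e.g. python) is not built with ASan.
    if asan_runtime:
        env["LD_PRELOAD"] = asan_runtime
    return env


# Placeholder paths, for illustration only.
env = asan_test_env("/path/to/build/Release", "/path/to/libasan.so", base_env={})
print(env["LD_LIBRARY_PATH"], env["LD_PRELOAD"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a test binary.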
Changming Sun	285606108a	Set pythonInterpreter in set-python-manylinux-variables-step.yml (#19105 ) ### Description Set pythonInterpreter in set-python-manylinux-variables-step.yml. To fix a build error: ``` Starting: Set Python manylinux variables ============================================================================== Task : Python script Description : Run a Python file or inline script Version : 0.231.1 Author : Microsoft Corporation Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/python-script ============================================================================== ##[error]Parameter 'toolPath' cannot be null or empty. Finishing: Set Python manylinux variables ``` The error was because today I deleted a bunch of software from the VM image. The task might fail if no Python versions are found in $(Agent.ToolsDirectory).	2024-01-12 07:22:02 -08:00
Changming Sun	e3ee255950	Remove the references to CreateFileMapping2 (#19102 ) ### Description Remove the references to CreateFileMapping2 because the function is mainly for system services. To use the function, we need to link to one of the four [Windows umbrella libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries). It's tricky because a custom build might want to use any of the four. So I cannot just choose one and add that one to our CMakeLists.txt. Given it's so complicated and the code is not actually used now, I will remove it. It is not used because it requires NTDDI_VERSION >= NTDDI_WIN10_RS5 but in our top level CMakeLists.txt we set the version to the first Windows 10 release which is lower than RS5.	2024-01-12 07:21:12 -08:00
zesongw	e1db44b4f0	[WebNN EP] Add quantize Ops (#18011 ) ### Description Add four quantize Ops: MatMulInteger, ConvInteger, DynamicQuantizeLinear and DequantizeLinear. Add datatype TensorProto_DataType_INT8 and TensorProto_DataType_UINT8. ### Motivation and Context Support quantized models.	2024-01-12 02:25:09 -08:00
Jiajie Hu	acba63c36a	[js/webgpu] Change A/sqrt(B) to A*inverseSqrt(B) in normalization ops (#19101 ) ### Description Change `A / sqrt(B)` to `A * inverseSqrt(B)` in BatchNormalization, InstanceNormalization, LayerNormalization and SkipLayerNormalization. ### Motivation and Context For the same reason as the existence of the `inverseSqrt` built-in in WebGPU spec.	2024-01-12 00:08:16 -08:00
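The rewrite in the commit above is an algebraic identity: dividing by a square root gives the same value as multiplying by the reciprocal square root, up to floating-point rounding. A minimal sketch of the two forms follows. The function names and the `epsilon` default are hypothetical, and since Python has no `inverseSqrt` built-in, `1.0 / math.sqrt(...)` stands in for it here.

```python
import math


def normalize_div(x, variance, epsilon=1e-5):
    """Original form: A / sqrt(B)."""
    return x / math.sqrt(variance + epsilon)


def normalize_inverse_sqrt(x, variance, epsilon=1e-5):
    """Rewritten form: A * inverseSqrt(B)."""
    inv_std = 1.0 / math.sqrt(variance + epsilon)  # stand-in for WGSL inverseSqrt()
    return x * inv_std


for x, var in [(1.0, 4.0), (-3.5, 0.25), (0.0, 1.0)]:
    print(normalize_div(x, var), normalize_inverse_sqrt(x, var))
```

The sketch only shows that the two forms agree numerically; any speed difference depends on the GPU and shader compiler and is not demonstrated here.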
dependabot[bot]	5373c0c730	Bump follow-redirects from 1.15.2 to 1.15.4 in /js/web (#19068 ) Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.2 to 1.15.4. Commits: - `6585820` Release version 1.15.4 of the npm package. - `7a6567e` Disallow bracketed hostnames. - `05629af` Prefer native URL instead of deprecated url.parse. - `1cba8e8` Prefer native URL instead of legacy url.resolve. - `72bc2a4` Simplify _processResponse error handling. - `3d42aec` Add bracket tests. - `bcbb096` Do not directly set Error properties. - `192dbe7` Release version 1.15.3 of the npm package. - `bd8c81e` Fix resource leak on destroy. - `9c728c3` Split linting and testing. - Additional commits viewable in [compare view](https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-01-11 22:25:50 -08:00
gunandrose4u	e2c145d37f	Add Anubis metrics schema for local benchmark results uploading (#19018 ) ### Description 1. Add metrics.py to define the metrics schema used by Anubis 2. Add two examples (llama2 and whisper) of how to save local benchmark results following the Anubis metrics schema --------- Co-authored-by: Kyle Zhang <Xi.Zhang@microsoft.com> Co-authored-by: ironman <bitzhangxi@outlook.com>	2024-01-12 14:24:01 +08:00
Chi Lo	46dd0d3f52	[TensorRT EP] Load precompiled TRT engine file directly (#18217 ) When the TRT engine cache (precompiled engine) is present, it doesn't make sense to go through model verification, model optimization, TRT EP's GetCapability(), TRT EP's model proto reconstruction, calling the TRT parser, and engine compilation. This PR makes TRT EP skip those steps and directly load the engine to perform inference. The feature request: https://github.com/microsoft/onnxruntime/issues/18072 Features: - Replace the original model with a TRT engine wrapped ONNX model. It can save a lot of time as mentioned above. - How to get a TRT engine wrapped ONNX model? 1. Set the `trt_dump_ep_context_model` provider option to "true" and run the inference. You will find the "xxx_wrapper.onnx" at the engine cache path. (The same logic as generating the engine cache) 2. Use gen_trt_engine_wrapper_onnx_model.py - Three provider options are added: `trt_dump_ep_context_model`: Enable dumping of the wrapped ONNX model by TRT EP. `trt_ep_context_embed_mode`: Add embed_mode as an attribute. 0 means engine cache path, 1 means engine binary data. `trt_ep_context_compute_capability_enable`: Add hardware_arch as an attribute. When running the model, TRT EP will check consistency between the model's hardware_arch and the GPU's compute capability. - When the engine cache path is given in the wrapped model, TRT EP will first search for the engine file using the path relative to the model path; if it can't find it, it will use the path as it is (this depends on the user: it could be relative to the working dir or an absolute path). Note: 1. This PR includes the change of https://github.com/microsoft/onnxruntime/pull/17751 Constraints: 1. The whole model should be fully supported by TRT. 2. Users need to make sure the engine is built with min/max/opt optimization profiles that are large enough to cover the range of all inputs. TRT EP will simply fail and won't rebuild the engine if the input shape is out of range during runtime.	2024-01-11 22:20:54 -08:00
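The engine file lookup order described in the commit above (first relative to the model path, then the path as given) can be sketched as follows. This is an illustration, not the TRT EP code: `resolve_engine_path` is a hypothetical name, and "relative to the model path" is assumed here to mean relative to the directory containing the wrapped model.

```python
from pathlib import Path


def resolve_engine_path(model_path, engine_cache_path):
    """Hypothetical helper mirroring the lookup order described above."""
    # Step 1: try the path relative to the directory of the wrapped model.
    candidate = Path(model_path).parent / engine_cache_path
    if candidate.is_file():
        return candidate
    # Step 2: fall back to the path as given (relative to the working
    # directory, or absolute; that is up to the user).
    return Path(engine_cache_path)


# Placeholder file names, for illustration only.
print(resolve_engine_path("/models/xxx_wrapper.onnx", "engine_cache/model.engine"))
```

An absolute `engine_cache_path` behaves the same in both steps, because joining a directory with an absolute path yields the absolute path.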
Ye Wang	b6d82834d4	add bfp16 to gqa (#19095 )	2024-01-11 20:53:31 -08:00
dependabot[bot]	189be8e997	Bump follow-redirects from 1.15.2 to 1.15.4 in /onnxruntime/test/wasm (#19069 ) Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.2 to 1.15.4. Commits: - `6585820` Release version 1.15.4 of the npm package. - `7a6567e` Disallow bracketed hostnames. - `05629af` Prefer native URL instead of deprecated url.parse. - `1cba8e8` Prefer native URL instead of legacy url.resolve. - `72bc2a4` Simplify _processResponse error handling. - `3d42aec` Add bracket tests. - `bcbb096` Do not directly set Error properties. - `192dbe7` Release version 1.15.3 of the npm package. - `bd8c81e` Fix resource leak on destroy. - `9c728c3` Split linting and testing. - Additional commits viewable in [compare view](https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-01-11 16:13:22 -08:00
Aditya Goel	d8962d67f4	RegexFullMatch operator (#18002 ) ### Motivation and Context Closes https://github.com/microsoft/onnxruntime/issues/17594.	2024-01-11 15:50:07 -08:00
Jeff Bloomfield	08cf4fbcad	Handle all float types in IsQDQPairSupported (#19085 ) ### Description This makes detection of identical QDQ scales work with float16 and bfloat16 rather than failing. ### Motivation and Context This addresses failures in customer models	2024-01-11 15:16:44 -08:00
Christian Larson	8a0a972f39	Update DML EP to accept broadcasted tensor of size 1 to match CPU (#19081 ) ### Description With QDQ enabled for Dml EP we are seeing some models not optimize constant nodes with incorrect tensor size of scale[1] and zeropoint[1] that does not match the input size. CPU accepts this parameter type so updating Dml EP to match CPU behavior. ### Motivation and Context Want to match CPU EP behavior. --------- Co-authored-by: Christian Larson <28911437+chrilaMSFT@users.noreply.github.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-01-11 15:15:51 -08:00
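The behavior described in the commit above, accepting a scale and zero point of size 1 and broadcasting them across the input, can be illustrated with plain Python lists. This is a simplified sketch and not the DML EP or CPU EP code: `dequantize_linear` is a hypothetical helper that works on flat lists and ignores the per-axis case.

```python
def dequantize_linear(values, scale, zero_point):
    """Dequantize a flat list; scale/zero_point have length 1 or len(values)."""
    for name, param in (("scale", scale), ("zero_point", zero_point)):
        if len(param) not in (1, len(values)):
            raise ValueError(f"{name} must have size 1 or match the input size")

    def pick(param, i):
        # A size-1 parameter is broadcast to every element.
        return param[0] if len(param) == 1 else param[i]

    return [(v - pick(zero_point, i)) * pick(scale, i) for i, v in enumerate(values)]


print(dequantize_linear([10, 12, 14], scale=[0.5], zero_point=[10]))
```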
Maximilian Müller	daa22f919f	[TensorRT] query GPU properties only once when setting device_id (#19092 ) ### Description For most models this does not show significant overhead but for very small models it shows significant impact. Attached screenshot shows impact when only using 2 FC layers: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/b4fdf8bf-0422-43ab-a49e-7d2996cba68e)	2024-01-11 13:37:10 -08:00
ivberg	4d1243b4b4	ORT ETW dynamic logging that improves ORT diagnosability & performance (#18882 ) ### Description This PR has several combined ORT ETW changes that improve ORT log diagnosability & performance. - The existing log behavior in the ORT API and Severity behavior remain the same as compiled by the dev using the ORT API - The PR keeps the existing design which has 2 TraceLogging providers defined (although both were not used before this PR) - Keeps great inference (inf) and session load performance even with dynamic logging enabled (see below) - On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, then ORT will dynamically _add_ a new sink reflecting the severity level provided by ETW dynamically, e.g. Critical through Verbose, as needed at runtime - This allows previous printf style LOGS() statements both for CPU and NPU cases to flow to ETW via a local trace (if enabled) - This DOES NOT add any new Telemetry which can optionally be sent to Microsoft. - Telemetry are ETW events marked with the Measure keyword that _can_ be sampled if a box opts-in - Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and levels added if they were missing - If Execution Providers (EPs) can provide more detailed insight into their HW or component, then this PR allows for those to be dynamically logged instead of just at compile time - In this PR, the QNN EP for QC NPUs can have basic or detailed profiling enabled to give some insight into how the NPU is performing - When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the Profiling keyword and level 5 then QC QNN basic profiling info is output to ETW ### Motivation and Context - This makes ORT logging and diagnosability more performant (on Windows) and available in a wider variety of runtime environments.
- The performance difference for inf times was ~300x+ drastically better/faster when these logs were output to ETW vs just stdout (Verbose Severity) - This style of ETW dynamic tracing is the widely used standard for Windows components, and even by some 3rd party software since the ETW API is open and part of the Windows API - These ETW based logs can be seamlessly combined with other ETW logs such as an AI component/feature using ORT, OS CPU profiling, scheduling, and more - Before the PR, ORT logging is largely printf style and output to a sink (usually stdout) only if compiled with a certain log Severity. When enabled the previous logging (to stdout) would vastly slow down inference times. Once compiled for release the internal ORT logs were not accessible by anyone except the AI model developer in their dev inner loop. That means logs could not be enabled on a lab machine, or on a production system where the runtime behavior or performance might be different for various reasons on a wide variety of HW. - This change was tested with performance in mind and tested with a mobilenet small AI model with onnxruntime_perf_test - CPU: There was no statistical difference for inf times with the baseline (main) or this PR whether ETW was enabled or not (both ORT providers all keywords level 5). - NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical difference for inf times with this PR whether ETW (both ORT providers all keywords) were enabled or not for Level 5 (Verbose). This is even with QNN Basic profiling turned on and outputting NPU stats to ETW - As expected and designed, there was perf slowdown when Max Level 255 is enabled which translates to QNN Detailed profiling. This mirrors the expected slowdown in the NPU when individual model operations are monitored & recorded as well. This perf is similar to the QNN SDK Detailed profiling performance separate from this PR. 
This is designed to be above level 5 (verbose) as that is commonly the max level used in many trace profiles and won't affect inf performance. - Other OSes such as Linux & Android are left untouched for now. - Out of scope for this PR but TraceLogging is available for Linux with LTTng tracing. So in the future, this optional tracing could also be made available on other OSes where a TraceLogging API is available	2024-01-11 12:43:27 -08:00
Guenther Schmuelling	d0bac8216d	[js/webgpu] fix bcast in where (#19009 )	2024-01-11 12:13:24 -08:00
Jian Chen	53497702a6	Fix Nuget CUDA Packaging pipeline (#19054 ) --------- Co-authored-by: Yi Zhang <zhanyi@microsoft.com>	2024-01-11 11:59:21 -08:00
RandySheriffH	24e9daf707	Enrich cuda resources with ep options (#19014 ) Allow custom ops to access cuda ep options. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2024-01-11 10:56:07 -08:00
Baiju Meswani	58bf836592	Offline tooling for training to use reduction with keepdims=False (#19027 )	2024-01-11 10:51:23 -08:00
Aditya Goel	4694edcd41	String concat operator (#17994 ) ### Motivation and Context Closes https://github.com/microsoft/onnxruntime/issues/17595. --------- Signed-off-by: Aditya Goel <agoel4512@gmail.com>	2024-01-11 10:01:43 -08:00
Hariharan Seshadri	f68dfcd888	[CUDA] Improve performance of DecoderMaskedMultiheadAttention on A100 (#18695 ) ### Description Currently there are 2 memory latency bound hotspots in the DecoderMaskedMultiheadAttention kernel in terms of reading from global memory - one reading K values and the other reading V values. The current logic to read them both is something like this - for(int i=0; i<all_time_steps; ++i) { auto data_in_register = load_chunk_from_global_memory(i); do_compute(data_in_register); } This incurs a data read stall, because data needs to be fetched into the registers before compute can begin, and it also does not fully utilize the memory bandwidth of A100. The above logic can be re-written by doing some manual loop unrolling so that more data reads are triggered "in flight". Unroll factor: 4 for(int i=0; i<all_time_steps; i+=4) { auto data_in_register_0 = load_chunk_from_global_memory(i); // Do bounds check for the following auto data_in_register_1 = load_chunk_from_global_memory(i+1); auto data_in_register_2 = load_chunk_from_global_memory(i+2); auto data_in_register_3 = load_chunk_from_global_memory(i+3); do_compute(data_in_register_0); // Do bounds check for the following do_compute(data_in_register_1); do_compute(data_in_register_2); do_compute(data_in_register_3); } The idea is that the memory read latency is hidden by instructions being issued for subsequent data reads.
See here for more details - https://forums.developer.nvidia.com/t/global-memory-access-synchronous-or-asynchronous-read-write/3256/4 Kernel clock cycles, latency, and memory bandwidth usage before: <img width="1210" alt="image" src="https://github.com/microsoft/onnxruntime/assets/9969784/7a1f41f9-fdaa-47b3-b629-996d7b5eef17"> Kernel clock cycles, latency, and memory bandwidth usage after: <img width="1205" alt="image" src="https://github.com/microsoft/onnxruntime/assets/9969784/c76b2d2f-43e3-43c9-a710-b5fae76f69b6"> As can be seen, the kernel latency is better by >30% and memory throughput is better by >14%. We have a 1P customer using the Whisper model (sampling using BeamSearch) and the E2E perf improvement for a representative production input is > 6.5%. Whisper E2E Latency for sample input before (on A100): <img width="194" alt="image" src="https://github.com/microsoft/onnxruntime/assets/9969784/84ef59f5-84f2-4277-b9f8-b04c27336642"> Whisper E2E Latency for sample input after (on A100): <img width="191" alt="image" src="https://github.com/microsoft/onnxruntime/assets/9969784/ca9fe5d3-f726-403e-b27c-be4ee07e0625"> This feature of loading more data in flight may not always yield gains and it will be workload dependent. For now, the feature is kept turned OFF by default. It can be turned ON by the user when needed. ### Motivation and Context Improve BeamSearch performance on CUDA EP	2024-01-11 09:19:12 -08:00
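The control flow of the unroll-by-4 rewrite in the commit above can be sketched in Python, with the bounds checks that the pseudocode only mentions in comments written out. This shows the loop structure only: the latency hiding comes from the GPU issuing several global-memory loads before the first compute, which Python does not reproduce. `load_chunk` and `compute` are hypothetical callables standing in for `load_chunk_from_global_memory` and `do_compute`.

```python
def run_simple(load_chunk, compute, all_time_steps):
    """Baseline: each load is immediately followed by its compute."""
    for i in range(all_time_steps):
        compute(load_chunk(i))


def run_unrolled(load_chunk, compute, all_time_steps):
    """Unroll factor 4: issue up to four loads, then do the four computes."""
    n = all_time_steps
    for i in range(0, n, 4):
        # Loads first; every index after the first is bounds-checked.
        d0 = load_chunk(i)
        d1 = load_chunk(i + 1) if i + 1 < n else None
        d2 = load_chunk(i + 2) if i + 2 < n else None
        d3 = load_chunk(i + 3) if i + 3 < n else None
        # Computes afterwards, with the same bounds checks.
        compute(d0)
        if i + 1 < n:
            compute(d1)
        if i + 2 < n:
            compute(d2)
        if i + 3 < n:
            compute(d3)


results = []
run_unrolled(lambda i: i * 10, results.append, 6)
print(results)
```

Both functions apply `compute` to the same chunks in the same order; only the interleaving of loads and computes differs.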
Jian Chen	2eb3db6bf0	Adding python3.12 support to ORT (#18814 ) ### Description Adding python3.12 support to ORT	2024-01-11 08:34:28 -08:00
Jiajia Qin	a89db01fce	[js/webgpu] disable GroupedConvVectorize path (#19090 ) Disable the createGroupedConvVectorizeProgramInfo path due to bot failures on the following two cases: [webgpu]Conv - conv - vectorize group - B [webgpu]Conv - conv - vectorize group - D	2024-01-11 08:13:14 -08:00
dependabot[bot]	f11713702f	Bump follow-redirects from 1.15.2 to 1.15.4 in /js/node (#19070 ) Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.2 to 1.15.4. Commits: - `6585820` Release version 1.15.4 of the npm package. - `7a6567e` Disallow bracketed hostnames. - `05629af` Prefer native URL instead of deprecated url.parse. - `1cba8e8` Prefer native URL instead of legacy url.resolve. - `72bc2a4` Simplify _processResponse error handling. - `3d42aec` Add bracket tests. - `bcbb096` Do not directly set Error properties. - `192dbe7` Release version 1.15.3 of the npm package. - `bd8c81e` Fix resource leak on destroy. - `9c728c3` Split linting and testing. - Additional commits viewable in [compare view](https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-01-10 22:08:14 -08:00
pengwa	d03e477b90	Fix missing subgraph candidates for recompute (#19077 ) ### Fix missing subgraph candidates for recompute For subgraphs such as `MatMul+Transpose+Reshape`, the ending node is a Reshape, which in ORT reuses its input buffer. The current subgraph detection logic has a defect and, as a result, those subgraphs are missed as recompute candidates. This PR also adds a few more node types for recompute support. TODO: add unit test later. This PR is needed for a customer model now.	2024-01-11 12:50:55 +08:00
Yulong Wang	0a0ef958eb	update .vscode/settings.json (#19084 ) ### Description `"explicit"` now replaces `true` as the value of the config entry "source.organizeImports". The latest VSCode modifies this config automatically.	2024-01-10 19:26:01 -08:00
Changming Sun	053ddfe3fd	Disable per-session thread pool for web (#18480 ) ### Description ORT web prefers to use a global thread pool for all inference sessions. See how OrtCreateSession is implemented in https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/wasm/api.cc#L183 . Application code can only use the global thread pool. However, internal testing code still often uses a per-session thread pool. This PR fixes the inconsistency. ### Motivation and Context Replace PR #18476	2024-01-10 18:45:49 -08:00
Yvonne Chen	5678317baf	Fix the duplicated QDQ attributes setup issue (#18039 ) ### Description The copied QDQ node should have exactly the same attributes as the original QDQ node. Otherwise, it might cause errors when the original node has attributes with non-default values (such as the axis != 1 case). An example use case: a DequantizeLinear node has more than one consumer in the graph, and its `axis` attribute is 0. ### Motivation and Context I see errors like https://github.com/microsoft/onnxruntime/issues/16188 and this fix could solve the issue.	2024-01-10 18:36:33 -08:00
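The attribute-preservation requirement described in the commit above can be illustrated with a minimal sketch. It is not the ONNX Runtime implementation: `Node` and `duplicate_for_consumers` are hypothetical names invented for this illustration, and the `_dup{i}` output naming is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a graph node; not an ONNX Runtime type."""
    op_type: str
    inputs: list
    outputs: list
    attributes: dict = field(default_factory=dict)


def duplicate_for_consumers(node, num_consumers):
    """Return one copy of `node` per consumer.

    Each copy gets its own output names (assumed naming scheme) but carries
    a deep copy of the original attributes, so a non-default value such as
    axis=0 is not silently reset to the operator's default.
    """
    copies = []
    for i in range(num_consumers):
        copies.append(Node(
            op_type=node.op_type,
            inputs=list(node.inputs),
            outputs=[f"{name}_dup{i}" for name in node.outputs],
            attributes=copy.deepcopy(node.attributes),
        ))
    return copies


dq = Node("DequantizeLinear", ["x", "x_scale", "x_zero_point"], ["y"], {"axis": 0})
dq_copies = duplicate_for_consumers(dq, 2)
```

Copying the attribute dictionary, rather than rebuilding the node from its operator type alone, is what keeps `axis` at 0 on every duplicate.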
Jiajia Qin	fd6bab4250	[js/webgpu] Provide a vectorized algorithm for GroupedConv (#18884 ) ### Description This PR provides a vectorized algorithm for NHWC GroupedConv to improve performance. The aggregate time of GroupedConv in mobilenetv2-12 becomes ~1ms from ~4ms on Intel Alder Lake machine. About 20% improvement for the whole model.	2024-01-10 16:12:43 -08:00
Yifan Li	e58319ebfc	[TensorRT EP] Fix memleak (#19053 ) ### Description To fix memleak: ```bash 192 bytes in 1 blocks are definitely lost in loss record 1,254 of 1,999 at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x4A93FD5: OrtApis::CreateTensorRTProviderOptions(OrtTensorRTProviderOptionsV2) (in /code/onnxruntime/build/Linux/Release/libonnxruntime.so.1.17.0) by 0x1502E1: onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession(Ort::Env&, std::random_device&, onnxruntime::perftest::PerformanceTestConfig const&, TestModelInfo const&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test) by 0x15A404: onnxruntime::perftest::PerformanceRunner::PerformanceRunner(Ort::Env&, onnxruntime::perftest::PerformanceTestConfig const&, std::random_device&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test) by 0x14C6D9: real_main(int, char) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test) by 0x145A2A: main (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test) ``` Add a pointer to help release the TensorRT EP provider options.	2024-01-10 15:29:34 -08:00
yuwenzho	731b50dfc4	Support INT4 weight only quantize, including RTN and GPTQ 2 algorithms (#17390 ) ### Description Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor with two algorithms, RTN and GPTQ. Note: Please install `neural-compressor==2.3` for weight-only quantization. ### Motivation and Context As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Compared to normal quantization like W8A8, weight-only quantization is probably a better trade-off between performance and accuracy. RTN is the most straightforward way to quantize weights. The GPTQ algorithm provides more accurate quantization but requires more computational resources. ### Evaluation results The following table shows the accuracy results of Llama-2 models evaluated on the [lambada_openai](https://huggingface.co/datasets/lambada) task. `GPTQ W4G32Asym` in the configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization, with group_size=32 and scheme=asym. 
<table class="tg"> <thead> <tr> <th rowspan="2">Model name</th> <th rowspan="2">Configuration</th> <th colspan="2">Lambada_openai</th> <th rowspan="2">Accuracy Ratio<br>[WOQ/FP32]</th> </tr> <tr> <th>Accuracy</th> <th>Perplexity</th> </tr> </thead> <tbody> <tr> <td rowspan="2">meta-llama/Llama-2-7b-chat-hf</td> <td>FP32</td> <td>0.7058</td> <td>3.2788</td> <td>/</td> </tr> <tr> <td>GPTQ<br>W4G32Asym</td> <td>0.7025</td> <td>3.4489</td> <td>99.53%</td> </tr> <tr> <td rowspan="2">meta-llama/Llama-2-7b-hf</td> <td>FP32</td> <td>0.7392</td> <td>3.3950</td> <td>/</td> </tr> <tr> <td>GPTQ<br>W4G32Asym</td> <td>0.7326</td> <td>3.5286</td> <td>99.11%</td> </tr> <tr> <td rowspan="2">meta-llama/Llama-2-13b-chat-hf</td> <td>FP32</td> <td>0.7312</td> <td>2.9163</td> <td>/</td> </tr> <tr> <td>GPTQ<br>W4G128Asym</td> <td>0.7289</td> <td>3.0061</td> <td>99.56%</td> </tr> <tr> <td rowspan="2">meta-llama/Llama-2-13b-hf</td> <td>FP32</td> <td>0.7677</td> <td>3.0438</td> <td>/</td> </tr> <tr> <td>GPTQ<br>W4G32Asym</td> <td>0.7607</td> <td>3.1562</td> <td>99.09%</td> </tr> <tr> <td rowspan="2">meta-llama/Llama-2-70b-chat-hf</td> <td>FP32</td> <td>0.7543</td> <td>2.6181</td> <td>/</td> </tr> <tr> <td>RTN<br>W4G32Sym</td> <td>0.7489</td> <td>2.6850</td> <td>99.28%</td> </tr> <tr> <td rowspan="2">meta-llama/Llama-2-70b-hf</td> <td>FP32</td> <td>0.7964</td> <td>2.6612</td> <td>/</td> </tr> <tr> <td>RTN<br>W4G32Sym</td> <td>0.7896</td> <td>2.7546</td> <td>99.15%</td> </tr> </tbody> </table> --------- Signed-off-by: yuwenzho <yuwen.zhou@intel.com> Co-authored-by: Wang, Mengni <mengni.wang@intel.com>	2024-01-10 15:13:04 -08:00
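To make the RTN (round-to-nearest) idea from the commit above concrete, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization. It is an illustration under stated assumptions, not the Intel Neural Compressor or ONNX Runtime implementation: the function names are hypothetical, the quantization range is widened to include 0.0 (a common convention assumed here), and real implementations operate on tensors and pack two 4-bit values per byte.

```python
def rtn_quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Returns (quantized_ints, scale, zero_point). Hypothetical helper.
    """
    qmax = (1 << bits) - 1  # largest unsigned code, 15 for 4 bits
    # Assumption: widen the range to include 0.0 so the zero point
    # always falls inside [0, qmax].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    # An all-zero group would give scale 0; fall back to 1.0.
    scale = (w_max - w_min) / qmax or 1.0
    zero_point = round(-w_min / scale)
    # Round each weight to the nearest code and clamp to the valid range.
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def rtn_dequantize_group(q, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def rtn_quantize(weights, group_size=32, bits=4):
    """Quantize a flat weight list group by group (e.g. W4G32Asym)."""
    return [
        rtn_quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


example_weights = [i / 10 - 1.0 for i in range(64)]
example_groups = rtn_quantize(example_weights, group_size=32)
```

Each group stores its own scale and zero point, which is why smaller group sizes tend to track the weights more closely at the cost of extra metadata. GPTQ differs in that it adjusts the remaining weights to compensate for rounding error instead of rounding each weight independently.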
RandySheriffH	df116b82c7	Custom op API for thread pool (#18980 ) Allow custom op to invoke internal thread-pool for parallelism. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2024-01-10 14:13:25 -08:00
Xavier Dupré	cf78d01546	remove use of ai.onnx.ml in test for custom ops and local functions (#19043 ) ### Description QNN_Nuget_Windows does not allow ai.onnx.ml operators but the test test_custom_op_local_function is using LabelEncoder. The operator can be removed as the test is only checking custom ops api. ### Motivation and Context Fix test test_custom_op_local_function in QNN_Nuget_Windows pipeline.	2024-01-10 16:36:50 +01:00
PeixuanZuo	5f3113ecd6	[ROCm] Fix hipify error: fast_divmod.h: No such file or directory (#19060 ) Fix error: ``` [ 48%] Built target onnxruntime_optimizer In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5: /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory 11 | #include "core/providers/rocm/shared_inc/fast_divmod.h" | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ compilation terminated. ``` This error is due to onnxruntime_optimizer missing dependencies on hipify generated files.	2024-01-10 14:49:19 +08:00
Xu Xing	ed0f26d3d4	[js/webgpu] Revert parse norm attributes (#19074 ) This resolves the below build errors: ``` lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'? 19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm'; ~~~~~~~~~~~~~~~~~~~~~~~~~~~ lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read. 19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm'; ~~~~~~~~~~~~~~~~~~~~~~~~~~~ lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'. 20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm'; ~~~~~~~~~~~~~~~~~~~~~~~~ lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read. 20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm'; ```	2024-01-09 20:58:50 -08:00
Baiju Meswani	730df1bfa2	Increase MacOS pipeline timeout (#19072 )	2024-01-09 18:35:21 -08:00
Changming Sun	b25980c011	Disable rust pipeline for now (#19067 ) ### Description They are not working. When we have time to continue working on it, we can restore them from git history.	2024-01-09 17:09:31 -08:00
Wanming Lin	fa14dcd2b6	[WebNN EP] Support subgraph of the control flow nodes (#18923 ) This PR also processes the subgraph's initializers. A subgraph doesn't contain all of its required initializers; some common initializers are stored in its ancestor graphs. We need to collect all required initializers and re-map them to the subgraph.	2024-01-09 15:07:54 -08:00
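The initializer collection described in the commit above can be sketched as a walk up the chain of ancestor graphs. This is an illustrative sketch only, not the WebNN EP code: the `Graph` class, its fields, and `collect_initializers` are hypothetical names, and the rule that the nearest enclosing graph wins when a name appears more than once is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Graph:
    """Hypothetical stand-in for a graph; not an ONNX Runtime type."""
    initializers: dict            # name -> tensor data owned by this graph
    required: list                # initializer names this graph's nodes consume
    parent: Optional["Graph"] = None


def collect_initializers(subgraph):
    """Resolve each required initializer from the subgraph or its ancestors.

    The search starts at the subgraph and moves outward, so the nearest
    definition wins (assumed shadowing rule). Names that are never found
    are left out of the result.
    """
    resolved = {}
    for name in subgraph.required:
        graph = subgraph
        while graph is not None:
            if name in graph.initializers:
                resolved[name] = graph.initializers[name]
                break
            graph = graph.parent
    return resolved


root = Graph(initializers={"W": [1.0, 2.0], "B": [0.5]}, required=[])
loop_body = Graph(initializers={"B": [9.0]}, required=[], parent=root)
if_branch = Graph(initializers={"C": [3.0]}, required=["W", "B", "C", "X"], parent=loop_body)
remapped = collect_initializers(if_branch)
```

The resulting dictionary plays the role of the "re-mapped" initializer set: everything the subgraph needs is available in one place, regardless of which ancestor originally stored it.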
Xu Xing	76dfe5347c	[js/webgpu] Support uniforms for instance-norm (#18929 ) Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>	2024-01-09 14:56:00 -08:00
Milos Puzovic	37ac9d391c	Enable Arm Compute Library 23.08 (#17672 ) ### Description This PR enables onnxruntime to build with the most recent release of Arm Compute Library. ### Motivation and Context The latest version of Arm Compute Library that onnxruntime can build with is 20.02, which is more than 3 years old.	2024-01-09 14:10:25 -08:00
Changming Sun	a2afd92093	Format TS code (#19066 ) ### Description Format code	2024-01-09 13:41:10 -08:00
Ashwini Khade	897a4163d7	Update transformer version for training CIs (#19046 ) ### Description Updating version to resolve security vulnerability.	2024-01-09 12:00:34 -08:00
Yifan Li	574c7caf3a	[TensorRT EP] Clear constrain of trt plugin with different input type (#19044 ) ### Description Add heterogeneous support to skip this check for TRT plugin which has different input tensor types	2024-01-09 10:29:06 -08:00
zesongw	ad6dd0a597	[WebNN] Enable npm unit tests (#18486 ) ### Description - Support more test cases for WebNN EP in suite-test-list.jsonc - Add DISABLE_WEBNN flag in build.ts in preparation for the WebNN EP release - Add test option '--webnn-device-type' in test-runner-args-cli.ts to support running the WebNN 'gpu' deviceType - Use Chrome Stable as the default browser for WebNN testing to unblock the CI limitation.	2024-01-09 10:10:57 -08:00
Xu Xing	557ac74c05	[js/webgpu] Support gemm uniforms (#19056 )	2024-01-09 09:57:06 -08:00
Xu Xing	42ba2aed54	[js/webgpu] Support pad uniforms (#19057 )	2024-01-09 09:34:56 -08:00
Xu Xing	eb92681bfb	[js/webgpu] Support range uniforms (#19055 )	2024-01-09 09:33:57 -08:00
junchao-loongson	c1367ae553	Sqnbitgemm: add loongarch64 code path (#18775 ) ### Description Add support code for loongarch64 platform in sqnbitgemm ``` 100% tests passed, 0 tests failed out of 7 Total Test time (real) = 116.99 sec 2023-12-11 10:43:21,287 build [INFO] - Build complete ```	2024-01-09 09:20:45 -08:00

10337 commits