12332 commits

Author SHA1 Message Date
aa4e8e31dc Quick fix, referencing a known issue from the official repo (onnxruntime) 2014-08-26 20:50:00 +02:00
b98cacd5ed Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-08-26 17:07:00 +02:00
1ed9494329 Refactor for clarity, might break a few tests though (onnxruntime) 2014-07-22 02:59:00 +02:00
c5c4cdab9b Late-night bugfix on financial RL environment (onnxruntime) 2014-07-22 22:21:00 +02:00
d9111330b2 Late-night bugfix on financial RL environment (onnxruntime) 2014-07-22 21:53:00 +02:00
16db96a5ee Late-night bugfix on financial RL environment (onnxruntime) 2014-07-11 22:08:00 +02:00
5c7e0eaadb Refactor for clarity, might break a few tests though (onnxruntime) 2014-06-06 07:20:00 +02:00
6583d7a9e0 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-06-06 21:06:00 +02:00
07cff3567b Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-06-06 06:21:00 +02:00
8eb35e2ef1 Trying out boneh-franklin approach for IBE (ref. 2003 paper) (onnxruntime) 2014-05-21 18:29:00 +02:00
b01237e408 Quick fix, referencing a known issue from the official repo (onnxruntime) 2014-03-27 20:11:00 +01:00
16e98e7844 Implementing approach from a new paper read last night (onnxruntime) 2014-03-27 20:33:00 +01:00
7d74e4142f Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2014-03-27 02:30:00 +01:00
bbf6c15c38 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-03-13 20:45:00 +01:00
dd57ce7d25 Quick fix, referencing a known issue from the official repo (onnxruntime) 2014-03-13 05:30:00 +01:00
3b6a120e7b Refactor for clarity, might break a few tests though (onnxruntime) 2014-03-13 00:23:00 +01:00
95e97d9aeb Implementing approach from a new paper read last night (onnxruntime) 2014-03-13 03:47:00 +01:00
f8bc8f7d01 Late-night bugfix on financial RL environment (onnxruntime) 2014-02-26 08:47:00 +01:00
0d3fd2dae5 Implementing approach from a new paper read last night (onnxruntime) 2014-02-03 02:30:00 +01:00
8870016009 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-01-15 07:10:00 +01:00
ba8a97128a Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2014-01-03 05:27:00 +01:00
0b529cce22 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-01-03 08:37:00 +01:00
353bef6887 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-01-03 21:38:00 +01:00
77ccf0441d Late-night bugfix on financial RL environment (onnxruntime) 2014-01-03 05:27:00 +01:00
Yifan Li
0274b7b82f
fix on trtCudaVersion (#23616)
### Description
TensorRT 10.8 zip file has suffix of cuda-12.8, not 12.6


2025-02-08 14:20:00 -08:00
Yulong Wang
740e9ab9f8
update run CI script (#23621)
### Description

Add `Win_TRT_Minimal_CUDA_Test_CI`.
2025-02-08 12:28:50 -08:00
shaoboyan091
5ef18328bf
[WebGPU] Support PIX Capture for WebGPU EP (#23192)
PIX Capture tool requires 'present' to end a frame capture. ORT doesn't
have rendering work so no 'present' happens.

To avoid endless waiting for PIX capture tool, this PR added a blank
surface and 'present' on it in each session run.

The surface is created in WebGPU ep constructor and closed in WebGPU ep
destructor.

2025-02-08 02:05:15 -08:00
Javier Martinez
01145511b1
Fix for C4267 warning (#23610)
### Description
A recent
[commit](1fce51b3b2)
is causing an OVEP warning in
[openvino_provider_factory.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/openvino/openvino_provider_factory.cc#L151).
This PR fixes the warning.

### Motivation and Context
Minor fix
2025-02-07 23:01:28 -08:00
Hector Li
002916acb0
Validate the context_file_path before EP compile graphs (#23611)
Validate the context_file_path before the EP compiles graphs so that it fails fast. This avoids the possibility that the EP generates a new file (context binary file or blob file) that overwrites an existing file. Return an error if the path points to a folder.
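
The fail-fast check described above can be illustrated with a minimal Python sketch. It is not the ORT implementation: `validate_context_file_path` is a hypothetical helper, and rejecting an already existing file is an assumption about how overwrites could be avoided.

```python
import os
import tempfile


def validate_context_file_path(path: str) -> None:
    """Raise before any compilation work if `path` is unusable as an output file.

    Hypothetical helper for illustration only.
    """
    if not path:
        raise ValueError("context file path is empty")
    if os.path.isdir(path):
        # The commit message states that a folder path is an error.
        raise IsADirectoryError(f"context file path points to a folder: {path}")
    if os.path.exists(path):
        # Assumption: refuse to overwrite rather than silently replace the file.
        raise FileExistsError(f"refusing to overwrite existing file: {path}")


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "model_ctx.onnx")
    open(existing, "w").close()
    fresh = os.path.join(tmp, "new_ctx.onnx")

    results = {}
    for name, candidate in [("folder", tmp), ("existing", existing), ("fresh", fresh)]:
        try:
            validate_context_file_path(candidate)
            results[name] = "ok"
        except OSError as exc:
            results[name] = type(exc).__name__
```

Running the check before compilation means a bad path costs nothing beyond a few filesystem calls.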
2025-02-07 21:31:11 -08:00
Jie Chen
0887e3694a
[webgpu] Use pushErrorScope()/popErrorScope() once for an inference run (#23438)
The CPU walltime of waiting for PopErrorScope is non-trivial, and also
validation errors are not expected to happen in Release build.

2025-02-07 13:52:58 -08:00
microsoft-github-policy-service[bot]
65008cbb73
Auto-generated baselines by 1ES Pipeline Templates (#23603) 2025-02-06 17:06:29 -08:00
Tianlei Wu
09e5724f3b
[CUDA] Fix beam search of num_beams > 32 (#23599)
### Description
* Pass topk_scores to beam scorer in slow topk path.
* Add an env variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk.
* Add a test case for slow topk path.

### Motivation and Context

This bug was introduced in
https://github.com/microsoft/onnxruntime/pull/16272

Beam search uses a fast CUDA kernel when the number of beams is <= 32. When
the beam size is larger than that threshold, we use another code path (a
slower CUDA kernel) to get topk. In this `slow topk path`, topk_scores
should be passed to the beam scorer, but it was not.

This bug causes incorrect results when num_beams > 32. It was not
found previously since such a large beam size is rarely used.
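
As a rough illustration of the path selection described above, here is a Python sketch. Only the threshold of 32 and the variable name `ORT_BEAM_SEARCH_USE_FAST_TOPK` come from this commit message; the function `uses_fast_topk` is hypothetical, and the convention that `"0"` disables the fast path (with unset meaning enabled) is an assumption.

```python
import os

FAST_TOPK_MAX_BEAMS = 32  # threshold stated in the commit message


def uses_fast_topk(num_beams, env=None):
    """Return True when the fast topk kernel would be chosen (illustrative only)."""
    env = os.environ if env is None else env
    # Assumption: "0" disables the fast path; an unset variable leaves it enabled.
    enabled = env.get("ORT_BEAM_SEARCH_USE_FAST_TOPK", "1") != "0"
    return enabled and num_beams <= FAST_TOPK_MAX_BEAMS
```

Under these assumptions, `uses_fast_topk(33, {})` is false, which corresponds to the slow path where the missing `topk_scores` argument mattered.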
2025-02-06 16:50:31 -08:00
Sushanth Rajasankar
82840f635d
Implement Flash Attention 2 for webgpu EP (#23576)
### Description
This change implements FlashAttention 2 for the webgpu EP for the MHA
operator.

Numbers from an Alder Lake device show a 2.2x speedup for prefill, which,
considering that Attention is 50% of the prefill phase (the other 50% being
MatMul), implies a 4x speedup for Attention with this implementation. This
is in line with the expected perf gain of 2-4x with FlashAttention over
regular attention.

```
Baseline
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       9.54997e+06   <<<<<
        avg (tokens/s): 104.817
        p50 (us):       9.49218e+06
        stddev (us):    251442
        n:              5 * 1001 token(s)
------
With FlashAttention 2
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       4.27937e+06     <<<<<
        avg (tokens/s): 233.913
        p50 (us):       4.27687e+06
        stddev (us):    5344.1
        n:              5 * 1001 token(s)
```
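
The headline numbers can be rechecked from the benchmark output above with a few lines of Python. The values are copied from the log; the variable and function names are only for illustration.

```python
PROMPT_TOKENS = 1001

# Average time to first token, in microseconds, from the two runs above.
baseline_us = 9.54997e6
flash_us = 4.27937e6


def tokens_per_second(tokens, elapsed_us):
    """Convert a token count and an elapsed time in microseconds to tokens/s."""
    return tokens / (elapsed_us / 1e6)


speedup = baseline_us / flash_us
baseline_tps = tokens_per_second(PROMPT_TOKENS, baseline_us)
flash_tps = tokens_per_second(PROMPT_TOKENS, flash_us)

print(f"prefill speedup: {speedup:.2f}x")
print(f"tokens/s: {baseline_tps:.3f} -> {flash_tps:.3f}")
```

The ratio comes out at roughly 2.23x, consistent with the 2.2x quoted in the description, and the derived tokens/s match the reported averages.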

### Motivation and Context

On integrated GPUs memory bandwidth is at a premium. Flash attention makes
softmax computation (and therefore output attention vector computation)
a running operation instead of maintaining full QKt attention scores in
memory. As a result, we see significant improvements in prefill speed:
a 200% speed up measured here.
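
To make the "running operation" idea concrete, here is a small pure-Python sketch of streaming (online) softmax attention for a single query row. It is a scalar illustration of the general technique, not the WebGPU shader from this PR; the function names are hypothetical and the 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def attention_row_reference(q, keys, values):
    """Standard attention for one query: materializes every score first."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


def attention_row_streaming(q, keys, values):
    """Same result, keeping only a running max, running sum, and accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        # Rescale what was accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        p = math.exp(s - new_max)
        running_sum = running_sum * rescale + p
        acc = [a * rescale + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 1.5], [0.0, 0.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
ref = attention_row_reference(q, keys, values)
out = attention_row_streaming(q, keys, values)
```

The streaming version never stores the full score vector, which is the property that reduces memory traffic for long prompts.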

This change uses techniques from co-operative matrix multiply to use
registers from a subgroup for fast in-register matrix multiply. Without
the co-operative matrix multiply technique, ALD showed about 6.0s prefill
time.

Tested on ALD/TGL intel integrated and Nvidia 4070.

### Future Work
- Fine tuning and profiling optimizations.
- The current implementation is for prefill only. A generation-phase
optimized FA2 implementation is possible, but attention is a tiny part of
the generation phase.
2025-02-06 16:32:05 -08:00
Ankit Maheshkar
a6ea57b8f3
OpenVINO EP Weights Sharing Feature (#23553)
### Description
These changes ensure that weight sharing happens between two models using the session context option ep_weight_sharing.

Key changes introduced in this feature are:

- Creating a shared context between two models.
- Extracting external constant initializers and relabelling them back as inputs to the model to allow weight loading in the direct blob.
- Creating EP Context Nodes when subgraph partitioning is happening.

### Motivation and Context
This change was required to ensure that LLM with prefill and kvcache models can use the same share
The change was also required to ensure EP Context nodes can be formed even when model is being subgraph partitioned.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
2025-02-06 14:57:38 -08:00
Tianlei Wu
2c2ff4aef9
[CUDA] Fix BeamSearchTest.DummyT5WithSequenceInputIds test failure in Windows (#23596)
### Description
BeamSearchTest.DummyT5WithSequenceInputIds failed on Windows because
early stopping was triggered. The cause is that state.early_stopping_ is
interpreted as true in the CUDA kernel at some point, even though printf
still shows its value as false. The root cause is unknown.

Updating the code to use early_stopping as a template parameter seems to
work around the issue.

Other changes:
* Add some debug code (will not be built into the binary unless
DEBUG_GENERATION is defined) to assist debugging the beam search scorer in
CUDA.
* Enable the DummyT5WithSequenceInputIds test in CI. This test was not run
in the Windows CUDA CI pipeline previously.

### Motivation and Context

Fix a unit test BeamSearchTest.DummyT5WithSequenceInputIds failure in
Windows.
2025-02-06 13:15:09 -08:00
Joshua Lochner
d981b153d3
[webgpu/js] Optimize resize webgpu op & fix precision issues (#23591)
### Description

This PR is a follow-up to
https://github.com/microsoft/onnxruntime/pull/23488 and partially
improves upon https://github.com/microsoft/onnxruntime/issues/23403. It
does the following:
- Prevents unnecessary cache shader recompilation for 'nearest' resize
operation.
- Fixes precision (off-by-one) errors with the asymmetric coordinate
transform. When running the Kokoro TTS model, the values for
`/decoder/decoder/generator/f0_upsamp/Resize_output_0` differ at the
end bounds due to precision issues when dividing
21600 by 72 (should be 300, but seemingly results in 299.999, which
causes issues when flooring).
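
The flooring hazard in the last bullet can be shown with a short Python sketch. It does not reproduce the shader or the fix in this PR; `source_index` is a hypothetical helper that shows one common guard, namely doing the index computation in exact integer arithmetic.

```python
import math


def source_index(dst_index, src_len, dst_len):
    """floor(dst_index * src_len / dst_len) computed with exact integers."""
    return (dst_index * src_len) // dst_len


# A quotient that lands just below 300 floors to 299 instead of 300.
floored_float = math.floor(299.999)

# Integer division has no such rounding error.
exact_scale = 21600 // 72

# Every output position of a 72 -> 21600 upsample maps to dst_index // 300.
mismatches = [i for i in range(21600) if source_index(i, 72, 21600) != i // 300]
```

Where integer arithmetic is not available, a small tolerance before flooring is another option, at the cost of choosing the tolerance carefully.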

### Motivation and Context

I did a deep dive over the weekend to try to fix Kokoro TTS on WebGPU and
found that the above node had a large difference. Thinking this was a
major issue, I spent some time fixing it. It turns out that it only happens
for a small number of values, leading to a high maximum error, but most
values are correct (as seen here).

BEFORE:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125
```

AFTER:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125
```

So, although it has a very small impact on the final output (waveform),
this bug could appear with other models in a more severe way.

BEFORE:
```
[waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432
```

AFTER:
```
[waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694
```
2025-02-06 10:26:25 -08:00
Changming Sun
328a13c06d
Enable VCPKG in more pipelines (#23590)
### Description
Enable VCPKG in more pipelines
2025-02-06 10:10:31 -08:00
Yifan Li
6728d6085d
[TensorRT EP] support TensorRT 10.8-GA (#23592)
2025-02-06 10:05:57 -08:00
Jambay Kinley
d1fb58b0f2
Quantization tool: Allow user to override calibrator's session EP (#23559)
### Description
The quantization calibrators have `execution_providers` attributes but
there is no way for a user to provide their own providers when using the
`quantize` or `quantize_static` functions. This PR adds a
`calibration_providers` parameter to allow users to specify the
execution providers to use during calibration. It is helpful when
quantizing large models which are slow to calibrate on the CPU.
- Chose `calibration_providers` as the name since the
docstrings refer to another `execution_provider`
169917b1e7/onnxruntime/python/tools/quantization/quantize.py (L204)

169917b1e7/onnxruntime/python/tools/quantization/quantize.py (L415)
which are not present anywhere in the code.
- Can change the name to something else if needed like
calibrator_providers, and/or make it into a string instead of a
providers list.
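
A usage sketch for the new parameter follows. Only the `calibration_providers` name and the `quantize_static` function come from this commit message; the helper names, the provider preference order, and the argument values are illustrative assumptions, and the data reader is whatever calibration reader the caller already has.

```python
def pick_calibration_providers(available):
    """Prefer a GPU provider for calibration when present (illustrative policy)."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


def quantize_with_calibration_ep(model_in, model_out, data_reader):
    """Hypothetical wrapper: run static quantization with chosen calibration EPs."""
    # Imported lazily so the helper above works without onnxruntime installed.
    import onnxruntime as ort
    from onnxruntime.quantization import quantize_static

    providers = pick_calibration_providers(ort.get_available_providers())
    quantize_static(
        model_in,
        model_out,
        data_reader,
        calibration_providers=providers,  # parameter added by this PR
    )
```

Keeping the provider choice in a separate function makes the policy easy to test without a model or a GPU.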
2025-02-05 22:38:21 -08:00
Hector Li
649ced4a60
Enable user loading model with external data from memory buffer (#23557)
Add a session option that lets users load a model with external data from a memory buffer. Users need to set the folder path for the external data files.

### Description
In some cases users load the model from a memory buffer but can't load the external files into memory. They need a way to set the folder path for the external data files so that ORT can figure out the external data location.
2025-02-05 22:31:13 -08:00
Satya Kumar Jandhyala
544bdd6073
Fix ConvTranspose for certain attribute combinations (#23488)
### Description
Convert output_padding attribute from 1D to 2D convtranspose



### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/23403
2025-02-05 12:22:47 -08:00
Changming Sun
8f6ddf3bd5
Delete extra cgmanifest entries and files (#23583)
Remove the auto-generated cgmanifest.json, because we can now get the
same information from vcpkg.
Also, remove some outdated entries in the main cgmanifest.json file.
2025-02-05 11:21:21 -08:00
Changming Sun
5f6a3158f8
Enable VCPKG in CI build (#23426)
### Description
1. Enable VCPKG flag in Windows CPU CI build pipelines. 
2. Increased the min supported cmake version from 3.26 to 3.28. Because
of it, drop the support for the old way of finding python by
"find_package(PythonLibs)". Therefore, in build.py we no longer set
"PYTHON_EXECUTABLE" cmake var when doing cmake configure.
3. Added "xnnpack-ep" as a feature for ORT's vcpkg config.
4. Added asset cache support for ORT's vcpkg build
5. Added VCPKG triplet files for Android build.
6. Set VCPKG triplet to "universal2-osx" if CMAKE_OSX_ARCHITECTURES was
found in cmake extra defines.
7. Removed a small piece of code in build.py, which was for supporting
CUDA versions < 11.8.
8. Fixed an issue that CMAKE_OSX_ARCHITECTURES sometimes got specified
twice when build.py invoked cmake.
9. Added more model tests to Android build. After this change, we will
test all ONNX versions instead of just the latest one.
10. Fixed issues that are related to build.py's "--build_nuget"
parameter. Also, enable the flag in most Windows CPU CI build jobs.
11. Removed a restriction in build.py that disallowed cross-compiling
Windows ARM64 nuget package on Windows x86.
 
### Motivation and Context
Adopt vcpkg.
2025-02-05 10:58:53 -08:00
dependabot[bot]
e1e3f623f6
Bump lintrunner from 0.12.5 to 0.12.7 (#23326)
Bumps [lintrunner](https://github.com/suo/lintrunner) from 0.12.5 to
0.12.7.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/suo/lintrunner/blob/main/CHANGELOG.md">lintrunner's
changelog</a>.</em></p>
<blockquote>
<h2>[0.12.7] - 2024-12-05</h2>
<h3>Bug Fixes</h3>
<ul>
<li>Build x86_64 wheels for Windows (<a
href="a4d6b74693">a4d6b74</a>)</li>
<li>Fix <a href="https://doc.rust-lang.org/clippy/">Clippy</a>
violatoins (<a
href="05ff6431bb">05ff643</a>)</li>
<li>Fetch all commit history to fix MacOS builds (<a
href="3770be65ee">3770be6</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1b70da01a6"><code>1b70da0</code></a>
chore(release): prep for 0.12.7</li>
<li><a
href="3770be65ee"><code>3770be6</code></a>
[CI] Fetch full commit history (<a
href="https://redirect.github.com/suo/lintrunner/issues/81">#81</a>)</li>
<li><a
href="b2482aff48"><code>b2482af</code></a>
[CI] Use <code>actions/checkout@v4</code> (<a
href="https://redirect.github.com/suo/lintrunner/issues/80">#80</a>)</li>
<li><a
href="05ff6431bb"><code>05ff643</code></a>
Fix clippy violations (<a
href="https://redirect.github.com/suo/lintrunner/issues/79">#79</a>)</li>
<li><a
href="1be20c6b8f"><code>1be20c6</code></a>
chore(release): prep for 0.12.6</li>
<li><a
href="a4d6b74693"><code>a4d6b74</code></a>
fix(build): build x86_64 wheels for Windows (<a
href="https://redirect.github.com/suo/lintrunner/issues/73">#73</a>)</li>
<li>See full diff in <a
href="https://github.com/suo/lintrunner/compare/v0.12.5...v0.12.7">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-04 19:50:56 -08:00
Jon Campbell
cd8775f518
Fix Node JS Samples (#23581)
### Description
The Node JS Samples included in the repository have outdated package
references that are broken, which are fixed in this PR.

### Motivation and Context
The samples included in this repository should just work, but sadly do
not. The reason is that they are using very outdated references for the
npm modules. This fix updates the dependencies to the current
onnxruntime-node, which fixes the samples. Also adds a small update to
the .gitignore to exclude the node_modules directories in the samples
directory, which keeps the local repo changelist cleaner.
2025-02-04 19:50:29 -08:00
Prathik Rao
6b4f9c481d
[WebGPU EP] Batch Norm Implementation (#23525)
Increases operator coverage for webgpu ep.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-02-04 17:38:45 -08:00
Gavin Kinsey
1fce51b3b2
Fix all instances of 4244 and 4267 warnings in OV EP code (#23567)
### Description
Remove MSVC warnings 4244, 4267 from the list of disabled warnings in
cmake.
Fix the code that generates the warnings so that it no longer does.

### Motivation and Context
This makes onnxruntime_providers_openvino.dll pass BinSkim analysis.
Without this change BinSkim complains about the disabled warnings.
2025-02-04 17:13:27 -08:00
Hector Li
c29ca1cb41
Update QNN default version to 2.31 (#23573)
Update QNN default version to 2.31
2025-02-04 16:24:54 -08:00
Caroline Zhu
2fc75a45a2
[mobile] Add Android BrowserStack test project back (#23551)
## Description
Follow-up for #23383 and #23474

* Adds android BrowserStack test back in
* Modifies MAUI csproj file to build into an APK


### Motivation and Context
There were 2 issues with the previous PRs:
1. The updated MAUI .csproj file configuration failed when building to
iOS and MacCatalyst. This caused problems in the packaging pipeline
because we build all C# projects in the .soln file in the packaging
pipeline. Removed the Mac & iOS build targets for now

2. The previous MAUI .csproj file configuration did not build into an
APK. It was missing the `<OutputType>` XAML tag and the Android package
type XAML tag.
2025-02-04 14:39:50 -08:00
Tianlei Wu
9e18b6a0f3
[CUDA] Update nvcc flags (#23572)
### Description
(1) Remove `if (CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 11)`
since build requires cuda >= 11.4.
(2) Add sm_86 and sm_89 since we generate SASS code for specified cuda
architectures only. This change could support popular consumer GPUs
(like RTX 30X0 and RTX 40X0).
(3) Add sm_120 to support Blackwell GPUs (like RTX 50X0 etc).
(4) Add `-Xfatbin=-compress-all` to reduce wheel size. When
CMAKE_CUDA_ARCHITECTURES is not specified, the linux wheel size built by
CUDA 12.8 is reduced 8% (from 324MB to 299MB).

### Motivation and Context

To support popular consumer GPUs (RTX 30x0, 40x0, 50x0) in the default
setting. Reduce binary size.

Note that the default sm settings do not impact the officially released
binaries. ORT official release binaries are built with an argument like
CMAKE_CUDA_ARCHITECTURES=75;80;90, which has both SASS (real) and PTX
(virtual) by default. See
https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html for
more info.
2025-02-04 11:47:02 -08:00