Commit graph

12346 commits

Author SHA1 Message Date
c0cc6369dd Implementing approach from a new paper read last night (onnxruntime) 2015-05-22 00:04:00 +02:00
46b031c00b Minor doc updates: linking to article on quantization (onnxruntime) 2015-05-04 08:05:00 +02:00
10e0e4c77b Quick fix, referencing a known issue from the official repo (onnxruntime) 2015-05-04 08:45:00 +02:00
bd34bcc8f0 Trying out boneh-franklin approach for IBE (ref. 2003 paper) (onnxruntime) 2015-04-07 19:06:00 +02:00
82ea9b7e4b Minor doc updates: linking to article on quantization (onnxruntime) 2015-04-07 22:43:00 +02:00
bab0005f8f Implementing approach from a new paper read last night (onnxruntime) 2015-04-07 03:53:00 +02:00
77f6d2451c Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2015-04-07 04:54:00 +02:00
670cd61113 Implementing approach from a new paper read last night (onnxruntime) 2015-03-15 02:06:00 +01:00
c27806ca45 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2015-03-15 07:58:00 +01:00
352b192bd9 Refactor for clarity, might break a few tests though (onnxruntime) 2015-01-30 04:57:00 +01:00
203d84fdf3 Minor doc updates: linking to article on quantization (onnxruntime) 2014-09-22 03:38:00 +02:00
808d20cc47 Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2014-09-22 02:04:00 +02:00
08e2251961 Refactor for clarity, might break a few tests though (onnxruntime) 2014-09-22 04:33:00 +02:00
fec9d1fb67 Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2014-08-26 06:02:00 +02:00
aa4e8e31dc Quick fix, referencing a known issue from the official repo (onnxruntime) 2014-08-26 20:50:00 +02:00
b98cacd5ed Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-08-26 17:07:00 +02:00
1ed9494329 Refactor for clarity, might break a few tests though (onnxruntime) 2014-07-22 02:59:00 +02:00
c5c4cdab9b Late-night bugfix on financial RL environment (onnxruntime) 2014-07-22 22:21:00 +02:00
d9111330b2 Late-night bugfix on financial RL environment (onnxruntime) 2014-07-22 21:53:00 +02:00
16db96a5ee Late-night bugfix on financial RL environment (onnxruntime) 2014-07-11 22:08:00 +02:00
5c7e0eaadb Refactor for clarity, might break a few tests though (onnxruntime) 2014-06-06 07:20:00 +02:00
6583d7a9e0 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-06-06 21:06:00 +02:00
07cff3567b Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-06-06 06:21:00 +02:00
8eb35e2ef1 Trying out boneh-franklin approach for IBE (ref. 2003 paper) (onnxruntime) 2014-05-21 18:29:00 +02:00
b01237e408 Quick fix, referencing a known issue from the official repo (onnxruntime) 2014-03-27 20:11:00 +01:00
16e98e7844 Implementing approach from a new paper read last night (onnxruntime) 2014-03-27 20:33:00 +01:00
7d74e4142f Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2014-03-27 02:30:00 +01:00
bbf6c15c38 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-03-13 20:45:00 +01:00
dd57ce7d25 Quick fix, referencing a known issue from the official repo (onnxruntime) 2014-03-13 05:30:00 +01:00
3b6a120e7b Refactor for clarity, might break a few tests though (onnxruntime) 2014-03-13 00:23:00 +01:00
95e97d9aeb Implementing approach from a new paper read last night (onnxruntime) 2014-03-13 03:47:00 +01:00
f8bc8f7d01 Late-night bugfix on financial RL environment (onnxruntime) 2014-02-26 08:47:00 +01:00
0d3fd2dae5 Implementing approach from a new paper read last night (onnxruntime) 2014-02-03 02:30:00 +01:00
8870016009 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-01-15 07:10:00 +01:00
ba8a97128a Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime) 2014-01-03 05:27:00 +01:00
0b529cce22 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-01-03 08:37:00 +01:00
353bef6887 Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime) 2014-01-03 21:38:00 +01:00
77ccf0441d Late-night bugfix on financial RL environment (onnxruntime) 2014-01-03 05:27:00 +01:00
Yifan Li
0274b7b82f
fix on trtCudaVersion (#23616)
### Description
TensorRT 10.8 zip file has suffix of cuda-12.8, not 12.6


2025-02-08 14:20:00 -08:00
Yulong Wang
740e9ab9f8
update run CI script (#23621)
### Description

Add `Win_TRT_Minimal_CUDA_Test_CI`.
2025-02-08 12:28:50 -08:00
shaoboyan091
5ef18328bf
[WebGPU] Support PIX Capture for WebGPU EP (#23192)
The PIX capture tool requires a 'present' call to end a frame capture.
ORT does no rendering work, so no 'present' ever happens.

To keep the PIX capture tool from waiting forever, this PR adds a blank
surface and calls 'present' on it in each session run.

The surface is created in the WebGPU EP constructor and closed in the
WebGPU EP destructor.

2025-02-08 02:05:15 -08:00
Javier Martinez
01145511b1
Fix for C4267 warning (#23610)
### Description
A recent
[commit](1fce51b3b2)
is causing an OVEP warning in
[openvino_provider_factory.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/openvino/openvino_provider_factory.cc#L151).
This PR fixes the warning.

### Motivation and Context
Minor fix
2025-02-07 23:01:28 -08:00
Hector Li
002916acb0
Validate the context_file_path before EP compile graphs (#23611)
Validate context_file_path before the EP compiles graphs so that failures happen fast. This avoids the possibility that the EP generates a new file (a context binary file or blob file) that overwrites an existing file. Return an error if the path points to a folder.
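
The check described above can be pictured with a minimal Python sketch. It is illustrative only and is not the ONNX Runtime implementation: the function name `validate_context_file_path` and the use of `ValueError` are assumptions made for this example.

```python
import os


def validate_context_file_path(path: str) -> None:
    """Fail fast if `path` cannot safely receive a newly generated context file.

    Hypothetical helper: the name and the exception type are assumptions,
    not part of the ONNX Runtime API.
    """
    # A folder cannot be used as the output file.
    if os.path.isdir(path):
        raise ValueError(f"context_file_path points to a folder: {path}")
    # Refuse to overwrite a file that already exists.
    if os.path.exists(path):
        raise ValueError(f"context_file_path already exists: {path}")
```

Running a check like this before any compilation work means an invalid path is reported before time is spent generating a context binary.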
2025-02-07 21:31:11 -08:00
Jie Chen
0887e3694a
[webgpu] Use pushErrorScope()/popErrorScope() once for an inference run (#23438)
The CPU wall time spent waiting for PopErrorScope is non-trivial, and
validation errors are not expected to happen in Release builds.

2025-02-07 13:52:58 -08:00
microsoft-github-policy-service[bot]
65008cbb73
Auto-generated baselines by 1ES Pipeline Templates (#23603) 2025-02-06 17:06:29 -08:00
Tianlei Wu
09e5724f3b
[CUDA] Fix beam search of num_beams > 32 (#23599)
### Description
* Pass topk_scores to beam scorer in slow topk path.
* Add an env variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk.
* Add a test case for slow topk path.

### Motivation and Context

This bug was introduced in
https://github.com/microsoft/onnxruntime/pull/16272

Beam search uses a fast CUDA kernel when the number of beams is <= 32.
When the beam size is larger than that threshold, another code path (a
slower CUDA kernel) computes topk. In this `slow topk path`, topk_scores
should be passed to the beam scorer, but it was not.

This bug causes incorrect results when num_beams > 32. It was not found
previously because such large beam sizes are rarely used.
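
The path selection described above can be sketched in Python. This is only an illustration of the dispatch rule, not the CUDA code: the function name `use_fast_topk` is hypothetical, and the convention that the value `"0"` disables the fast path (with the fast path enabled by default) is an assumption, since the accepted values of `ORT_BEAM_SEARCH_USE_FAST_TOPK` are not stated here.

```python
import os

FAST_TOPK_MAX_BEAMS = 32  # threshold stated in this commit message


def use_fast_topk(num_beams: int, environ=os.environ) -> bool:
    """Return True when the fast topk kernel would be chosen (hypothetical helper)."""
    # Assumption: "0" disables the fast path; unset or any other value keeps it enabled.
    enabled = environ.get("ORT_BEAM_SEARCH_USE_FAST_TOPK", "1") != "0"
    return enabled and num_beams <= FAST_TOPK_MAX_BEAMS
```

Under these assumptions, forcing the slow path with a small beam count would be one way to exercise the code path that previously dropped topk_scores.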
2025-02-06 16:50:31 -08:00
Sushanth Rajasankar
82840f635d
Implement Flash Attention 2 for webgpu EP (#23576)
### Description
This change implements FlashAttention 2 for the webgpu EP for the MHA
operator.

Numbers from an Alder Lake device show a 2.2x speed up for prefill.
Considering that Attention is 50% of the prefill phase (the other 50%
being MatMul), this implies a 4x speed up for Attention with this
implementation. This is in line with the expected perf gain of 2-4x with
FlashAttention over regular attention.

```
Baseline
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       9.54997e+06   <<<<<
        avg (tokens/s): 104.817
        p50 (us):       9.49218e+06
        stddev (us):    251442
        n:              5 * 1001 token(s)
------
With FlashAttention 2
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       4.27937e+06     <<<<<
        avg (tokens/s): 233.913
        p50 (us):       4.27687e+06
        stddev (us):    5344.1
        n:              5 * 1001 token(s)
```
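
As a quick check on the figures in the log above, the following Python sketch recomputes the throughput and the overall prefill speed up from the reported average times. The variable and function names are illustrative; the numbers are copied from the benchmark output.

```python
PROMPT_TOKENS = 1001          # "n: 5 * 1001 token(s)"
BASELINE_AVG_US = 9.54997e6   # baseline avg time to first token
FLASH_AVG_US = 4.27937e6      # avg time with FlashAttention 2


def tokens_per_second(tokens: int, avg_us: float) -> float:
    """Convert an average latency in microseconds into tokens per second."""
    return tokens / (avg_us / 1e6)


baseline_tps = tokens_per_second(PROMPT_TOKENS, BASELINE_AVG_US)
flash_tps = tokens_per_second(PROMPT_TOKENS, FLASH_AVG_US)
prefill_speedup = BASELINE_AVG_US / FLASH_AVG_US

print(f"baseline: {baseline_tps:.3f} tokens/s")
print(f"flash:    {flash_tps:.3f} tokens/s")
print(f"speed up: {prefill_speedup:.2f}x")
```

The recomputed throughputs agree with the `avg (tokens/s)` lines in the log, and the ratio of the two averages is the roughly 2.2x prefill speed up quoted in the description.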

### Motivation and Context

On integrated GPUs, memory bandwidth is at a premium. FlashAttention
makes the softmax computation (and therefore the output attention vector
computation) a running operation instead of keeping the full QKt
attention scores in memory. As a result, we see significant improvements
in prefill speed: about a 2.2x speed up measured here.
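
The "running" softmax idea can be illustrated with a small pure-Python sketch for a single query row. It is not the WebGPU shader from this change, only the general online-softmax technique: scores are consumed one at a time, and earlier partial sums are rescaled whenever a new maximum is seen, so the full row of scores never has to be stored. The function names are illustrative.

```python
import math


def streaming_attention_row(scores, values):
    """Softmax-weighted sum of `values`, consuming one (score, value) pair at a time."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale the partial sums accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * scale + w
        acc = [a * scale + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def naive_attention_row(scores, values):
    """Reference version that materializes every softmax weight first."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]
```

Both functions produce the same result up to floating-point rounding; the streaming version only keeps a running maximum, a running denominator, and one accumulator vector.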

This change uses techniques from co-operative matrix multiply to use
registers from a subgroup for fast in-register matrix multiply. Without
the co-operative matrix multiply technique, ALD showed about 6.0s prefill
time.

Tested on ALD/TGL Intel integrated GPUs and an Nvidia 4070.

### Future Work
- Fine tuning and profiling optimizations.
- The current implementation is for prefill only. A generation-phase
optimized FA2 implementation is possible; however, attention is a tiny
part of the generation phase.
2025-02-06 16:32:05 -08:00
Ankit Maheshkar
a6ea57b8f3
OpenVINO EP Weights Sharing Feature (#23553)
### Description
These changes ensure that weight sharing happens between two models using the session context option ep_weight_sharing.

Key changes introduced in this feature are:

- Creating a shared context between two models.
- Extracting external constant initializers and relabelling them as inputs to the model to allow weight loading in the direct blob.
- Creating EP Context nodes when subgraph partitioning is happening.

### Motivation and Context
This change was required to ensure that LLM with prefill and kvcache models can use the same share
The change was also required to ensure EP Context nodes can be formed even when the model is being subgraph partitioned.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
2025-02-06 14:57:38 -08:00
Tianlei Wu
2c2ff4aef9
[CUDA] Fix BeamSearchTest.DummyT5WithSequenceInputIds test failure in Windows (#23596)
### Description
BeamSearchTest.DummyT5WithSequenceInputIds failed on Windows because
early stopping was triggered. The cause is that state.early_stopping_ is
interpreted as true in the CUDA kernel at some point, even though printf
still shows its value as false. The root cause is unknown.

Updating the code to use early_stopping as a template parameter seems to
work around the issue.

Other changes:
* Add some debug code (will not be built into the binary unless
DEBUG_GENERATION is defined) to assist debugging the beam search scorer in
CUDA.
* Enable DummyT5WithSequenceInputIds test in CI. This test was not run
in Windows CUDA CI pipeline previously.

### Motivation and Context

Fix a unit test BeamSearchTest.DummyT5WithSequenceInputIds failure in
Windows.
2025-02-06 13:15:09 -08:00
Joshua Lochner
d981b153d3
[webgpu/js] Optimize resize webgpu op & fix precision issues (#23591)
### Description

This PR is a follow-up to
https://github.com/microsoft/onnxruntime/pull/23488 and partially
improves upon https://github.com/microsoft/onnxruntime/issues/23403. It
does the following:
- Prevents unnecessary cache shader recompilation for 'nearest' resize
operation.
- Fixes precision (off-by-one) errors with the asymmetric coordinate
transform. When running the Kokoro TTS model, values for
`/decoder/decoder/generator/f0_upsamp/Resize_output_0` differ at the end
bounds due to precision issues when dividing 21600 by 72 (should be 300,
but seemingly results in 299.999, which causes issues when flooring).
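
The flooring problem mentioned in the last item can be illustrated in Python. The value 299.999 is taken from the text above; the two helpers are hypothetical ways to avoid the off-by-one and are not necessarily how this PR fixes the shader.

```python
import math

# Value reported above in place of the exact quotient 21600 / 72 = 300.
approx_quotient = 299.999
off_by_one = math.floor(approx_quotient)  # 299, one less than intended


def floor_div_exact(numerator: int, denominator: int) -> int:
    """Integer division avoids the rounding error entirely (hypothetical helper)."""
    return numerator // denominator


def snap_floor(x: float, tol: float = 1e-2) -> int:
    """Floor, but snap to the nearest integer first when x is within `tol` of it.

    The tolerance is an assumption chosen for this illustration.
    """
    nearest = round(x)
    if abs(x - nearest) < tol:
        return nearest
    return math.floor(x)
```

Here `floor_div_exact(21600, 72)` gives 300, and `snap_floor(299.999)` also gives 300 while still flooring values such as 299.5 down to 299.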

### Motivation and Context

I did a deep dive over the weekend to try to fix Kokoro TTS on WebGPU and
found that the above node had a large difference. Thinking this was a
major issue, I spent some time fixing it. Turns out, it only happens for
a small number of values, leading to high maximum error, but most values
are correct (as seen here).

BEFORE:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125
```

AFTER:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125
```

So, although it has a very small impact on the final output (waveform),
this bug could appear with other models in a more severe way.

BEFORE:
```
[waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432
```

AFTER:
```
[waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694
```
2025-02-06 10:26:25 -08:00