### Description
<!-- Describe your changes. -->
TensorRT 10.8 zip file has suffix of cuda-12.8, not 12.6
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
PIX Capture tool requires 'present' to end a frame capture. ORT doesn't
have rendering work so no 'present' happens.
To avoid endless waiting for PIX capture tool, this PR added a blank
surface and 'present' on it in each session run.
The surface is created in WebGPU ep constructor and closed in WebGPU ep
destructor.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Validate the context_file_path before EP compile graphs to make it fail fast. To avoid the possibility that EP generate new file (context binary file or blob file) over write the existing file. Return error if the path points to folder.
The CPU walltime of waiting for PopErrorScope is non-trivial, and also
validation errors are not expected to happen in Release build.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
* Pass topk_scores to beam scorer in slow topk path.
* Add an env variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk.
* Add a test case for slow topk path.
### Motivation and Context
This bug was introduced in
https://github.com/microsoft/onnxruntime/pull/16272
Beam search uses fast cuda kernel when number of beams <= 32. When beam
size is larger than that threshold, we use another code path (slower
cuda kernel) to get topk. In such `slow topk path`, topk_scores shall be
passed to beam scorer but it is not.
This bug will cause incorrect result when num_beams > 32. It was not
found previously since such large beam size is rarely used.
### Description
This change implements FlashAttention 2 for the webgpu EP for the MHA
operator.
Numbers from Alderlake device show a 2.2x speed up for prefill, which
considering that Attention is 50% of prefill phase (other 50% being
MatMul) implies 4x speed up for Attention with this implementation. This
is inline with the expected perf gain of 2-4x with FlashAttention over
regular attention.
```
Baseline
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 9.54997e+06 <<<<<
avg (tokens/s): 104.817
p50 (us): 9.49218e+06
stddev (us): 251442
n: 5 * 1001 token(s)
------
With FlashAttention 2
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 4.27937e+06 <<<<<
avg (tokens/s): 233.913
p50 (us): 4.27687e+06
stddev (us): 5344.1
n: 5 * 1001 token(s)
```
### Motivation and Context
On integrated GPUs memory bandwidth is premium, Flash attention makes
softmax computation (and therefore output attention vector computation)
a running operation instead of maintaining full QKt attention scores in
memory. As a result, we see significant improvements in prefill speed -
200% speed up measured here.
This change uses techniques from co-operative matrix multiply to use
registers from a subgroup for fast in register matrix multiply. Without
the co-operative matrix multiply technique ALD showed about 6.0s prefill
time.
Tested on ALD/TGL intel integrated and Nvidia 4070.
### Future Work
- Fine tuning and profiling optimizations.
- Current implement is for prefill only, a generation phase optimized
FA2 implementation is possible, however attention is a tiny part of the
generation phase.
### Description
These changes are done to ensure that weight sharing happens between two model using session context option ep_weight_sharing.
Key changes introduced in this feature are:
Creating a shared context between two models Extracting external constant initializers and re labelling them back as
inputs to the model to allow weight loading in the direct blob. Creating EP Context Nodes when Subgraph partitioning is happening.
### Motivation and Context
This change was required to ensure that LLM with prefill and kvcache models can use the same share
The change was also required to ensure EP Context nodes can be formed even when model is being subgraph partitioned.
---------
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
### Description
BeamSearchTest.DummyT5WithSequenceInputIds failed in Windows due to
early stopping triggered. The cause is state.early_stopping_ is
interpreted as true in cuda kernel at some point, however printf still
show its value is false. The root cause is unknown.
Update the code to use early_stopping as template parameter seems walk
around the issue.
Other changes:
* Add some debug code (will not be built into binary unless
DEBUG_GENERATION is fined) to assist debugging beam search scorer in
CUDA.
* Enable DummyT5WithSequenceInputIds test in CI. This test was not run
in Windows CUDA CI pipeline previously.
### Motivation and Context
Fix a unit test BeamSearchTest.DummyT5WithSequenceInputIds failure in
Windows.
### Description
<!-- Describe your changes. -->
This PR is a follow-up to
https://github.com/microsoft/onnxruntime/pull/23488 and partially
improves upon https://github.com/microsoft/onnxruntime/issues/23403. It
does the following:
- Prevents unnecessary cache shader recompilation for 'nearest' resize
operation.
- Fixes precision (offset-by-one) errors with asymmetric coordinate
transform. When running the Kokoro TTS model, values for the
`/decoder/decoder/generator/f0_upsamp/Resize_output_0` results in
differences at the end bounds due to precision issues when dividing
21600 by 72 (should be 300, but seemingly results in 299.999, which
causes issues when flooring)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
I did a deep dive over the weekend to try fix Kokoro TTS on WebGPU and
found that the above node had a large difference. Thinking this was a
major issue, I spent some time fixing it. Turns out, it only happens for
a small number of values, leading to high maximum error, but most values
are correct (as seen here).
BEFORE:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125
```
AFTER:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125
```
So, although it has a very small impact on the final output (waveform),
this bug could appear with other models in a more severe way.
BEFORE:
```
[waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432
```
AFTER:
```
[waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694
```