Commit graph

11997 commits

Author SHA1 Message Date
Yi Zhang
e96a038f01
Add VP test in Stable diffusion pipeline (#19300)
### Description
1. Add visual parity test based on openai clip model
2. Add trigger rules

### Motivation and Context
1. check generated image is expected
2. reduce unnecessary triggers
2024-01-29 09:33:58 -08:00
PeixuanZuo
82c1cb416b
[CUDA] Refactor GroupNorm and add common vectorize implementation (#19158)
Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-01-29 09:15:10 +08:00
Adrian Lizarraga
6d7ac9c93a
Support general session config entries in perf test tool (#19289)
### Description
Adds the ability to specify general session configuration entries via
the `-C` command-line option.
Example: `-C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"`

Some session config entries can already be set via dedicated
command-line options. If the user uses multiple command-line options to
set the same session config entry, we'll print a warning. Note that the
dedicated command-line options will take precedence.

### Motivation and Context
Allows setting session configurations when testing EPs. QNN EP, for
example, uses the `session.disable_cpu_ep_fallback` and `ep.context_*`
options.
2024-01-26 19:51:48 -08:00
Tianlei Wu
d7ff81dfb7
[CUDA] support user_compute_stream in python API (#19229)
### Description
It is an important feature to pass user cuda stream to avoid
synchronization in python API. Here we allow user to pass cuda stream
for CUDA provider. Note that TRT or ROCm provider need similar change,
which are not included in this pull request.

Note that we will set `has_user_compute_stream` automatically based on
whether there is cuda stream passed, so setting
`has_user_compute_stream` through python API has no effect.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

https://github.com/microsoft/onnxruntime/issues/19094
2024-01-26 10:34:43 -08:00
cao lei
7d4dc66846
ExecutionProvider API refactor - make GenerateMetaDefId a standalone function, decouple it from EP (#18977)
### Description
<!-- Describe your changes. -->
Make EP's member function, GenerateMetaDefId, a standalone function
which decouples from EP


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change is for ExecutionProvider API refactoring, we will make a
clean ExecutionProvider API first for later EPv2 work
2024-01-26 07:39:08 -08:00
Baiju Meswani
fc44f96ad5
Add support for a collection of OrtValue as inputs and outputs to C# TrainingSession (#19048) 2024-01-25 21:55:36 -08:00
Tianlei Wu
358650d441
Fix BigModel stable diffusion pipeline (#19277)
### Description
Fix two issues:
(1) We can only use single quote inside `bash -c "..."`. Current
pipeline job stopped at `python3 demo_txt2img.py astronaut` and skip the
following commands. In this change, we remove the remaining commands to
get same effect (otherwise, the pipeline runtime might be 2 hours
instead of 15 minutes).
(2) Fix a typo of Stable.
2024-01-25 17:19:04 -08:00
Xu Xing
a3f0e2422b
[js/webgpu] Support f16 uniform (#19098)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-25 16:58:22 -08:00
Tianlei Wu
8b4517218b
Remove USE_CUTLASS flag (#19271)
### Description
Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for
onnxruntime CUDA build), there is no need to have a flag to disable
cutlass.

Changes:
(1) Reverted https://github.com/microsoft/onnxruntime/pull/18761
(2) remove the condition to build cutlass.
(3) Fix a few build errors or warnings during testing CUDA 11.4 build. 

Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if
you want to support latest GPUs.

It is better to include it in 1.17.0 (otherwise, the release branch
might encounter build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release  --build_shared_lib --parallel  --use_cuda --cuda_version 11.4 \
            --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --disable_types float8
```

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-25 16:57:58 -08:00
Xu Xing
656ca66186
[js/webgpu] Support uniforms for conv, conv transpose, conv grouped (#18753) 2024-01-25 15:37:05 -08:00
Chi Lo
a2867b911e
[TensorRT EP] Fix mem leak for TRT plugins custom ops (#19248)
TRT EP's GetTensorRTCustomOpDomainList() will create vector of
OrtCustomOpDomain objects and release the ownership of those objects.
But, thoses objects are not released forever.
In session level, we need to make TRT EP remember what OrtCustomOpDomain
objects it created and release them at EP destruction time.
2024-01-25 11:51:39 -08:00
Tianlei Wu
2b285cd78a
[CUDA] Add functions to dump bfloat16 tensors (#19266)
### Description
GroupQueryAttention add BFloat16 in
https://github.com/microsoft/onnxruntime/pull/19095, and there is build
error when enable dumping. This supports print bfloat16 tensor to
console.
2024-01-25 09:30:15 -08:00
Jiajie Hu
5b06505073
[js/webgpu] Fix Tanh explosion (#19201)
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```

### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and
$\frac{\infty}{\infty}$ explodes).
2024-01-25 08:25:35 -08:00
PeixuanZuo
1c92e56dc0
[Cuda] Refactor GroupNorm (#19146)
Split GroupNorm implementation into multiple files, to make ROCm EP can
reuse cuda code.

Related PR: https://github.com/microsoft/onnxruntime/pull/19158

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-01-25 22:28:47 +08:00
Vincent Wang
2b87dd373a
[ORTModule] Remove Mod from Hash to Avoid Conflict for Triton Code-gen (#19256)
Remove mod (10**8) from hash to avoid conflict for Triton code-gen.
2024-01-25 10:16:41 +08:00
Dmitri Smirnov
7dd1f4b8e2
Pad-18 Cuda implementation (#19211)
### Description
Implement Pad-18 for Cuda.

### Motivation and Context
Latest models converted by Dynamo fall back on CPU for Pad with
performance degradation.

This contributes to
https://github.com/microsoft/onnx-rewriter/issues/126
2024-01-24 18:12:04 -08:00
Phoebe Chen
4477f57ee3
Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238)
### Description  
This pull request introduces the necessary changes to enable RISC-V
64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V
architecture has gained popularity as an open standard instruction set
architecture, and this contribution aims to extend ONNX Runtime's
compatibility to include RISC-V, thereby broadening the reach of ONNX
models to a wider range of devices.

### Motivation and Context
RISC-V is a free and open-source instruction set architecture (ISA)
based on established RISC principles. It is provided under open licenses
without fees. Due to its extensibility and freedom in both software and
hardware, RISC-V is poised for widespread adoption in the future,
especially in applications related to AI, parallel computing, and data
centers.

### Example Build Command
```
./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests
```

### Documentation Updates
Relevant sections of the documentation will be updated to reflect the
newly supported RISC-V 64-bit cross-compilation feature.
https://github.com/microsoft/onnxruntime/pull/19239

---------

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-24 16:27:05 -08:00
Wanming Lin
0c2f0ba90d
[WebNN EP] Support conv1d by reshaping with prepended 1's (#18857)
WebNN only supports 4-D inputs for conv2d and convTranspose2d, this PR
supports 3-D inputs (i.e. conv1d) by prepending a 1 size dimension and
several reshape operations.
2024-01-24 15:53:10 -08:00
Wanming Lin
7252c6e747
[WebNN EP] Support WebNN async API with Asyncify (#19145) 2024-01-24 15:37:35 -08:00
Yufeng Li
c456f19dba
remove old quantization tool file (#19247)
### Description
<!-- Describe your changes. -->
remove old python files


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We have a new op MatMulNBits and this one is deprecated.
2024-01-24 15:20:36 -08:00
Yang Gu
591f90c0b9
[js/webgpu] Fix issue of timestamp query (#19258)
When we enable webgpu profiling mode between session.create and
session.run, current implementation has a problem to create querySet
(and also queryResolveBuffer) if we share the commandEncoder with inputs
upload. This PR fixes this by moving the querySet creation to the place
we set queryType.
2024-01-24 14:49:37 -08:00
Changming Sun
bc54ad3f03
Update abseil to a release tag and register neural_speed (#19255)
### Description
Update abseil to a release tag and register neural_speed to CG.


### Motivation and Context
Now we are using a non-relesed version of abseil. Using a tag is better.
2024-01-24 14:37:39 -08:00
Changming Sun
a28abeb241
Change "#ifdef WIN32" to "#ifdef _WIN32" (#19254)
### Description
`_WIN32` is a standard macro listed at
https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170
. But `WIN32` is not.
2024-01-24 14:35:44 -08:00
satyajandhyala
a33b5bd1fa
[JS/WebGPU] Added Uniforms to SkipLayerNorm. (#18788)
### Description
Added Uniforms to SkipLayerNorm



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-01-25 01:12:21 +05:30
Sheil Kumar
a39ac4a979
[DirectML] Register Pad19 (#19175)
### Description
Register Pad19 in DirectML

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-01-24 10:06:31 -08:00
Yi Zhang
d7aebf9ea8
Move Nuget Test from T4 to A10 to reduce release duration (#19253)
### Description
<!-- Describe your changes. -->



### Motivation and Context
Running release process is very painful and boring because some GPU jobs
have to wait so long time.

![image](https://github.com/microsoft/onnxruntime/assets/16190118/1c5c981e-68d4-4678-9758-443fbf362802)

![image](https://github.com/microsoft/onnxruntime/assets/16190118/ba0d79ba-1554-4c7a-93dd-6ea8144c9295)

![image](https://github.com/microsoft/onnxruntime/assets/16190118/36cab833-71c1-4ff5-bca5-f4caa9aee0c9)
On the one hand, we could move some T4 from PR process since some jobs
are not using T4 any more and on the other hand, we can continue to
change some jobs' agent from T4 to A4 too.

In the future, T4 will mainly be used for the scenarioes that big GPU
memory is needed, multiple GPU cards or some special cases.


Test runs:

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=401786&view=logs&j=8048494c-e6eb-5e47-5e87-ff0aa863325d

cc @YUNQIUGUO @snnn
2024-01-24 14:15:07 +08:00
Chi Lo
c10be1848c
[TensorRT EP] Avoid calling unavailable function with cpu python package (#19251)
C.register_tensorrt_plugins_as_custom_ops() is only available in gpu
python package.
Add condition to avoid calling it in cpu python package.
2024-01-23 21:30:22 -08:00
Ye Wang
6a424ccf8c
Fix AMD pipeline test failures (#19250)
### Description
<!-- Describe your changes. -->

Fix amd test failure

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-23 19:33:49 -08:00
aciddelgado
cbb29d80ff
GQA Rotary and Packed QKV with Flash (#18906)
### Description
These changes add rotary embedding and packed qkv input to gqa. As of
now, the changes are only supported with Flash-Attention (SM >= 80) but
should soon be supported with Memory Efficient Attention as well.



### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to
observe some perf gain. The packed QKV should also provide some perf
gain in the context of certain models, like Llama2, that would benefit
from running ops on the fused QKV matrix, rather than the separate Q, K,
and V.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2024-01-23 16:34:26 -08:00
Wei-Sheng Chin
532f8c642c
Fix a backend test by using local backend (#19230)
The decomposition pass (e.g., converting torch.add to aten.add) in DORT
no longer exists. Therefore, we have to use `use_aot_autograd=True` to
enable Dynamo's built-in operator decomposition. I think we need to add
the decomposition pass back to DORT or remove `use_aot_autograd` (remove
because it will always be `true`).
2024-01-23 14:57:30 -08:00
petermcaughan
f53068446e
Add Temperature to WhisperBeamSearch input (#19188)
### Description
<!-- Describe your changes. -->
Add `temperature` as an input to WhisperBeamSearch op and initialize
correctly in parameter setup.


### Motivation and Context
Currently, temperature is included as an attribute to the BeamSearch op,
which doesn't let the model act dynamically in a single inference
session. By including this variable as an input, the temperature value
can be altered in any inference call (important for 1P teams)

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
2024-01-23 13:44:34 -08:00
Yi Zhang
54871a2773
Replace T4 to A10 in Linux GPU workflow (#19205)
### Description
1. Update Linux GPU  machine from T4 to A10, sm=8.6
2. update the tolerance 

### Motivation and Context
1. Free more T4 and test with higher compute capability.
2. ORT enables TF32 in GEMM for A10/100. TF32 will cause precsion loss
and fail this test
```
2024-01-19T13:27:18.8302842Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-19T13:27:25.8438153Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:25.8438641Z Expected equality of these values:
2024-01-19T13:27:25.8438841Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:25.8439276Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:25.8439464Z   ret.first
2024-01-19T13:27:25.8445514Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:25.8445962Z expected 0.145984 (3e157cc1), got 0.975133 (3f79a24b), diff: 0.829149, tol=0.0114598 idx=375. 20 of 388 differ
2024-01-19T13:27:25.8446198Z 
2024-01-19T13:27:25.8555736Z [  FAILED  ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12, where GetParam() = "cuda_../models/zoo/opset12/SSD/ssd-12.onnx" (7025 ms)
2024-01-19T13:27:25.8556077Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_YOLOv312_yolov312
2024-01-19T13:27:29.3174318Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:29.3175144Z Expected equality of these values:
2024-01-19T13:27:29.3175389Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:29.3175812Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:29.3176080Z   ret.first
2024-01-19T13:27:29.3176322Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:29.3178431Z expected 4.34958 (408b2fb8), got 4.51324 (40906c80), diff: 0.16367, tol=0.0534958 idx=9929. 22 of 42588 differ

```
3. some other test like SSD throw other exception, so skip them
'''
2024-01-22T09:07:40.8446910Z [ RUN ]
ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-22T09:07:51.5587571Z
/onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:358:
Failure
2024-01-22T09:07:51.5588512Z Expected equality of these values:
2024-01-22T09:07:51.5588870Z   COMPARE_RESULT::SUCCESS
2024-01-22T09:07:51.5589467Z     Which is: 4-byte object <00-00 00-00>
2024-01-22T09:07:51.5589953Z   ret.first
2024-01-22T09:07:51.5590462Z     Which is: 4-byte object <01-00 00-00>
2024-01-22T09:07:51.5590841Z expected 1, got 63
'''
2024-01-23 10:49:24 -08:00
Heflin Stephen Raj
0ea48fc73e
Modified the condition to load the optimiser model (#18891) 2024-01-23 10:10:54 -08:00
Xu Xing
61610ff986
[js/webgpu] Add FusedConv clip test case (#18900)
Bug: https://github.com/microsoft/onnxruntime/issues/18899
2024-01-23 08:25:05 -08:00
Tianlei Wu
6ca7c1a933
unet fusion for stable diffusion webui (#19227)
### Description
Update unet fusion for [stable diffusion webui
extension](https://github.com/tianleiwu/Stable-Diffusion-WebUI-OnnxRuntime):
(1) Update fusion pattern to support fp16 unet model.
(2) Add progress bar
(3) Use a cached map to speed up dtype or shape lookup in shape
inference result.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-22 20:42:30 -08:00
Jeff Daily
b2aec41a83
[ROCm] enable hipGraph (#18382)
This ports the cudaGraph support from the CUDA EP to the ROCM EP's
hipGraph.
2024-01-23 11:17:04 +08:00
Adrian Lizarraga
37d14d7896
[QNN EP] Create Windows ARM64 nightly python package (#19128)
### Description
Adds a job to create a nightly python package for ORT/QNN on Windows
ARM64.
Must build onnxruntime-qnn with python 3.11 and numpy 1.25.

**Note: pipeline run may take up to 3 hrs**

### Motivation and Context
Make it possible to get a nightly python package with the latest updates
to QNN EP.
Issue #19161
2024-01-22 18:14:41 -08:00
Jiajia Qin
d226e40856
[js/webgpu] set query type in onRunStart (#19202)
### Description
<!-- Describe your changes. -->
`env.webgpu.profiling` is a global flag. It may change before each
session.run. So the best place is to update it in `onRunStart` event.
After this, we can directly check `this.queryType`'s value. Without this
pr, we need to make sure that `getCommandEncoder()` is called before
checking `this.queryType`. Otherwise, it may happen that
`pendingKernels`'s length is not equal to `pendingDispatchNumber`'s
length. See the two ugly workarounds
[1)](e630dbf528 (diff-006fc84d3997f96a29b8033bd2075d6a0a9509211bd5812a6b934fc74fedfd9dR267-R268))
and
[2)](e630dbf528 (diff-618fe297fbe7a1da586380163b8fd2627311ccc217640a3c5cdc9c17a33472c1R73-R80))
if we don't introduce `onRunStart`. Or we need to call `setQueryType` in
each kernel run.
2024-01-22 16:08:55 -08:00
Jiajia Qin
2e0a388c36
[js/webgpu] Add HardSigmoid support (#19215)
### Description
This op is required in mobilenetv3-small-100. With this PR,
mobilenetv3-small-100 model becomes less than 10 ms from over 100 ms on
ADL.
2024-01-22 15:53:26 -08:00
Yifan Li
e283cdb218
Fix Fuzz Testing CI (#19228)
### Description
<!-- Describe your changes. -->
Add BuildArch

To verify:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=400952&view=logs&j=5b022bb4-70a7-5401-8766-a8a7802c7150&t=291e85c7-5547-590b-50de-4e01fcd4eba3&l=14

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-22 15:44:57 -08:00
Linnea May
24b74aebcb
[DML] Register DML operators for opset 19 (#16939)
### Description
<!-- Describe your changes. -->
Register DML operators for opset 19. 
- Cast19
- Castlike19
- Constant19 
- Equal19
- Identity19
- QuantizeLinear19
- DequantizeLinear19
- Reshape19
- Shape19
- Size


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: linnealovespie <linneamay@microsoft.com>
2024-01-22 15:37:09 -08:00
snadampal
77da2ef278
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031)
### Description
This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to
implement matrix multiplication with bfloat16 SIMD instructions (bfmmla)
and MatMul operator changes to invoke the Sbgemm kernel. To enable
Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.

### Motivation and Context

This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x
-1.76x performance improvement compared to sgemm (fp32) kernel
performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results are matching to sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
2024-01-22 14:43:06 -08:00
Yi Zhang
780acda7b4
Add Big models pipeline (#19222)
### Description
2 models are added in CI.
Stabe diffusion Model stage is based on
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

LLama2 FP16 is based on https://github.com/microsoft/Llama-2-Onnx.
12G GPU memory is not enough, so I choose T4 to run it.

### Motivation and Context
Add regular E2E test for big models. 
It will be triggered in main build, that is, it'll run after one PR is
merged.

More models will be added later.

### Test Runs ###

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1275191&view=results
2024-01-22 14:02:56 -08:00
Adrian Lizarraga
8d9d751179
[QNN EP] Expose device-level session options (#19212)
### Description
- Adds the following session options to configure the device:
- `soc_model`: The SoC model number. Refer to the QNN SDK documentation
for valid values. Defaults to "0" (unknown).
- `htp_arch`: The minimum HTP architecture the driver will use to select
compatible QNN operators.
- `device_id`: The ID of the device to use when setting 'htp_arch'.
Defaults to "0" (for single device).

### Motivation and Context
Allow more configuration.
2024-01-22 12:47:42 -08:00
Zhang Lei
373ebac167
Zhalei/fix seqoutput type (#18765)
After refactoring beamsearch, all scores become fp32. Yet it need
support fp16 according to original specs.
2024-01-22 10:40:48 -08:00
Ye Wang
21034a2c37
phi2 contrib ops changes (#19112)
### Description
<!-- Describe your changes. -->
1. support causal mask in MHA cpu
2. support custom rotary_dim in rotary_emb
3. add bf16 for rotary_emb
4. fix a bug in attention rotary


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-22 10:17:11 -08:00
Chi Lo
f3402de01e
[TensorRT EP] Enhance EP context configs in session options and provider options (#19154)
Several changes:

1. To align with other EPs' setting of EP context configs in session
options, for example [QNN
EP](https://github.com/microsoft/onnxruntime/pull/18877), EP context
configs for TRT EP can be configured through:
1. Session Options: `ep.context_enable`, `ep.context_file_path` and
`ep.context_embed_mode`
2. Provider Options: `trt_dump_ep_context_model`,
`trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
3. Above setting has 1:1 mapping and provider options has higher
priority over session options.
    
```
    Please note that there are rules for using following context model related provider options:

     1. In the case of dumping the context model and loading the context model,
        for security reason, TRT EP doesn't allow the "ep_cache_context" node attribute of EP context node to be
        the absolute path or relative path that is outside of context model directory.
        It means engine cache needs to be in the same directory or sub-directory of context model.

     2. In the case of dumping the context model, the engine cache path will be changed to the relative path of context model directory.
        For example:
        If "trt_dump_ep_context_model" is enabled and "trt_engine_cache_enable" is enabled,
           if "trt_ep_context_file_path" is "./context_model_dir",
           - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
           - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
```    

2. User can decide the naming of the dumped "EP context" model by using
`trt_ep_context_file_path`, please see GetCtxModelPath() for more
details.

3. Added suggested comments from
https://github.com/microsoft/onnxruntime/pull/18217
2024-01-21 10:51:58 -08:00
Edward Chen
c8ce83967e
Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209) 2024-01-19 15:30:09 -08:00
Hector Li
6e17571f2f
Fix issue that the generated context cache model inputs/outputs order is not guaranteed (#19195)
Fix issue that the generated context cache model inputs/outputs order is not guaranteed

### Description
Currently, QNN EP generate the context cache model in Compile() method which only get access to the partitioned graph. And the inputs/outputs order for the partitioned graph is not guaranteed. And EP doesn't have the view of the input user model. Have to move the context cache model generation to a higher level in GraphPartitioner which has the view of the partitioned model.
This is also a break down of PR for multi-partition support.
https://github.com/microsoft/onnxruntime/pull/18865
2024-01-19 15:16:17 -08:00
kunal-vaishnavi
a3ecb63267
Update LLaMA attention fusions (#19200)
### Description
This PR updates the LLaMA-2 attention fusions by adding the following.

- Loading the PyTorch model from Hugging Face with the `LlamaAttention`
class before exporting
- Updating the attention mask pattern matching to support another case

This PR also fixes [this
issue](https://github.com/microsoft/onnxruntime/issues/19040).

### Motivation and Context
Recent changes to Hugging Face's `transformers` library break the
existing pattern matching. Since the attention fusions aim to change the
graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to
`LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, ultimately
it does not matter what nodes comprise the `Set of Attention Nodes`
because they will all be removed and replaced by the `Attention Op` in
the end.

Therefore, it does not matter whether the `LlamaAttention` class or a
different attention class is used to load the PyTorch model before
exporting because the expected graphs after the attention fusions will
look identical no matter the attention class chosen. By loading the
PyTorch model with the `LlamaAttention` class instead of other attention
classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then
exporting it to ONNX, the existing pattern matching will continue to
work.
2024-01-19 11:09:24 -08:00