Commit graph

10406 commits

Author SHA1 Message Date
snadampal
77da2ef278
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031)
### Description
This PR adds SbgemmKernel for aarch64. This includes the Sbgemm kernel,
which implements matrix multiplication with bfloat16 SIMD instructions
(bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To
enable the Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.
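
To illustrate the numerical idea behind an fp32 matmul computed with bfloat16 inputs, here is a minimal pure-Python sketch. It is not the MLAS kernel: `to_bfloat16` and `matmul_bf16_inputs` are hypothetical helper names, rounding is emulated in software, and accumulation happens in Python floats rather than in fp32 registers as on hardware.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float.

    NaN and overflow handling are omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the upper 16 bits of the fp32 pattern; round the rest away.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def matmul_bf16_inputs(a, b):
    """Multiply a (m x k) by b (k x n) after rounding both inputs to bfloat16."""
    a16 = [[to_bfloat16(v) for v in row] for row in a]
    b16 = [[to_bfloat16(v) for v in row] for row in b]
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b16)]
            for row in a16]
```

Values that need more than 8 significant bits lose precision on the way in (for example, `to_bfloat16(0.1)` is not exactly the fp32 value of 0.1), which is why precision is compared against the sgemm kernel.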

### Motivation and Context

This is to improve MatMul performance on the aarch64 platform.
I ran the benchmarking script below (bert, roberta and gpt2 model
inference) on an AWS Graviton3-based c7g.4xl instance and observed a
1.2x-1.76x performance improvement compared to the sgemm (fp32)
kernel.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
The unit test precision results match the sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync`
2024-01-22 14:43:06 -08:00
Yi Zhang
780acda7b4
Add Big models pipeline (#19222)
### Description
Two models are added to CI.
The Stable Diffusion model stage is based on
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

LLama2 FP16 is based on https://github.com/microsoft/Llama-2-Onnx.
12G of GPU memory is not enough, so I chose T4 to run it.

### Motivation and Context
Add a regular E2E test for big models.
It is triggered in the main build, that is, it runs after a PR is
merged.

More models will be added later.

### Test Runs ###

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1275191&view=results
2024-01-22 14:02:56 -08:00
Adrian Lizarraga
8d9d751179
[QNN EP] Expose device-level session options (#19212)
### Description
- Adds the following session options to configure the device:
  - `soc_model`: The SoC model number. Refer to the QNN SDK documentation
    for valid values. Defaults to "0" (unknown).
  - `htp_arch`: The minimum HTP architecture the driver will use to select
    compatible QNN operators.
  - `device_id`: The ID of the device to use when setting 'htp_arch'.
    Defaults to "0" (for single device).

### Motivation and Context
Allow more configuration.
2024-01-22 12:47:42 -08:00
Zhang Lei
373ebac167
Zhalei/fix seqoutput type (#18765)
After refactoring beamsearch, all scores became fp32. Yet they need to
support fp16 according to the original specs.
2024-01-22 10:40:48 -08:00
Ye Wang
21034a2c37
phi2 contrib ops changes (#19112)
### Description
1. support causal mask in MHA cpu
2. support custom rotary_dim in rotary_emb
3. add bf16 for rotary_emb
4. fix a bug in attention rotary
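
As a rough illustration of items 1 and 2, the sketch below applies a rotary embedding to only the first `rotary_dim` elements of a vector and builds a causal mask. It is a pure-Python sketch under assumptions: the half-split pairing of elements, the `base` of 10000, and the function names are illustrative choices, not taken from the contrib op implementation.

```python
import math


def rotary_embed(x, position, rotary_dim, base=10000.0):
    """Rotate the first rotary_dim elements of x; leave the rest untouched."""
    half = rotary_dim // 2
    out = list(x)
    for i in range(half):
        # Each (i, i + half) pair is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```

With `rotary_dim` smaller than the head size, the trailing elements pass through unchanged, which is the point of a custom rotary dimension.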


### Motivation and Context
2024-01-22 10:17:11 -08:00
Chi Lo
f3402de01e
[TensorRT EP] Enhance EP context configs in session options and provider options (#19154)
Several changes:

1. To align with how other EPs set EP context configs in session
options, for example [QNN
EP](https://github.com/microsoft/onnxruntime/pull/18877), EP context
configs for TRT EP can be configured through:
   1. Session Options: `ep.context_enable`, `ep.context_file_path` and
   `ep.context_embed_mode`
   2. Provider Options: `trt_dump_ep_context_model`,
   `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
   3. The settings above map 1:1, and provider options take
   priority over session options.
    
```
    Please note that there are rules for using following context model related provider options:

     1. In the case of dumping the context model and loading the context model,
        for security reason, TRT EP doesn't allow the "ep_cache_context" node attribute of EP context node to be
        the absolute path or relative path that is outside of context model directory.
        It means engine cache needs to be in the same directory or sub-directory of context model.

     2. In the case of dumping the context model, the engine cache path will be changed to the relative path of context model directory.
        For example:
        If "trt_dump_ep_context_model" is enabled and "trt_engine_cache_enable" is enabled,
           if "trt_ep_context_file_path" is "./context_model_dir",
           - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
           - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
```    

2. Users can choose the name of the dumped "EP context" model by using
`trt_ep_context_file_path`; please see GetCtxModelPath() for more
details.

3. Added suggested comments from
https://github.com/microsoft/onnxruntime/pull/18217
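
The precedence rule and the engine cache path examples from item 1 can be sketched as follows. This is an illustrative Python model only, not the TRT EP code: the helper names are hypothetical, the option keys are copied from the text above, and paths are handled with `posixpath` for simplicity.

```python
import posixpath

# Session-option key -> provider-option key, as listed in item 1.
SESSION_TO_PROVIDER = {
    "ep.context_enable": "trt_dump_ep_context_model",
    "ep.context_file_path": "trt_ep_context_file_path",
    "ep.context_embed_mode": "trt_dump_ep_context_embed_mode",
}


def effective_context_config(session_options, provider_options):
    """Merge both sources; a provider option wins over its session-option twin."""
    merged = {}
    for session_key, provider_key in SESSION_TO_PROVIDER.items():
        if provider_key in provider_options:
            merged[provider_key] = provider_options[provider_key]
        elif session_key in session_options:
            merged[provider_key] = session_options[session_key]
    return merged


def engine_cache_dir(context_file_path, engine_cache_path):
    """Resolve the engine cache location relative to the context model directory."""
    if not engine_cache_path:
        return context_file_path
    return posixpath.join(context_file_path, engine_cache_path)
```

For the example in the note, `engine_cache_dir("./context_model_dir", "engine_dir")` yields `"./context_model_dir/engine_dir"`.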
2024-01-21 10:51:58 -08:00
Edward Chen
c8ce83967e
Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209)
2024-01-19 15:30:09 -08:00
Hector Li
6e17571f2f
Fix issue that the generated context cache model inputs/outputs order is not guaranteed (#19195)
### Description
Currently, the QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph. The inputs/outputs order of the partitioned graph is not guaranteed, and the EP has no view of the user's input model. The context cache model generation therefore has to move to a higher level, in GraphPartitioner, which has the view of the partitioned model.
This is also a breakdown of the PR for multi-partition support.
https://github.com/microsoft/onnxruntime/pull/18865
2024-01-19 15:16:17 -08:00
kunal-vaishnavi
a3ecb63267
Update LLaMA attention fusions (#19200)
### Description
This PR updates the LLaMA-2 attention fusions by adding the following.

- Loading the PyTorch model from Hugging Face with the `LlamaAttention`
class before exporting
- Updating the attention mask pattern matching to support another case

This PR also fixes [this
issue](https://github.com/microsoft/onnxruntime/issues/19040).

### Motivation and Context
Recent changes to Hugging Face's `transformers` library break the
existing pattern matching. Since the attention fusions aim to change the
graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to
`LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, ultimately
it does not matter what nodes comprise the `Set of Attention Nodes`
because they will all be removed and replaced by the `Attention Op` in
the end.

Therefore, it does not matter whether the `LlamaAttention` class or a
different attention class is used to load the PyTorch model before
exporting because the expected graphs after the attention fusions will
look identical no matter the attention class chosen. By loading the
PyTorch model with the `LlamaAttention` class instead of other attention
classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then
exporting it to ONNX, the existing pattern matching will continue to
work.
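
The argument above, that the fused result does not depend on which nodes sit between the two LayerNorm ops, can be shown with a toy model. This sketch treats a layer as a flat list of op names rather than an ONNX graph, and `fuse_attention` is a hypothetical helper, not the fusion code in the transformers tools.

```python
def fuse_attention(ops):
    """Replace every run of ops between two LayerNorm ops with one Attention op."""
    fused, i = [], 0
    while i < len(ops):
        fused.append(ops[i])
        if ops[i] == "LayerNorm":
            # Look for the next LayerNorm that closes this run of nodes.
            try:
                j = ops.index("LayerNorm", i + 1)
            except ValueError:
                j = -1
            if j > i + 1:
                fused.append("Attention")
                i = j
                continue
        i += 1
    return fused


# Two different "sets of attention nodes" fuse to the same result.
eager = ["LayerNorm", "MatMul", "Softmax", "MatMul", "LayerNorm"]
sdpa = ["LayerNorm", "MatMul", "Mul", "Where", "Softmax", "MatMul", "LayerNorm"]
same_result = fuse_attention(eager) == fuse_attention(sdpa)
```

The real fusion still has to recognize the node pattern, which is why loading the model with an attention class the existing pattern matching understands keeps it working.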
2024-01-19 11:09:24 -08:00
Xavier Dupré
eaf047c820
Increment year to 2024 in conf.py (python documentation) (#19107)
### Description
Update copyright in python documentation.
2024-01-19 19:36:19 +01:00
Adrian Lizarraga
28a16c223c
[QNN EP] Update QNN pipelines to use QNN SDK 2.18 by default (#19129)
### Description
Update QNN pipelines to use QNN SDK 2.18 by default



### Motivation and Context
Test with the latest version of QNN SDK by default.
2024-01-18 14:59:23 -08:00
Yulong Wang
d69b622ef4
[js/web] upgrade dependency packages version (#19193)
### Description
upgrade packages version.

```
# npm audit report

electron  23.0.0-alpha.1 - 23.3.13
Severity: moderate
ASAR Integrity bypass via filetype confusion in electron - https://github.com/advisories/GHSA-7m48-wc93-9g85
fix available via `npm audit fix --force`
Will install electron@28.1.4, which is a breaking change
node_modules/electron

get-func-name  <2.0.1
Severity: high
Chaijs/get-func-name vulnerable to ReDoS - https://github.com/advisories/GHSA-4q6p-r6v2-jvc5
fix available via `npm audit fix`
node_modules/get-func-name

semver  <=5.7.1 || 6.0.0 - 6.3.0 || 7.0.0 - 7.5.1
Severity: moderate
semver vulnerable to Regular Expression Denial of Service - https://github.com/advisories/GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - https://github.com/advisories/GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - https://github.com/advisories/GHSA-c2qf-rxjj-qqgw
fix available via `npm audit fix`
node_modules/cross-spawn/node_modules/semver
node_modules/global-agent/node_modules/semver
node_modules/semver
```
2024-01-18 13:45:42 -08:00
Yi Zhang
dc1fed7268
[Fix] Dual Cuda version isn't supported as expected in Linux Gpu pipeline (#19192)
### Description


### Motivation and Context
The dual CUDA version isn't supported as expected.

cuda 12 link

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1272235&view=logs&j=f2f63060-d9d6-52d0-adee-b97db5a9ab91
2024-01-18 13:26:26 -08:00
luoyu-intel
459c750b03
Update x64 template kernel library for 'sqnbitgemm' (#19016)
### Description
1. Make JBLAS codes an external module of ORT.
2. Move q4 gemm code to contrib_ops.
3. Update template kernel library to v0.1 release.


### Motivation and Context
We found that the current LLM model performance is far below our
expectations. Here is some performance data collected on Mistral-7B
model with Xeon-8480:

| 8 threads | prompt length=32 past_len=32 | prompt length=1 past_len=32 |
| -- | -- | -- |
| ORT-main | 1220ms | 263ms |
| Neural-speed | 564ms | 87ms |
| ORT-this PR | 597ms | 120ms |

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code,
there is a 33ms (87ms vs. 120ms) latency gap between the two frameworks.
Statistical analysis shows that the summed latency of `MatMulNBits`
is 86.7ms, while the summed latency of all int4 GEMMs in `Neural-speed`
is 84.8ms. So other OPs introduce an extra 30ms of latency.

The performance of MatMulNBits in this PR meets our expectations.
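
The figures quoted above can be cross-checked with a few lines of arithmetic. The numbers are copied from the table and the paragraph above; the variable names are only for illustration.

```python
# (prompt length=32, prompt length=1) latencies in ms, from the table above.
latency_ms = {
    "ORT-main": (1220, 263),
    "Neural-speed": (564, 87),
    "ORT-this PR": (597, 120),
}

# Speedup of this PR over ORT-main for each column.
prompt_speedup, decode_speedup = (
    main / pr for main, pr in zip(latency_ms["ORT-main"], latency_ms["ORT-this PR"])
)

# Remaining gap to Neural-speed in the prompt length=1 column.
gap_ms = latency_ms["ORT-this PR"][1] - latency_ms["Neural-speed"][1]

# Part of that gap explained by the int4 GEMM kernels themselves.
kernel_gap_ms = 86.7 - 84.8

# What is left must come from the other operators.
other_ops_gap_ms = gap_ms - kernel_gap_ms
```

This gives roughly a 2.0x and 2.2x speedup over ORT-main, and about 31ms of the 33ms gap attributable to operators other than `MatMulNBits`, consistent with the "extra 30ms" estimate.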

### Remaining Issues
1. For hybrid CPUs, like the Core 12900K, the ONNXRuntime thread pool uses
TaskGranularityFactor to scale its number of threads. Our code design
does not expect this, and it may slow down hybrid CPU performance
by 30~40%.
2. Prepack uses a single thread, which makes session initialization very slow.
3. MatMulNBits with zero points will fall through to COMP_FP32 even when
accuracy_level=4. Our COMP_INT8 IGemmCore process with zero points is
not optimized for now. It will be updated in the future. So, for an int4
model with zero points, it makes no difference whether the
accuracy_level is 0 or 4.
2024-01-18 13:16:34 -08:00
Guenther Schmuelling
dd2177c5d7
enable webnn in ci build (#19163)
### Description



### Motivation and Context
2024-01-18 13:11:47 -08:00
Hector Li
dadd3ea704
Check the ep_cache_context and don't allow access outside the directory (#19174)
### Description
Check the ep_cache_context node property for EPContext node, and don't
allow relative path like "../file_path"
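
A check of this kind can be sketched in a few lines. The following Python is an illustration only, not the ORT implementation: `is_allowed_cache_path` is a hypothetical name, paths are treated as POSIX-style strings, and symlinks are not resolved.

```python
import posixpath


def is_allowed_cache_path(ep_cache_context: str) -> bool:
    """Accept only relative paths that stay inside the context model directory."""
    if posixpath.isabs(ep_cache_context):
        return False
    # Collapse "." and ".." segments, then reject anything that escapes upward.
    normalized = posixpath.normpath(ep_cache_context)
    return normalized != ".." and not normalized.startswith("../")
```

Normalizing first matters because a path such as `"a/../../file_path"` does not start with `..` but still leaves the directory.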
2024-01-18 11:11:14 -08:00
Jian Chen
9da3e36138
Fix buildJava from Zip-Nuget-Java-Nodejs Packaging Pipeline (#19187)
### Description



### Motivation and Context
2024-01-17 17:20:42 -08:00
Yulong Wang
f87e69801f
[js/web] show warning when numThreads is set but threads is not supported (#19179)
### Description
Show a warning when numThreads is set but threads are not supported.
Resolves #19148, #18933

for web: when crossOriginIsolated is false.
for node: always disable.
2024-01-17 15:04:22 -08:00
Yulong Wang
146ebaf91e
[js/web] allow proxy to load model with 1GB <= size < 2GB (#19178)
### Description

allow proxy to load model with 1GB <= size < 2GB

resolves #19157.
2024-01-17 15:03:43 -08:00
Maximilian Müller
bc219ed553
[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052)
Addresses https://github.com/microsoft/onnxruntime/issues/18542.
I followed the advice given by @RyanUnderhill
[here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925)
and went with a minimal CUDA EP for now.
2024-01-17 11:33:34 -08:00
Rachel Guo
bd9d8fb2a5
[ORT 1.17.0 release] Bump up version to 1.18.0 (#19170)
### Description

Bump up version to 1.18.0 since the release branch has been cut.

### Motivation and Context

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-01-17 11:18:32 -08:00
Xavier Dupré
63dd605d33
Fix untyped float values in quantization tool missing from PR #18043 (#19182)
### Description
Extends the code coverage to the Entropy, Histogram and Distribution
calibration methods, and fixes bugs found while doing so.



### Motivation and Context
Bugs detected in [Olive](https://github.com/microsoft/OLive).
2024-01-17 19:00:36 +01:00
wejoncy
9876cc7c4f
more inputs support for LLM exporter (#19005)
### Description



### Motivation and Context
2024-01-17 15:46:19 +08:00
Wanming Lin
07d3aed3aa
[WebNN EP] Fixed build issue with disable_rtti (#19173)
Previously, building the WebNN EP with --disable_rtti threw an
unboundTypeError, since unbound type names are illegal with RTTI disabled
in the Embind API. We can fix it by adding the
-DEMSCRIPTEN_HAS_UNBOUND_TYPE_NAMES=0 flag.
2024-01-16 21:35:13 -08:00
Changming Sun
81d363045b
Upgrade Ubuntu machine pool from 20.04 to 22.04 (#19117)
### Description
Upgrade Ubuntu machine pool from 20.04 to 22.04
2024-01-16 17:25:18 -08:00
Hector Li
e61861b0a1
Clean up generated files in QNN UTs (#19127)
### Description
Clean up generated files in QNN UTs
2024-01-16 16:36:28 -08:00
moyo1997
c935c8fbd2
remove unnecessary environment variable (#19166)
remove unnecessary environment variable when building as arm64x
2024-01-16 16:24:37 -08:00
Jian Chen
8e272b9cac
Update build.py to remove unused functions and update python to 3.8 (#19164)
### Description



### Motivation and Context
2024-01-16 13:53:15 -08:00
Patrice Vignola
80f274ca6f
Fix SkipLayerNormalization shape inference (#18724)
SkipLayerNorm has more than one input, so `propagateShapeAndTypeFromFirstInput` is not enough.
2024-01-16 09:42:59 -08:00
Changming Sun
e2e488d6f8
Revert "iOS packaging pipeline stability" (#19135)
Reverts microsoft/onnxruntime#19097 because it broke the Android CI
pipeline.
2024-01-16 09:18:35 -08:00
Jian Chen
c92f72ebeb
Merge Linux Nuget GPU pipeline with zip-nuget (#19120)
### Description



### Motivation and Context
2024-01-16 08:59:03 -08:00
Jeff Bloomfield
8d4369b77e
Update DirectML nuget version to 1.13.1 (#19122)
### Description
Update DML version to 1.13.1



### Motivation and Context
2024-01-15 19:04:41 -08:00
Wanming Lin
1bab98988b
[WebNN EP] Fixed bug in int8 data type processing (#19134)
2024-01-15 18:44:25 -08:00
Guenther Schmuelling
9dee543bed
fix gemm beta for fp16 (#19153)
Per the ONNX spec, beta is always fp32, so we need to cast it.
2024-01-15 18:40:38 -08:00
Jeff Bloomfield
9f87c5c41d
Fix build error due to merge with DML adapter enumeration macro defined (#19121)
### Description
Fix build error when ENABLE_NPU_ADAPTER_ENUMERATION is defined


### Motivation and Context
2024-01-15 17:10:58 -08:00
pengwa
1150b1f81e
ORTModule memory improvement (#18924)
## Dependency

https://github.com/microsoft/onnxruntime/pull/19007

## ORTModule memory efficient gradient management

Previously I tried to solve the coarse-grained gradient
accumulation/update problem in ORTModule with
https://github.com/microsoft/onnxruntime/pull/8979, but that
solution was not fully validated with DDP or when there are user
hooks on the gradient accumulation of a torch parameter.

This PR addresses the problem with an approach similar to PR 8979,
that is, it triggers gradient accumulation once ORT has computed the
grad. Instead of using an AccumulateGrad op, this time it uses an ONNX
operator, PythonOp, which internally calls param.backward(grad) and so
handles all related hooks correctly.
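
Why routing the gradient through `backward` matters can be seen in a toy model. The `Param` class below is a hypothetical stand-in written for this sketch, not a torch or ORTModule type; it only mimics the idea that hooks run when a gradient flows through `backward` and are skipped when `.grad` is written directly.

```python
class Param:
    """Toy stand-in for a framework parameter with gradient hooks."""

    def __init__(self):
        self.grad = None
        self._hooks = []

    def register_hook(self, fn):
        self._hooks.append(fn)

    def backward(self, grad):
        # Every hook sees the incoming gradient and may replace it.
        for hook in self._hooks:
            result = hook(grad)
            if result is not None:
                grad = result
        # Accumulate, as repeated backward calls would.
        self.grad = grad if self.grad is None else self.grad + grad


seen = []
p = Param()
p.register_hook(seen.append)        # observing hook, e.g. for DDP-style bookkeeping
p.register_hook(lambda g: g * 0.5)  # modifying hook
p.backward(2.0)
p.backward(4.0)

q = Param()
q.register_hook(seen.append)
q.grad = 2.0  # writing the grad directly bypasses the hooks entirely
```

After this runs, `p` has accumulated the hook-modified gradients and `seen` holds only the gradients that went through `backward`.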


## Design

Check the details from


https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ

## Convergence Validation:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/ccf3a213-e815-4b23-b759-165033b2d9fe)

Differences are mostly on the order of 0.000x, sometimes 0.00x, which
may come from the different order in which gradients are applied before
and after this change (on DeepSpeed ZeRO stage 2).


## TODO

Consolidate the logic with Stage3's similar logic.
2024-01-16 08:57:37 +08:00
Adam Pocock
191525301f
[java] Updating TensorInfo so it contains the named dimensions (#18962)
### Description
The Java `TensorInfo` object which is used to describe a tensor's shape,
along with the input and output placeholders for a model couldn't show
any symbolic/named dimensions in that tensor. Now this information is
stored in Java strings on construction and included in the toString.

### Motivation and Context
Setting symbolic dimensions required external information in Java; the
names were not discoverable from within the API.
2024-01-15 14:42:50 -08:00
Ben Niu
a97199c62d
Fix Arm64EC build for test_q4qdq.cpp (#18523)
### Description
Fix ifdef guards in test_q4qdq.cpp to exclude code blocks intended only
for native x64 compilation instead of x64 + Arm64EC.
2024-01-15 14:29:19 -08:00
Yi Zhang
922a2f00e3
Extend timeout in Nuget-CUDA-Packaging-Pipeline (#19138)
### Description



### Motivation and Context
Linux_GPU_x64 job in the pipeline has been canceled due to timeout since
0112.
2024-01-15 14:37:22 +08:00
Scott McKay
b2ce3eedb9
Fix build error for CoreML Split op (#19099)
### Description
The `split` input of the Split op is int64_t. Fixing that resolves a
type mismatch build error on Windows when CoreML is enabled (for
debugging the partitioning code).

### Motivation and Context
Fix build error

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-01-15 15:09:49 +10:00
Adam Pocock
71657d1eb8
[java] Fix double close (#19133)
### Description
The `OnnxValue` and `OrtProviderOptions` implementations now check to
see if they've been closed before accessing the native pointer, and also
before close is called.

### Motivation and Context
Previously they could be closed twice, which SIGSEGV'd the JVM. Fixes #19125.
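
The guard pattern described here can be sketched as follows. The actual change is in the Java API; this Python version is an illustration only, `NativeHandle` is a hypothetical class, and treating a second `close` as a no-op (rather than raising) is an assumption.

```python
class NativeHandle:
    """Wraps a native pointer and guards against use after close."""

    def __init__(self, pointer):
        self._pointer = pointer
        self._closed = False
        self.release_count = 0  # counts calls into the (simulated) native release

    def pointer(self):
        # Refuse to hand out the pointer once the native side has been freed.
        if self._closed:
            raise RuntimeError("handle has been closed")
        return self._pointer

    def close(self):
        # A second close must not release the native memory again.
        if self._closed:
            return
        self._closed = True
        self.release_count += 1
```

Calling `close()` twice on the same handle releases the underlying resource only once, which is the property that prevents the crash.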
2024-01-14 14:53:26 -08:00
Jian Chen
c3ce9df80c
Disabling python3.12 on training python packaging pipelines (#19123)
2024-01-14 14:51:00 -08:00
Jian Chen
76797127d6
Always download cuda and trt libraries from Azure blob (#19118)
### Description
This way, we will not need to update the Windows images constantly, and
we gain more flexibility to choose the CUDA version in the future.
2024-01-14 11:37:26 -08:00
Changming Sun
bb4011b2b1
Set default flags nvcc and do not set default compile flags for ROCM EP (#19124)
### Description
Set default flags for nvcc and do not set the flags for ROCM EP.


### Motivation and Context
1. To meet a BinSkim requirement for CUDA EP.

https://github.com/microsoft/binskim/blob/main/docs/BinSkimRules.md#rule-BA2024EnableSpectreMitigations

2. The ROCM EP's pipeline has been broken since PR #19073. Unit tests failed
to load the EP with the following error message:

Failed to load library libonnxruntime_providers_rocm.so with error:
/build/Release/libonnxruntime_providers_rocm.so: undefined symbol:
vtable for onnxruntime::InsertMaxPoolOutput .

This PR is a hot fix to bring the pipeline back. So far I don't know why
the error happened. The symbol "InsertMaxPoolOutput" is in
onnxruntime_optimizers. I don't see any EP code references it directly.
2024-01-14 11:36:49 -08:00
Yulong Wang
f917dde717
[web] remove xnnpack from web backends (#19116)
### Description
XNNPACK is already disabled in the WebAssembly build. This change removes
the xnnpack backend registration in JS.
2024-01-13 23:04:02 -08:00
Edward Chen
e1e45901e2
iOS packaging pipeline stability (#19097)
- Remove protoc build step which sometimes times out. Download protoc instead.
- Use macOS-12 image in the set variables stage. It seems more stable.
2024-01-13 19:27:44 -08:00
Changming Sun
5558912d7b
Disable ccache in Windows CPU CI pipeline (#19131)
### Description
Disable ccache for all the jobs in the Windows CPU CI pipeline.
Before disabling it, the build has a warning that:

"MSIL .netmodule or module compiled with /GL found; restarting link with
/LTCG; add /LTCG to the link command line to improve linker performance"

After disabling it, the warning is gone and the build doesn't use /GL or
/LTCG.

Cache itself should not cause this difference. 

### Motivation and Context
2024-01-13 18:40:43 -08:00
Adrian Lizarraga
65893ef382
Add --parallel to QNN EP NuGet pipeline build command (#19126)
### Description
Add --parallel to QNN EP NuGet pipeline build command

### Motivation and Context
Improve build times for pipeline.
2024-01-13 02:38:40 -08:00
Yang Gu
e803f8eb0f
[js/webgpu] Refactor timestamp-query and introduce timestamp-query-inside-passes (#18894)
We submit kernels in batches (a fixed size of 16 is used, except for the
last batch) for better performance. However, timestamp query support is
at the pass level, so the previous implementation disabled batch
execution in profiling mode. In fact, a batch can contain multiple
passes, so we don't have to disable batch execution; this is the first
enhancement of this PR.
Furthermore, WebGPU has an extension to support timestamp queries inside
passes, which isn't supported by all platforms (e.g., Windows
supports it, while macOS doesn't). This is expected to have a lower cost
than the multiple-passes solution, so this PR also introduces this
support when available.
This PR also refactors some implementation related to kernelInfo and
tries to unify the related kernel names.
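
The batching scheme mentioned above (fixed batches of 16, with a shorter final batch) can be sketched like this. It is a Python illustration of the grouping only, not the js/webgpu code; `batches` is a hypothetical helper.

```python
def batches(kernels, batch_size=16):
    """Yield consecutive groups of batch_size kernels; the last may be shorter."""
    for start in range(0, len(kernels), batch_size):
        yield kernels[start:start + batch_size]


# 40 queued kernels are submitted as three batches.
batch_sizes = [len(batch) for batch in batches(list(range(40)))]
```

In profiling mode each kernel inside a batch can still live in its own pass, so timestamps are collected per pass without giving up the batched submission.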
2024-01-13 00:23:17 -08:00
Jian Chen
78e796bb27
Fixing issue where unzipped package from 'onnxruntime-win-x64-gpu' was also uploaded. (#19096)
### Description
Fixing issue where the unzipped package from 'onnxruntime-win-x64-gpu' was
also uploaded.


For example,
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=396440&view=artifacts&pathAsName=false&type=publishedArtifacts
2024-01-12 22:30:43 -08:00