10965 commits

Author SHA1 Message Date
maggie1059
dfd4bce36e
Use compute queues by default in DML EP (#20438)
### Description
We originally used compute queues only for compute-only devices; this
change sets the default for DX12 devices to use compute queues as well.



### Motivation and Context
There have been issues with TDRs occurring when using the current
default queues, which don't happen on compute queues.
2024-04-24 10:44:16 -07:00
Xavier Dupré
f78215adad
Fix quantization tools for issue #19529 (#19591)
### Description
Fix issue #19529: the code was using a loop variable outside its loop.
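
The PR text does not show the offending code. As a hypothetical illustration of this class of bug in Python (the function and variable names below are invented for the example and are not taken from the quantization tools), a loop variable read after its loop has ended silently holds the last value it was assigned:

```python
def collect_names_buggy(groups):
    """Hypothetical bug pattern: `name` is read after its loop has ended."""
    names = []
    for group in groups:
        for name in group:
            pass  # the inner loop only iterates
        # `name` still holds the last item of `group`, so earlier items are lost
        names.append(name)
    return names


def collect_names_fixed(groups):
    """Same traversal, with the use of `name` moved inside its loop."""
    names = []
    for group in groups:
        for name in group:
            names.append(name)
    return names


if __name__ == "__main__":
    groups = [["a", "b"], ["c"]]
    print(collect_names_buggy(groups))
    print(collect_names_fixed(groups))
```

Python does not scope loop variables to the loop body, so this kind of mistake raises no error unless the loop never runs.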
2024-04-24 19:16:27 +02:00
Scott McKay
a46bab6364
Update podspec url to use AFD hostname (#20452)
Update to use AFD url when generating podspec
2024-04-24 09:37:24 -07:00
Satya Kumar Jandhyala
ae78cdb5d7
[JS/WebGPU] MultiheadAttention bugfix (#20447)
### Description
Fixed pastkey, key and pastvalue, value concatenation condition and
fixed index error. Added new test cases.



2024-04-24 08:43:14 -07:00
Guenther Schmuelling
33d5ea39b3
[js/webgpu] fixes for fp16 attention (#20440)
2024-04-24 08:01:28 -07:00
Xavier Dupré
80213a9e66
Add implementation for ScatterND (#19540)
### Description
onnxruntime switches to CPU for ScatterND after opset 13. This extends
the implementation to higher opsets.
2024-04-24 14:08:50 +02:00
Rachel Guo
14fcf0a52d
Support visionos build (#20365)
### Description

This PR supports a build of onnxruntime.xcframework for xros/xrsimulator
for visionos via the build command of

`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.

Officially including visionos in the iOS CocoaPods package and testing it
in CI would require separate work to upgrade the Xcode version and
upgrade the macOS CI agent to macos-13-arm64 or higher.

### Motivation and Context

visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2024-04-23 18:15:07 -07:00
Adam Louly
4ce7bbf6f1
Add LayerSpec Support to ORTPipelineModule (#20410)
### Description
In Deepspeed's Pipeline Parallel Implementation, there is a class used
to instantiate the object after it's moved to the device and assigned in
a stage.

This approach helps reduce peak memory usage. 

In this PR, we're adding support to ORT for wrapping this LayerSpec.
2024-04-23 17:57:08 -07:00
Yulong Wang
5055dc0aa8
[js/web] add diagnose log for chrome (#20439)
### Description

Add logs to further diagnose the pipeline issue.
2024-04-23 17:18:54 -07:00
Maximilian Müller
b4e50758c0
Fix shape conv fuse opt (#20282)
Fix:
- Multiple Convs feeding into an Add+Relu were fused even though the intermediates are needed

![image](https://github.com/microsoft/onnxruntime/assets/44298237/0c85a30c-5f41-4e62-ae2e-f41eada6c2c3)
- Also fixes an issue with Shape Initializers Merge as input, that
occurs when the input initializer is the same across multiple nodes but
not all nodes are Shape nodes.
2024-04-23 16:19:57 -07:00
Yulong Wang
8f53957bcf
[js/web] add "browser" field to support parcel v2 (#20422)
### Description

As described in the latest discussion in #19915, parcel v2 without the
[new resolver](https://parceljs.org/blog/v2-9-0/#new-resolver) will not
work correctly with onnxruntime-web. There are still users who use
parcel with the default resolver, so add the deprecated "browser" field
back for backward compatibility. This PR also corrects the "main" field,
which the old resolver uses for Node.js.
2024-04-23 13:10:11 -07:00
Yulong Wang
13bda11583
[Node.js binding] Fix install script (#20416)
### Description
Fix a few bugs in the install script of the onnxruntime-node package.

This change is integrated from branch `rel-1.17.3` (#20397)
2024-04-23 13:01:16 -07:00
Satya Kumar Jandhyala
d42ac7f0c6
[JS/WebGPU] Multihead attention improvements (#20286)
### Description
Enabled more use cases.



2024-04-23 12:39:49 -07:00
Edward Chen
76461c8f4d
Increase timeout for iOS packaging pipeline jobs. (#20434)
2024-04-23 11:55:55 -07:00
Guenther Schmuelling
b8e6684313
more conservative gpu-buffer cache algo (#20312)
tuned based on 80 models to keep performance impact minimal
2024-04-23 09:07:04 -07:00
guyang3532
ffb9c8d598
fix embedding sparsity log bug of -1% density (#20420)
### Description
When valid embedding sparsity was not checked, the log printed an
incorrect "-1% density" message. This PR fixes it.
2024-04-23 20:37:50 +08:00
Scott McKay
ed6f1adcb8
Fix overflow causing test failure on x86 (#20425)
### Description
Fix comparison that was not updated when the threshold was converted to
bytes.


### Motivation and Context
Fix CI failure
2024-04-23 21:33:59 +10:00
Maximilian Müller
5eae33fc6b
[CUDA EP] RNN check if tf32 is allowed (#20338)
Respect the use_tf32 flag.
2024-04-23 00:19:09 -07:00
Yi Zhang
7ebc653f04
Revert "Nuget .NET changes for Mac Catalyst (#19923)" (#20418)
This reverts commit f396748ed6.

2024-04-23 15:08:12 +08:00
Adrian Lizarraga
e6a677f6b7
[QNN EP] Download QNN SDK from azure blob in packaging pipelines (#20359)
### Description
- Updates Windows QNN Nuget and Python packaging pipelines to download
QNN SDK from blob storage.
- Makes the QNN SDK version configurable when launching the python
packaging pipeline.



### Motivation and Context
Removes the need to rebuild images to update QNN SDK. Only applies to
Windows pipelines. Linux pipelines still get the SDK from disk.
2024-04-22 22:32:55 -07:00
aciddelgado
94c69f55d4
GQA 4 CPU (#20299)
### Description
Support GQA operator on CPU with FP32.



### Motivation and Context
Right now, models generated for CPU and GPU must be different. GQA CPU
allows these models to be the same.
2024-04-22 19:57:05 -07:00
Scott McKay
c47a6ce70b
XNNPACK: Support 1D input for Conv and ConvTranspose (#20349)
### Description
Support 1D input to XNNPACK Conv and ConvTranspose by using a fake
height of 1 to convert it to 2D input.
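
The fake-height trick can be sketched in plain Python. This is a minimal illustration of the idea only, not the XNNPACK EP code: the helper names are hypothetical, and it covers a single channel with no padding, stride, or dilation.

```python
def conv1d(x, w):
    """Valid 1D cross-correlation of a list x with a kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv2d(x, w):
    """Valid 2D cross-correlation of a list-of-rows x with a 2D kernel w."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [sum(x[i + a][j + b] * w[a][b] for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def conv1d_via_2d(x, w):
    """Run a 1D convolution through the 2D routine using a fake height of 1."""
    # Wrap input and kernel in a single row, then drop the fake row afterwards.
    return conv2d([x], [w])[0]


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    w = [1, 0, -1]
    print(conv1d(x, w))
    print(conv1d_via_2d(x, w))
```

Because the added dimension has size 1 in both the input and the kernel, the 2D routine performs exactly the same multiply-adds as the 1D one.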

### Motivation and Context
Enable speech model with 1D input to use XNNPACK. There is no CPU EP
quantized ConvTranspose, so this fills that gap.
2024-04-23 11:50:31 +10:00
Edward Chen
3270a002fa
Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. (#20379)
Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. The ORT format model doesn't contain information about kMSInternalNHWCDomain since it is set during layout transformation. Fall back to known domains instead.
2024-04-22 18:34:01 -07:00
Preetha Veeramalai
c7de4de501
OVEP Bug fix 1.18 (#20408)
### Description
Contains critical bug fix 



### Motivation and Context
This PR handles the bug fix wrt OV caching and blob generation.
This also handles the precision for AUTO plugin.

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
2024-04-22 18:31:05 -07:00
pengwa
a7787a0bad
Introduce memory efficient topological sort (#20258)
### Introduce memory efficient topo sort (for training)

~~and lazily initialize the Priority-Based and Memory-Efficient topo sorts.
In most cases they are not needed, so this avoids the overhead of
GraphViewer construction for most use cases.~~

2024-04-23 08:00:23 +08:00
Scott McKay
9372e9a0a3
Support >2GB of Tensor data in training checkpoint (#20077)
### Description
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.
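
As a rough sketch of the size-based decision described above: the threshold constant, the function name, and the all-or-nothing policy are assumptions made for illustration, since the message only says "~2GB" and does not show the actual checkpoint code.

```python
# Assumed limit: the commit message only says "~2GB"; the exact value is a guess.
EXTERNAL_DATA_THRESHOLD_BYTES = 2 * 1024**3


def plan_checkpoint_storage(tensor_sizes_bytes, threshold=EXTERNAL_DATA_THRESHOLD_BYTES):
    """Hypothetical helper: decide where each tensor's raw data is stored.

    tensor_sizes_bytes maps tensor name -> size in bytes. If the total exceeds
    the threshold, every tensor is marked "external"; otherwise "inline".
    """
    total = sum(tensor_sizes_bytes.values())
    location = "external" if total > threshold else "inline"
    return {name: location for name in tensor_sizes_bytes}


if __name__ == "__main__":
    sizes = {"weight": 3 * 1024**3, "bias": 4096}
    print(plan_checkpoint_storage(sizes))
```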

I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).


0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)

Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. As it's an extra field it's
backwards compatible.

Please feel free to suggest alternative approaches. 

Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated by running:
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema which I thought was the correct
way but maybe that's out of date.

I think you can ignore all the diffs in the generated files and just
worry about the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically start at the bottom of
the files changed and work up as all the 'real' diffs are there.


---------

Co-authored-by: carzh <wolfivyaura@gmail.com>
2024-04-22 15:17:43 -07:00
Yulong Wang
4385602386
[js/web] fix test runner with optional input/output (#20399)
### Description
fix test runner with optional input/output.

This change fixes the OP test runner (.jsonc format test) with optional
input(s) and/or output(s).

This fix reveals a problem with handling optional outputs:

> Take SkipSimplifiedLayerNorm as example: 
>
> if in the ONNX model, the node's outputs are: [ 'output_0', '' ]
instead of [ 'output_0' ], the current implementation will fail. The
difference is, in the first case, context.outputCount == 2, and then the
typescript implementation will try to create a tensor for output[1]. It
will eventually call to C++ function (OpKernelContext::Output), and the
output.DataRaw() will be nullptr. WebGPU backend will fail because it
cannot deal with a TensorView with data == 0.
>
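
The difference between the two output lists can be sketched as follows. In ONNX, an empty name marks an optional output that is omitted; the helper names here are hypothetical and are not part of the test runner or the WebGPU backend.

```python
def declared_output_count(output_names):
    """Number of output slots the node declares, including omitted ones."""
    return len(output_names)


def present_output_indices(output_names):
    """Indices of outputs that actually need a tensor (non-empty names)."""
    return [i for i, name in enumerate(output_names) if name != ""]


if __name__ == "__main__":
    for outputs in (["output_0"], ["output_0", ""]):
        print(outputs, declared_output_count(outputs), present_output_indices(outputs))
```

Both lists have one present output, but the second declares two slots, which is the case where an implementation that allocates one tensor per declared slot runs into trouble.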

This problem may need to be fixed or worked around in a separate PR. This PR
does not fix it. Failing test cases are modified to work; note that this
PR does not break those test cases, as they never worked.
2024-04-22 12:53:10 -07:00
aamajumder
d0e33d2078
[DML EP] Register opset 20 operators (#20092)
### Description
This PR registers the following opset 20 operators to the DML EP:
-IsNaN-20
-IsInf-20
-ReduceMax-20


2024-04-22 12:01:59 -07:00
Yi Zhang
197b3f1d90
Enable Whisper Test with OMP_FFMPEG (#20402)
### Description
Install OMP_FFMPEG in the Docker image and re-add the Whisper test.
OMP_FFMPEG is downloaded from a restricted-access Azure blob.
2024-04-22 10:55:56 -07:00
Yulong Wang
a457c1df80
upgrade emsdk to 3.1.57 (#20295)
### Description
upgrade emsdk to 3.1.57
2024-04-19 23:05:18 -07:00
Adrian Lizarraga
77b7619a3d
[QNN EP] Support float16 BatchNormalization on the HTP backend (#20391)
### Description
- Adds support for float16 BatchNormalization to the HTP backend.
- Fixes float32 support for BatchNormalization on the HTP backend when
`enable_htp_fp16_precision` is enabled.


### Motivation and Context
Support more models on the QNN HTP backend.
2024-04-19 21:49:39 -07:00
Patrice Vignola
8fbb8a149f
[DML EP] Add MatMulNBits (#20308)
2024-04-19 15:05:37 -07:00
Hector Li
55e0aaeeef
fix android build issue (#20389)
fix android build issue
2024-04-19 14:21:34 -07:00
Rachel Guo
f396748ed6
Nuget .NET changes for Mac Catalyst (#19923)
### Description

Add Nuget package changes for adding new 'net6.0-maccatalyst' platform.

The output ORT Nuget package was manually tested and verified in a .NET
MAUI app setup.


---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2024-04-19 14:20:03 -07:00
Guenther Schmuelling
497a627a69
fix fp16 for skiplayernorm (#20381)
2024-04-19 12:12:02 -07:00
Dmitri Smirnov
42b700d463
Eliminate stray vector and the contention it creates (#20377)
### Description
Unused vector allocating large memory chunk within a concurrent routine
creates heap contention and is eliminated.

### Motivation and Context
This partially addresses
https://github.com/microsoft/onnxruntime/issues/20373.
2024-04-19 10:27:42 -07:00
Patrice Vignola
4d98f06f93
[DML EP] Add GroupQueryAttention (#20327)
2024-04-19 10:25:29 -07:00
Wanming Lin
7c80c39f74
[WebNN EP] WebNN CPU backend only support up to 4 Split outputs (#20350)
2024-04-19 08:31:22 -07:00
sfatimar
4d1963c2a2
OpenVINO EP Rel 1.18 Changes (#20337)
### Description
These changes include:
- Support for OpenVINO 2024.1
- Import PreCompiled Blobs with EPContext Blob
- Separate Device/Precision as input
- Deprecate the CPU_FP32 and GPU_FP32 terminology; introduce CPU and GPU
- AUTO GPU, CPU will only create a GPU Blob and not a CPU Blob



### Motivation and Context
- OpenVINO 2024.1 will be out soon
- Import Precompiled Blob can greatly reduce FEIL/FIL Time. 
- Separating Device/Precision will make the input cleaner

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2024-04-19 00:31:38 -07:00
Yueqing Zhang
9001c69b84
[VitisAI] Add Version Check. Requested by Microsoft (#20347)
### Description
Add a version to onnxruntime_providers_vitisai.dll so that
onnxruntime_vitisai_ep.dll can check whether the version is compatible.
To make sure the old onnxruntime_vitisai_ep.dll still works, we
offset the API struct by the version field.


### Motivation and Context
This is a direct request from Microsoft. The following is the problem
we are trying to solve:

How would you describe the dependency between (a)
onnxruntime_vitisai_ep.dll and (b) onnxruntime_providers_vitisai.dll?
E.g. for each version of (a) there is a minimum required version of (b),
or for each version of (b) there is minimum required version of (a).

Please note that in practice we won't be able to use the exact version
of ORT/EP that you tested against (because we might need to update ORT
for other reasons), but we might be able to accommodate some version
constraints that you specify. As we approach shipping, we'll lock the
version of ORT/EP to allow for stabilization and more detailed testing
(and work with you if it needs to be updated).
2024-04-18 23:05:44 -07:00
Patrice Vignola
12569626cb
Update DML to 1.14.1 (#20380)
2024-04-18 22:43:41 -07:00
Patrice Vignola
b8c90beef2
[DML EP] Add SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20326)
2024-04-18 22:17:31 -07:00
Chi Lo
a747a00cd3
[TensorRT EP] Use protobuf with debug build on Windows (#20378)
TRT EP implicitly uses oss_parser with debug build on Windows, therefore
it should use protobuf rather than protobuf-lite.
2024-04-18 19:39:08 -07:00
Patrice Vignola
745b426c60
[DML] Update DML to 1.14 (#20304)
I am prefiring this change to pre-run the non-dml checks, and also to
give folks the time to review it before DML gets released. When DML 1.14
officially releases, we'll only need to run the DML pipeline to
automatically pick up the nuget package. This should save us some
valuable time.

Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15
will come soon after.
2024-04-18 16:22:57 -07:00
Adrian Lizarraga
e4c0cb2b9a
[Quant tool] Do not default to contrib Q/DQ ops for 16-bit (#20376)
### Description
Updates the QDQ quantizer to use ONNX Q/DQ ops for 16-bit quantization
if opset >= 21.

### Motivation and Context
The QDQ quantizer previously set the 'com.microsoft' domain on inserted
Q/DQ ops when the model needed 16-bit support. ONNX 1.16.0 added
int16/uint16 support to the QuantizeLinear and DequantizeLinear
operators, so we can change the default behavior.
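
The new default can be sketched as a small decision function. The function name and signature are hypothetical and do not reflect the QDQ quantizer's actual API. It assumes the contrib domain is still used for 16-bit types below opset 21, which the description implies but does not state.

```python
ONNX_DOMAIN = ""  # the default ONNX operator domain is the empty string
MS_DOMAIN = "com.microsoft"


def qdq_op_domain(opset, needs_16bit):
    """Hypothetical helper: pick the domain for inserted Q/DQ ops.

    ONNX QuantizeLinear/DequantizeLinear handle int16/uint16 from opset 21,
    so the contrib domain is only assumed necessary for 16-bit types below that.
    """
    if needs_16bit and opset < 21:
        return MS_DOMAIN
    return ONNX_DOMAIN


if __name__ == "__main__":
    for opset, needs_16bit in [(19, False), (19, True), (21, True)]:
        print(opset, needs_16bit, repr(qdq_op_domain(opset, needs_16bit)))
```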
2024-04-18 15:26:07 -07:00
Chi Lo
a8f74e3ec7
[TensorRT EP] TensorRT 10 support (#20167)
This PR adds support for the INT64 tensor type for TRT 10.
This PR is also **compatible with TRT 8.6 and TRT 10**, meaning users can
build ORT TRT against TRT 8.6 or TRT 10.

The timelines for TRT 10 GA and the ORT 1.18 release are very tight (we
don't have enough time to get our CIs installed with TRT 10 GA libraries
and run the build/tests), and Nvidia's new Triton release (whose
timeline is also very close to that of TRT 10 GA) wants to
integrate the TRT EP with TRT 10.

Therefore, our approach is to get this PR into ORT 1.18 first, so
everything is fully tested with TRT 8.6 CIs, and users can still manually
build ORT 1.18 against TRT 10, as in the Triton case.

As for testing TRT 10, once TRT 10 GA is released, we will have another
branch that includes the changes in this PR plus whatever other changes
are needed, and we will update our CIs with TRT 10.
2024-04-18 14:03:04 -07:00
Yulong Wang
3577a4bd02
[Node.js binding] Allow installation to download CUDA binaries via script (#20364)
### Description
Currently we try to include all prebuilt binaries in the NPM packages.
This worked until we added libonnxruntime_providers_cuda.so
(>400MB) to the NPM package. The NPM registry refuses to accept the new
package because the file is too large.

To make the new NPM package work, we have to remove the large file
from the package and add a new script that runs on package installation.
This script will try to dynamically install the onnxruntime CUDA dynamic
library for Linux/x64.
2024-04-18 13:44:42 -07:00
Guenther Schmuelling
7b017cf9f8
fix web ci: csum tests need fp64 which is not supported on webgpu (#20374)
2024-04-18 12:30:26 -07:00
Adam Louly
ee74fb6908
Introducing ORTPipelineModule - DeepSpeed Parallel Pipeline Support. (#20287)
### Description
Introducing a new class ORTPipelineModule to handle wrapping layers in
DeepSpeed pipeline parallel.


### Motivation and Context
To support pipeline parallelism on ORTModule.

This PR includes initial support for DeepSpeed pipeline
parallelism.

- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
2024-04-18 11:30:15 -07:00
Sumit Agarwal
f664f91298
[DML EP] Expose NPU macro via build command (#20306)
### Description
This fixes the following:
- Expose the `ENABLE_NPU_ADAPTER_ENUMERATION` macro via the build command, so
that a user can enable NPU support for the DML EP seamlessly.
- Add the keyword `_dmlEp_` as part of the node name, which is useful
for debugging purposes.



2024-04-18 11:23:13 -07:00