### Description
We originally only use compute queues for compute-only devices; this
change sets the default for DX12 devices to use compute queues as well.
### Motivation and Context
There have been issues with TDRs occurring when using the current
default queues, which doesn't happen on compute queues.
### Description
Fixed pastkey, key and pastvalue, value concatenation condition and
fixed index error. Added new test cases.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
This PR supports a build of onnxruntime.xcframework for xros/xrsimulator
for visionos via the build command of
`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.
For officially include visionos in ios cocoapods package and testing in
CI, would require separate work for upgrading the Xcode version &
upgrade macOS CI agent to macos-13-arm64 or higher.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
In Deepspeed's Pipeline Parallel Implementation, there is a class used
to instantiate the object after it's moved to the device and assigned in
a stage.
This approach helps reduce peak memory usage.
In this PR, we're adding support to ORT for wrapping this LayerSpec.
### Description
As described in latest discussion in #19915, parcel v2 without using the
[new resolver](https://parceljs.org/blog/v2-9-0/#new-resolver) will not
work correctly with onnxruntime-web. There are still users who uses
parcel with default resolver, so add this deprecated field "browser"
back for backward compatibility. This PR also corrects the "main" field,
which is for old resolver for Node.js.
### Description
Enabled more usecases
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix comparison that was not updated when the threshold was converted to
bytes.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI failure
This reverts commit f396748ed6.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Updates Windows QNN Nuget and Python packaging pipelines to download
QNN SDK from blob storage.
- Makes the QNN SDK version configurable when launching the python
packaging pipeline.
### Motivation and Context
Removes the need to rebuild images to update QNN SDK. Only applies to
Windows pipelines. Linux pipelines still get the SDK from disk.
### Description
Support GQA operator on CPU with FP32.
### Motivation and Context
Right now, models generated for CPU and GPU must be different. GQA CPU
allows these models to be the same.
### Description
<!-- Describe your changes. -->
Support 1D input to XNNPACK Conv and ConvTranspose by using faking
height of 1 to convert to 2D input.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable speech model with 1D input to use XNNPACK. There is no CPU EP
quantized ConvTranspose, so this fills that gap.
Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. The ORT format model doesn't contain information about kMSInternalNHWCDomain since it is set during layout transformation. Fall back to known domains instead.
### Description
Contains critical bug fix
### Motivation and Context
This PR handles the bug fix wrt OV caching and blob generation.
This also handles the precision for AUTO plugin.
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Introduce memory efficient topo sort (for training)
~~and laze initialize Priority-Based and Memory-Efficient topo sort.
Because in most cases, they are not needed, so we free the overheads of
GraphViewer construction for most use cases.~~
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.
I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).
0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)
Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. As it's an extra field it's
backwards compatible.
Please feel free to suggest alternative approaches.
Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated by running:
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema which I thought was the correct
way but maybe that's out of date.
I think you can ignore all the diffs in the generated files and just
worry about the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically start at the bottom of
the files changed and work up as all the 'real' diffs are there.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: carzh <wolfivyaura@gmail.com>
### Description
fix test runner with optional input/output.
This change fixes the OP test runner (.jsonc format test) with optional
input(s) and/or output(s).
this fix reveals a problem of dealing with optional outputs:
> Take SkipSimplifiedLayerNorm as example:
>
> if in the ONNX model, the node's outputs are: [ 'output_0', '' ]
instead of [ 'output_0' ], the current implementation will fail. The
difference is, in the first case, context.outputCount == 2, and then the
typescript implementation will try to create a tensor for output[1]. It
will eventually call to C++ function (OpKernelContext::Output), and the
output.DataRaw() will be nullptr. WebGPU backend will fail because it
cannot deal with a TensorView with data == 0.
>
This problem may need to be fixed or workaround in separated PR. This PR
does not fix this problem. Failed test cases are modified to work -
please note this PR does not break those test cases as they never work.
### Description
This PR registers the following opset 20 operators to the DML EP:
-IsNaN-20
-IsInf-20
-ReduceMax-20
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Adds support for float16 BatchNormalization to the HTP backend.
- Fixes float32 support for BatchNormalization on the HTP backend when
`enable_htp_fp16_precision` is enabled.
### Motivation and Context
Support more models on the QNN HTP backend.
### Description
<!-- Describe your changes. -->
Add Nuget package changes for adding new 'net6.0-maccatalyst' platform.
The output ORT Nuget package was manually tested and verified in a .NET
MAUI app setup.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
Unused vector allocating large memory chunk within a concurrent routine
creates heap contention and is eliminated.
### Motivation and Context
This partially addresses
https://github.com/microsoft/onnxruntime/issues/20373.
### Description
These changes include
Support to OpenVINO 2024.1
Import PreCompiled Blobs with EPContext Blob
Separate Device/Precision as input
Deprecate CPU_FP32 , GPU_FP32 terminology , introduce CPU, GPU
AUTO GPU, CPU will only create GPU Blob and not CPU Blob.
### Motivation and Context
- OpenVINO 2024.1 will be out soon
- Import Precompiled Blob can greatly reduce FEIL/FIL Time.
- Separating Device/Precision will make the input cleaner
-
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
<!-- Describe your changes. -->
Add version for onnxruntime_providers_vitisai.dll. So, the
onnxruntime_vitisai_ep.dll can check if the version is compatible.
To make sure the old onnxruntime_vitisai_ep.dll still work, we would
offset the api struct by version field.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve? -->
This is the direct request from Microsoft. The following is the problem
we try to solve:
How would you describe the dependency between (a)
onnxruntime_vitisai_ep.dll and (b) onnxruntime_providers_vitisai.dll?
E.g. for each version of (a) there is a minimum required version of (b),
or for each version of (b) there is minimum required version of (a).
Please note that in practice we won't be able to use the exact version
of ORT/EP that you tested against (because we might need to update ORT
for other reasons), but we might be able to accommodate some version
constraints that you specify. As we approach shipping, we'll lock the
version of ORT/EP to allow for stabilization and more detailed testing
(and work with you if it needs to be updated).
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
I am prefiring this change to pre-run the non-dml checks, and also to
give folks the time to review it before DML gets released. When DML 1.14
officially releases, we'll only need to run the DML pipeline to
automatically pick up the nuget package. This should save us some
valuable time.
Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15
will come soon after.
### Description
Updates the QDQ quantizer to use ONNX Q/DQ ops for 16-bit quantization
if opset >= 21.
### Motivation and Context
The QDQ quantizer previously set the 'com.microsoft' domain on inserted
Q/DQ ops when the model needed 16-bit support. ONNX 1.16.0 added
int16/uint16 support to the QuantizeLinear and DequantizeLinear
operators, so we can change the default behavior.
This PR has the change of supporting INT64 tensor type for TRT 10.
This PR is also **compatible with TRT 8.6 and TRT 10** meaning user can
build ORT TRT against TRT 8.6 or TRT 10.
Due to the timeline for TRT 10 GA and ORT 1.18 release is very tight (We
don't have enough time to get our CIs installed with TRT 10 GA libraries
and run the build/tests), as well as Nvidia new Triton release (The
timeline is also very close to the timeline of TRT 10 GA) wants to
integrate TRT EP with TRT 10.
Therefore, our approach is to make this PR into ORT 1.18 first, so
everything is fully tested with TRT 8.6 CIs, and user can still manually
build ORT 1.18 against TRT 10 like the Triton case.
As for testing TRT 10, once TRT 10 GA is released, we will have another
branch which includes change at this PR as well as whatever changes
needed and update our CIs with TRT 10.
### Description
Currently we try to include all prebuilt binaries into the NPM packages.
This was working until we added libonnxruntime_providers_cuda.so
(>400MB) into the NPM package. The NPM registry refuses to accept new
package publishment because the file is too large.
To make the new NPM package working, we have to remove the large file
from the package, and add a new script on package installation. This
script will try to dynamically install onnxruntime CUDA dynamic library
for Linux/x64.
### Description
Introducing a new class ORTPipelineModule to handle wrapping layers in
DeepSpeed pipeline parallel.
### Motivation and Context
To support pipeline parallelism on ORTModule.
This PR will include an initial support of deepspeed Pipeline
parallelism.
- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
### Description
This fixes following things:
- Expose `ENABLE_NPU_ADAPTER_ENUMERATION` macro via build command, so
that a user can enable NPU support for DML EP seamlessly.
- Add keyword `_dmlEp_` as part of the node name, which would be useful
for debugging purpose.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->