### Description
Updates the version of numpy required by onnxruntime to >=1.21.6,<2.0
### Motivation and Context
Numpy released version 2.0. The onnxruntime 1.18.1 release is using
numpy < 2.0, so we need to update requirement files to only install
versions between 1.21.6 and 2.0 (non-inclusive).
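A minimal sketch of what the `>=1.21.6,<2.0` pin accepts and rejects. The helper names are hypothetical and only plain numeric release strings are handled (no pre-release tags), so this illustrates the range and is not how pip evaluates specifiers.

```python
def parse_version(text):
    """Turn a dotted release string such as '1.21.6' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_numpy_pin(version, lower="1.21.6", upper="2.0"):
    """Return True when lower <= version < upper (upper bound excluded)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Lower bound is inclusive, upper bound is non-inclusive.
print(satisfies_numpy_pin("1.21.6"))
print(satisfies_numpy_pin("2.0.0"))
```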
### Description
Adding critical TensorRT EP support
### Motivation and Context
---------
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: inisis <46103969+inisis@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Dhruv Matani <dhruvbird@gmail.com>
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com>
Co-authored-by: Thomas Boby <thomas@boby.uk>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description
### Motivation and Context
---------
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Your Name <you@example.com>
### Description
Update order of steps
### Motivation and Context
Fix CI
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
The Key and Value inputs can be 4-dimensional.
### Motivation and Context
### Description
Add bf16 support for the following ops:
- ConstantOfShape
- Exp
- Erf
- convolution
- PythonOp
### Motivation and Context
The phimm model runs in bf16, so ORT needs to support bf16 for the ops
above in order to work with phimm in bf16.
The prefetch instruction (`_mm_prefetch`) is used to anticipate memory
accesses by prefetching the next row of the input buffer. This
optimization is designed to reduce the impact of memory latency, thereby
improving the performance of the MlasComputeSoftmax function. As a
result, the worst-case performance of the OCR model has improved by
approximately 50ms, which equates to a 3% improvement.
From TensorRT 10 GA onwards, the TensorRT libraries on Windows have the
major version appended to their names, for example, nvinfer_10.dll,
nvinfer_plugin_10.dll, nvonnxparser_10.dll ...
Change the cmake file accordingly.
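The naming rule can be sketched as follows. This is a Python illustration only, not the cmake change itself. The helper name is hypothetical, only the three libraries named above are listed, and it assumes releases before 10 use unsuffixed DLL names.

```python
# Library base names taken from the examples in the description above.
TRT_LIBS = ("nvinfer", "nvinfer_plugin", "nvonnxparser")


def trt_windows_dll_names(major_version):
    """Return the expected Windows DLL names for a TensorRT major version.

    Assumption: versions before 10 carry no suffix; 10 and later append
    '_<major>' to the base name.
    """
    suffix = f"_{major_version}" if major_version >= 10 else ""
    return [f"{name}{suffix}.dll" for name in TRT_LIBS]


print(trt_windows_dll_names(10))
```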
### Description
1. Update the image name so that the docker image is not overwritten.
2. There was a mistake where variables.CUDA_VERSION_MAJOR was always empty:
14fcf0a52d/tools/ci_build/github/azure-pipelines/stages/nuget-linux-cuda-packaging-stage.yml (L120)
3. Set one artifact name as a variable to make the job rerunnable.
### Motivation and Context
### Description
The current code is calling one method with a missing argument.
### Motivation and Context
It breaks Olive's unittests.
---------
Co-authored-by: Xavier Dupré <xavier.dupre@gmail.com>
### Description
We originally used compute queues only for compute-only devices; this
change sets the default for DX12 devices to use compute queues as well.
### Motivation and Context
There have been issues with TDRs occurring when using the current
default queues, which don't happen on compute queues.
### Description
Fixed the condition for concatenating pastkey with key and pastvalue
with value, and fixed an index error. Added new test cases.
### Motivation and Context
### Description
This PR supports a build of onnxruntime.xcframework for xros/xrsimulator
for visionos via the build command of
`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.
Officially including visionos in the iOS cocoapods package and testing
it in CI would require separate work to upgrade the Xcode version and
to upgrade the macOS CI agent to macos-13-arm64 or higher.
### Motivation and Context
visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
In DeepSpeed's pipeline-parallel implementation, there is a class,
LayerSpec, used to instantiate the object after it's moved to the device
and assigned to a stage.
This approach helps reduce peak memory usage.
In this PR, we're adding support in ORT for wrapping LayerSpec.
### Description
As described in the latest discussion in #19915, parcel v2 does not work
correctly with onnxruntime-web unless the
[new resolver](https://parceljs.org/blog/v2-9-0/#new-resolver) is used.
Some users still use parcel with the default resolver, so this PR adds
the deprecated "browser" field back for backward compatibility. It also
corrects the "main" field, which the old resolver uses for Node.js.
### Description
Enabled more use cases
### Motivation and Context
### Description
Fix comparison that was not updated when the threshold was converted to
bytes.
### Motivation and Context
Fix CI failure
This reverts commit f396748ed6.
### Description
- Updates Windows QNN Nuget and Python packaging pipelines to download
QNN SDK from blob storage.
- Makes the QNN SDK version configurable when launching the python
packaging pipeline.
### Motivation and Context
Removes the need to rebuild images to update QNN SDK. Only applies to
Windows pipelines. Linux pipelines still get the SDK from disk.
### Description
Support GQA operator on CPU with FP32.
### Motivation and Context
Right now, models generated for CPU and GPU must be different. GQA CPU
allows these models to be the same.
### Description
Support 1D input to XNNPACK Conv and ConvTranspose by adding a fake
height of 1 to convert it to 2D input.
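The height-of-1 trick can be illustrated with a small pure-Python sketch. It uses a single channel, stride 1 and no padding, and all function names are hypothetical. It only demonstrates that a 2D convolution over a height-1 input reproduces the 1D result; it is not the XNNPACK EP code.

```python
def conv1d(signal, kernel):
    """Valid (no padding, stride 1) 1D cross-correlation."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + dr][c + dc] * kernel[dr][dc]
                for dr in range(kh) for dc in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv1d_via_2d(signal, kernel):
    """Wrap the 1D data in a fake height-1 dimension, run 2D, then unwrap."""
    return conv2d([signal], [kernel])[0]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
print(conv1d_via_2d(signal, kernel) == conv1d(signal, kernel))
```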
### Motivation and Context
Enable speech model with 1D input to use XNNPACK. There is no CPU EP
quantized ConvTranspose, so this fills that gap.
Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. The ORT format model doesn't contain information about kMSInternalNHWCDomain since it is set during layout transformation. Fall back to known domains instead.
### Description
Contains critical bug fix
### Motivation and Context
This PR handles the bug fix wrt OV caching and blob generation.
This also handles the precision for AUTO plugin.
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Introduce memory efficient topo sort (for training)
~~and lazily initialize the Priority-Based and Memory-Efficient topo sorts.
In most cases they are not needed, so we avoid the overhead of
GraphViewer construction for most use cases.~~
### Motivation and Context
### Description
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.
I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).
0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)
Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. As it's an extra field it's
backwards compatible.
Please feel free to suggest alternative approaches.
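As a rough illustration of the size-based decision described above, the sketch below plans whether tensor bytes stay inline or go to an external file. The function name, the exact threshold constant, and the all-or-nothing policy are assumptions made for this example and are not taken from the checkpoint code.

```python
# Hypothetical threshold: the largest value a signed 32-bit offset can hold,
# which is roughly the ~2GB limit mentioned above.
EXTERNAL_DATA_THRESHOLD_BYTES = 2**31 - 1


def plan_tensor_storage(tensor_sizes, threshold=EXTERNAL_DATA_THRESHOLD_BYTES):
    """Map each tensor name to None (inline) or a byte offset in an external file.

    tensor_sizes: dict of tensor name -> raw data size in bytes.
    Assumption: once the total exceeds the threshold, every tensor's raw data
    is written back to back into one external file.
    """
    use_external = sum(tensor_sizes.values()) > threshold
    plan = {}
    offset = 0
    for name, size in tensor_sizes.items():
        if use_external:
            plan[name] = offset
            offset += size
        else:
            plan[name] = None
    return plan


# A lowered threshold is used here only to keep the example small.
print(plan_tensor_storage({"weight": 10, "bias": 20}, threshold=15))
```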
Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated by running:
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema which I thought was the correct
way but maybe that's out of date.
I think you can ignore all the diffs in the generated files and just
worry about the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically start at the bottom of
the files changed and work up as all the 'real' diffs are there.
### Motivation and Context
---------
Co-authored-by: carzh <wolfivyaura@gmail.com>
### Description
Fix test runner with optional input/output.
This change fixes the OP test runner (.jsonc format test) with optional
input(s) and/or output(s).
This fix reveals a problem with handling optional outputs:
> Take SkipSimplifiedLayerNorm as example:
>
> If the node's outputs in the ONNX model are [ 'output_0', '' ]
instead of [ 'output_0' ], the current implementation will fail. The
difference is that in the first case context.outputCount == 2, so the
TypeScript implementation tries to create a tensor for output[1]. This
eventually calls the C++ function OpKernelContext::Output, and
output.DataRaw() will be nullptr. The WebGPU backend then fails because
it cannot deal with a TensorView with data == 0.
>
This problem may need to be fixed or worked around in a separate PR. This
PR does not fix it. Failed test cases are modified to work. Please note
that this PR does not break those test cases, as they never worked.
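The difference between the two output lists can be sketched as below. The helper names are hypothetical, and the code only illustrates the counting mismatch, relying on the ONNX convention that an empty name marks an omitted optional output. It is not the test runner or the WebGPU backend logic.

```python
def declared_output_count(output_names):
    """Count every slot, including omitted optional outputs."""
    return len(output_names)


def present_output_indices(output_names):
    """Indices of outputs that are actually requested (non-empty name)."""
    return [i for i, name in enumerate(output_names) if name != ""]


outputs = ["output_0", ""]
# Two slots are declared, but only index 0 refers to a real output.
print(declared_output_count(outputs), present_output_indices(outputs))
```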
### Description
This PR registers the following opset 20 operators to the DML EP:
- IsNaN-20
- IsInf-20
- ReduceMax-20
### Motivation and Context