### Description
Add support for int32 in neg.
### Description
Commit fffefb1c22 (#16969) optimized matmul and also fixed
broadcasting, so #17191 is no longer needed.
However, the operator test file newly added in that PR by @dakenf is
helpful, so pick it up and add it to enhance the tests.
### Description
Clean up JSDoc for onnxruntime-common:
- Replace "@internal" with "@ignore", as JSDoc does not use "@internal".
Using "@ignore" keeps the content out of the generated doc.
### Description
Remove unnecessary cargo:rerun-if-changed declaration.
### Motivation and Context
'cargo:rerun-if-changed' declarations tell Cargo when to re-run the
build script. The intention is that if the build script depends on other
files, then Cargo knows to re-run if those files change. It stores the
output and checks it before each build. The intention is that one emits
the declarations for _inputs_ of the build.
This rerun-if-changed declaration is a declaration on the _output_ of
the build, and stores the absolute path of the output. This is not a
useful declaration because the output path is unique to the build script
- there is no way for anything else to change it.
However, this does generate unnecessary rebuilds in some cases, for
example if the dependent repository is moved in the filesystem. This
causes me some issues when using https://crane.dev, as due to some
implementation details, if a crate being moved triggers a rebuild, by
default the build is broken.
To summarise:
- declaration is redundant
- causes issues in niche cases.
### Description
1. Update docker files and their build instructions.
ARM64 and x86_64 can use the same docker file.
2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8
AB#18990
### Description
Remove onnxruntime_test_all from the emulator once tests have finished, as
it's 1.2GB and takes up too much space given the 2GB maximum partition
size for the emulator.
A side issue is that the Java build isn't able to strip the binaries in the
Java APK, which causes the APK to be 800MB (together with the 1.2GB test
binary, this exceeds the 2GB max). That may require an Android/Gradle fix,
as I don't think we can hardcode an NDK version into our build files.
https://issuetracker.google.com/issues/237187538?pli=1
### Motivation and Context
Fix Android CI build failures for
### Description
This PR changes the Whisper export scripts to further optimize the
process of removing duplicate initializers from two subgraphs.
The current greedy approach is quicker by a large factor, but it leaves
some duplicate initializers uncaught. This results not only in a
slightly larger Whisper model, but also in a model that uses
more GPU memory.
The approach in this PR uses data hashes and caches to keep the export
quick without relying on a greedy approach.
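
As a rough illustration of hash-based duplicate detection (not the export script itself, which is not shown here), the sketch below buckets initializers by a hash of their raw bytes and confirms each candidate with a byte comparison. The `{ name, data }` record shape, the FNV-1a hash, and the function names are all assumptions made for this example.

```js
// Cache of computed hashes, keyed by initializer object, so no tensor is hashed twice.
const hashCache = new WeakMap();

// FNV-1a 32-bit hash over raw bytes (a stand-in; the real script may use another hash).
function fnv1a(bytes) {
  let h = 0x811c9dc5;
  for (const b of bytes) {
    h ^= b;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function hashOf(init) {
  if (!hashCache.has(init)) hashCache.set(init, fnv1a(init.data));
  return hashCache.get(init);
}

function sameBytes(a, b) {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Returns [nameInA, nameInB] pairs whose data is byte-for-byte identical.
function findDuplicateInitializers(subgraphA, subgraphB) {
  const buckets = new Map();
  for (const init of subgraphA) {
    const h = hashOf(init);
    if (!buckets.has(h)) buckets.set(h, []);
    buckets.get(h).push(init);
  }
  const duplicates = [];
  for (const init of subgraphB) {
    const candidates = buckets.get(hashOf(init)) || [];
    // A hash match is only a hint; compare bytes to rule out collisions.
    const match = candidates.find((c) => sameBytes(c.data, init.data));
    if (match) duplicates.push([match.name, init.name]);
  }
  return duplicates;
}
```

Every initializer in the second subgraph is checked against all hashed candidates from the first, so no duplicate is skipped, while the hash lookup avoids comparing every pair of tensors.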
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
Validate output types and shapes. Make sure sparse initializers are
taken into account.
### Motivation and Context
ORT currently does not validate output types or shapes. Further, neither
inputs nor outputs take into account sparse initializers that are
converted from dense.
It is currently possible to pre-allocate a wrong type/shape buffer for
output.
Cc: @Craigacp
### Description
The 8 cu files under [flash
attention](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/flash_attention)
and the 4 cu files under [cutlass
fmha](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/cutlass_fmha)
need a lot of memory to compile.
Previously, the default number of nvcc threads was the same as the number
of parallel jobs, i.e. the number of CPU cores.
Standard_NC4as_T4_v3 has 4 CPUs and 28 GB of memory, and we launched 16
nvcc threads in total (4 parallel jobs, and 4 nvcc threads per job).
Each thread might take 4 GB on average (the peak is around 6 GB, but threads
are not started at the same time). OOM happens since 16 threads might need
close to 64 GB in the worst case. When the build machine has 64 GB of memory
or more, OOM is rare.
Here we set a proper nvcc --threads value based on available memory to avoid
OOM.
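
The arithmetic above can be sketched as follows. This is only an illustration of a memory-based cap, not the formula used in the build script (which is not shown here); the function name and the 4 GB-per-thread default are assumptions taken from the estimate in the text.

```js
// Hypothetical helper: pick nvcc --threads so that
// parallelJobs * threads * gbPerThread stays within available memory.
function nvccThreads(memoryGb, cpuCores, gbPerThread = 4) {
  const parallelJobs = cpuCores; // assumed: one parallel build job per core
  const byMemory = Math.floor(memoryGb / (parallelJobs * gbPerThread));
  // Never go below 1 thread or above the old default (number of cores).
  return Math.max(1, Math.min(cpuCores, byMemory));
}

console.log(nvccThreads(28, 4)); // Standard_NC4as_T4_v3: 4 CPUs, 28 GB
console.log(nvccThreads(64, 4)); // a 64 GB machine with 4 CPUs
```

With these assumptions, the 28 GB machine drops to a single nvcc thread per job, while a 64 GB machine keeps 4.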
### Motivation and Context
Fix `Python Packaging Pipeline (Training Cuda 11.8)`
### Description
Git commands producing `git-commit-id` and `git-branch` are always run
in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`).
### Motivation and Context
Please refer to corresponding issue
[#17197](https://github.com/microsoft/onnxruntime/issues/17197).
### Description
For the conv2dByMatMul path, the simulated matmul output shape is a
reshape of the original conv2d output shape. So we should pass this
information to `createMatmulProgramInfo` so that it can process it correctly.
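
To make the shape relationship concrete, here is a small sketch for one case only: an NHWC input with a 1x1 kernel, stride 1 and no padding. The helper name and return fields are hypothetical, and the real signature of `createMatmulProgramInfo` is not shown here.

```js
// Hypothetical helper: shapes involved when a 1x1 conv2d (NHWC, stride 1,
// no padding) is computed as a matmul.
function conv1x1AsMatMulShapes(inputDims, outChannels) {
  const [n, h, w, c] = inputDims;
  return {
    a: [n, h * w, c],                       // input viewed as a matrix per batch
    b: [c, outChannels],                    // weights viewed as a matrix
    matmulOutput: [n, h * w, outChannels],  // what the matmul naturally produces
    convOutput: [n, h, w, outChannels],     // the shape the conv caller expects
  };
}

const shapes = conv1x1AsMatMulShapes([1, 4, 5, 3], 8);
console.log(shapes.matmulOutput, shapes.convOutput);
```

Both output shapes hold the same number of elements, which is why the conv output shape can be handed to the matmul program as a reshape of its own output.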
### Description
- Fix missing optional input checks, originally reported in a GitHub
issue about a missing shape on the Resize op.
- Exclude Antialias support for Opset 18+ Resize for NNAPI.
- Unblock failing Android CI pipeline tests.
### Motivation and Context
Bug fixes.
Issue:
https://github.com/microsoft/onnxruntime/issues/17035
Thanks to @skottmckay for pointing out the cause.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
As title.
### Motivation and Context
We now have multiple data types that we want to disable for minimal
builds to reduce binary size. It may be worth adding an argument to the
build script for specifying that.
Also, for fp16 types, it may be too restrictive to disable them for all
minimal builds.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
When users run inference with CUDA graph enabled from multiple threads,
only the main thread that created the inference session successfully
initializes a CUDA graph instance. Other threads that execute the
inference run directly hit a segfault, because the CUDA graph instance
is never allocated or initialized for them.
This PR fixes this issue.
### Description
The warning is:
```
/onnxruntime_src/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1202:41: error: call to non-‘constexpr’ function ‘bool onnx_transpose_optimization::TransposeQuantizeDequantizeAxis(const onnx_transpose_optimization::api::GraphRef&, const std::vector<long int>&, onnx_transpose_optimization::api::NodeRef&)’
return TransposeQuantizeDequantizeAxis(graph, perm, node);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
The function TransposeQuantizeDequantizeAxis is not constexpr.
### Description
Fix issues:
(1) When the output of the Add before the LayerNormalization node is a graph
output, we should output it from SkipLayerNormalization, but currently we do not.
(2) When there is a Cast before the Add bias, the Cast output (instead of
its input) should be used as the SkipLayerNormalization input.
(3) The skip input is not the second input of the fused node. According
to the op spec, skip should be the second input. This could cause issues
when we add skip broadcasting support later.
### Motivation and Context
Fusion for Clip model of SDXL failed since the last hidden state is a
graph output.
### Description
Add the compiler cache to the Linux GPU TensorRT CI.
This saves about 30 minutes on the GPU machine (52 minutes -> 24 minutes).
PS.
There are only whitespace differences in the dockerfile.
### Description
In the CopyInputAcrossDevice() function, we assign each feed a stream for
copying across devices. Once the copy is done, Flush() is triggered for
every feed's stream, which is undesirable when feeds share a stream. The
same stream should only be flushed once.
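
The idea can be sketched with a set of already-flushed streams. This is an illustration only: the actual change lives in ORT's C++ code, and the stream objects and function name below are hypothetical stand-ins.

```js
// Hypothetical sketch: flush each distinct stream exactly once, even when
// many feeds were assigned the same stream. Feeds without a stream are skipped.
function flushUniqueStreams(feedStreams) {
  const flushed = new Set();
  for (const stream of feedStreams) {
    if (stream && !flushed.has(stream)) {
      flushed.add(stream);
      stream.flush();
    }
  }
  return flushed.size; // number of flushes actually issued
}
```

With many input feeds mapped to a handful of streams, the number of flushes becomes proportional to the number of streams rather than the number of feeds.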
### Motivation and Context
This change addresses a perf issue in TLNGv4 inference, which
contains a subgraph with many input feeds.
ONNX's pads are [beginning_height, beginning_width, ending_height,
ending_width], while WebNN's padding is [beginning_height,
ending_height, beginning_width, ending_width]. We should permute
ONNX's pads with the index order [0, 2, 1, 3] for WebNN.
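
A minimal sketch of that permutation for the 2-D case (the function name is made up for this example):

```js
// Reorder ONNX pads [beginH, beginW, endH, endW] into
// WebNN padding [beginH, endH, beginW, endW].
function onnxPadsToWebnnPadding(pads) {
  if (pads.length !== 4) {
    throw new Error('expected 4 pad values for a 2-D op');
  }
  return [0, 2, 1, 3].map((i) => pads[i]);
}

console.log(onnxPadsToWebnnPadding([1, 2, 3, 4]));
```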
### Description
- Fix incorrect zero-point calculation in unit tests. This affects int8 (signed) QDQ models.
- Replace a flaky MatMul test that occasionally fails on the main branch with a version that uses explicit inputs.
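
For context, a common way to derive asymmetric quantization parameters is sketched below. This is the textbook formula, not the code from the unit tests (the actual fix is not shown here); the function name and defaults are assumptions. The point it illustrates is that for signed int8 the zero point is anchored at `qmin = -128` rather than 0.

```js
// Textbook asymmetric quantization parameters for a real range [rmin, rmax].
// Defaults are for signed int8; pass qmin = 0, qmax = 255 for uint8.
function quantParams(rmin, rmax, qmin = -128, qmax = 127) {
  // The representable range must include 0 so that 0.0 maps to an exact integer.
  const lo = Math.min(rmin, 0);
  const hi = Math.max(rmax, 0);
  const scale = hi === lo ? 1 : (hi - lo) / (qmax - qmin);
  const zeroPoint = Math.round(qmin - lo / scale);
  // Clamp to the quantized type's range.
  return { scale, zeroPoint: Math.min(qmax, Math.max(qmin, zeroPoint)) };
}
```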
### Motivation and Context
Fix bug and improve test accuracy and stability.
### Description
### Motivation and Context
Get the latest gcc 12 by default
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
There was already an Init() method that does exactly what the lines I
replaced did, so I switched to it.
### Motivation and Context
Simpler with no drawbacks.
[//]: # (## Work In Progress. Feedbacks are welcome!)
### Description
This PR adds a few properties, methods and factories to the Tensor type to
support the IO-binding feature. This allows users to create a tensor from
GPU/CPU bound data without forcing a data transfer between CPU and
GPU.
This change is a way to resolve #15312
### Change Summary
1. Add properties to the `Tensor` type:
a. `location`: indicates where the data resides. Valid values are
`cpu`, `cpu-pinned`, `texture`, `gpu-buffer`.
b. `texture`: sits alongside `data`; a readonly property of type
`WebGLTexture`, available only when `location === 'texture'`.
c. `gpuBuffer`: sits alongside `data`; a readonly property of type
`GPUBuffer`, available only when `location === 'gpu-buffer'`.
2. Add methods to the `Tensor` type (usually for dealing with inference
outputs):
- async function `getData()` lets the user download data from GPU to
CPU manually.
- function `dispose()` lets the user release GPU resources manually.
3. Add factories for creating `Tensor` instances:
a. `fromTexture()` to create a tensor bound to a WebGL texture
b. `fromGpuBuffer()` to create a tensor bound to a WebGPU buffer
c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer
### Examples:
Create tensors from a texture and pass them to an inference session as inputs:
```js
// when create session, specify we prefer 'image_output:0' to be stored on GPU as texture
const session = await InferenceSession.create('./my_model.onnx', {
executionProviders: [ 'webgl' ],
preferredOutputLocation: { 'image_output:0': 'texture' }
});
// ...
const myImageTexture = getTexture(); // user's function to get a texture
const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format.
const results = await session.run(myFeeds);
const myOutputTexture = results['image_output:0'].texture;
```