### Description
<!-- Describe your changes. -->
For the conv2dByMatMul path, the simulated matmul output shape is the
reshape of the original conv2d. So we should pass this information to
`createMatmulProgramInfo` so that it can process it correctly.
### Description
<!-- Describe your changes. -->
- Fix missing optional input checks originally coming from a github
issue for no shape on Resize Op.
- Exclude Antialias support for Opset 18 + Resize for NNAPI
- Unblock Android CI pipeline tests failure.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Bug fixes.
Issue:
https://github.com/microsoft/onnxruntime/issues/17035
thanks @skottmckay for pointing out the cause.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
<!-- Describe your changes. -->
As title.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Now we have multiple data types that we want to disable for minimal
build and to reduce binary size. may be worth adding an argument in the
build script for specifying that.
Also for fp16 type stuff, it may be too restrict to disable that for all
minimal build.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
When users run inference with cuda graph enable with multithreading,
only the main thread creating the inference session will successfully
initialize cuda graph instance, for other threads executing the
inference run directly, they will hit segfault due to not calling
allocation/initialization for cuda graph instance.
This PR fixes this issue.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The warning is:
```
/onnxruntime_src/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1202:41: error: call to non-‘constexpr’ function ‘bool onnx_transpose_optimization::TransposeQuantizeDequantizeAxis(const onnx_transpose_optimization::api::GraphRef&, const std::vector<long int>&, onnx_transpose_optimization::api::NodeRef&)’
return TransposeQuantizeDequantizeAxis(graph, perm, node);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
The function TransposeQuantizeDequantizeAxis is not constexpr.
\
### Description
Fix issues:
(1) When the output of Add before LayerNormalization node is a graph
output, we shall output it in SkipLayerNormalization, but currently not.
(2) When there is Cast before Add bias, the Cast output (instead of
input) shall be used as SkipLayerNormalization input.
(3) The skip input is not at the second input of fused node. According
to op spec, skip shall be the second. It could bring issue when we add
skip broadcasting support later.
### Motivation and Context
Fusion for Clip model of SDXL failed since the last hidden state is a
graph output.
### Description
Add the compiler cache in linux GPU tensorRT CI.
Save about 30 minutes in the GPU machine. (52 minutes -> 24 minutes)
PS.
There're only white-space differences in the dockerfile.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
In CopyInputAcrossDevice() function, we assign each feed a stream to
copy across device, once the copy is done, each stream will trigger the
Flush() function which is undesired. Same stream should be only flushed
once
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change is to address a perf issue of TLNGv4 inference which
contains subgraph with many input feeds.
The ONNX's pads is [beginning_height, beginning_width, ending_height,
ending_width], while WebNN's padding is [beginning_height,
ending_height, beginning_width, ending_width]. We should permute the
ONNX's pads to [0, 2, 1, 3] for WebNN.
### Description
- Fix incorrect zero-point calculation in unit tests. Affects int8(signed) QDQ models.
- Replace flaky MatMul test that occasionally fails on main branch with a version that uses explicit inputs.
### Motivation and Context
Fix bug and improve test accuracy and stability.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Get the latest gcc 12 by default
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
There was an Init() method that does exactly like the lines I replaced,
so I switched to it.
### Motivation and Context
Simpler with no drawbacks.
[//]: # (## Work In Progress. Feedbacks are welcome!)
### Description
This PR adds a few properties, methods and factories to Tensor type to
support IO-binding feature. This will allow user to create tensor from
GPU/CPU bound data without a force transferring of data between CPU and
GPU.
This change is a way to resolve#15312
### Change Summary
1. Add properties to `Tensor` type:
a. `location`: indicating where the data is sitting. valid values are
`cpu`, `cpu-pinned`, `texture`, `gpu-buffer`.
b. `texture`: sit side to `data`, a readonly property of `WebGLTexture`
type. available only when `location === 'texture'`
c. `gpuBuffer`: sit side to `data`, a readonly property of `GPUBuffer`
type. available only when `location === 'gpu-buffer'`
2. Add methods to `Tensor` type (usually dealing with inference
outputs):
- async function `getData()` allows user to download data from GPU to
CPU manually.
- function `dispose()` allows user to release GPU resources manually.
3. Add factories for creating `Tensor` instances:
a. `fromTexture()` to create a WebGL texture bound tensor data
b. `fromGpuBuffer()` to create a WebGPUBuffer bound tensor data
c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer
### Examples:
create tensors from texture and pass to inference session as inputs
```js
// when create session, specify we prefer 'image_output:0' to be stored on GPU as texture
const session = await InferenceSession.create('./my_model.onnx', {
executionProviders: [ 'webgl' ],
preferredOutputLocation: { 'image_output:0': 'texture' }
});
...
const myImageTexture = getTexture(); // user's function to get a texture
const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format.
const results = await session.run(myFeeds);
const myOutputTexture = results['image_output:0'].texture;
```
### Description
Changes in this PR:
1) use the optimized version `makeMatMulPacked[Vec4]Source` to support
matmul.
2) enable the conv2dByMatMul path.
3) support broadcast
4) use IndicesHelper.
MatMul with M = 512, K = 512, N = 512 becomes 2ms from 15ms when
enabling profilingMode on my ADL.
### Description
Tested with stable diffusion unet models exported by both pytorch 2.1.0
(nightly) and pytorch 1.13.1, with and without LoRA weights.
### Motivation and Context
LoRA weights modifiy the unet model by adding matmul and scale
operations to every q/k/v/out tensors, which breaks the current MHA
pattern recognition.
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
When original model has external data in current directory, saving the
optimized model will raise File not found exception during looking for
external data file under root directory "/". This fix will look under
current directory for this case.
I manually tested an extra case and it is working: Original model with
external data in root directory ("/"), and save optimized to current
directory.
BTW, there is another bug found: when
"session.optimized_model_external_initializers_min_size_in_bytes" is set
a large value, some tensor is still pointed to the original external
data file. Add a TODO in unit test for this bug. Possible solution: load
external data into memory before saving model.
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.
### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.
---------
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
Sort the candidates in rocBLAS/hipBLASLt to make sure that they are
properly ordered and can be correctly fetched by saved indices in
offline tuning cases.
### Description
This PR adds kernel implementation for operator "Not" and "Equal". Also
removed download cache in gpu data manager.
**Why removing download cache**
The following test case failed. ("Or" is on CPU, "Greater" and "Equal"
are on JSEP)

after debugging, I found that both "Equal" and "Greater" are using the
same output GPU Data ID. This is because when ORT executes the graph, it
first run "Equal", allowing its shader to write into GPU Data ID 2; then
a Gpu2Cpu copy for it is issued (because currently "Or" is on CPU EP);
at this point, ORT thinks GPU Data ID=2 is free to use; so it reuse it
as output for "Greater". This means there is no allocation for output of
"Greater" kernel, and both kernel writes to GPU Data ID=2.
For gpu data manager, there will be 2 downloads from the same GPU
buffer. Previously I think this is a waste of resource so I cached the
data. But now it shoes that we need to perform 2 downloads because the
GPU data is already different. The download data cache should be
removed.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This is a followup to PR #15519 that is closed in favor of this one.
### Motivation and Context
The current implementation of TRT cache has no code execution path
possible so that an encrypted TRT engine cache could be created when
flags engine_cache_enable and engine_decryption_enable are true. This
was originally raised in issue #12551.