### Description
QNN cannot run MatMul on v68 if both inputs are dynamic and quantized to uint16. Make it run by inserting a Convert op to convert one input to int8.
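Conceptually, an inserted Convert op requantizes one input's data into the other quantized type's space. A minimal Python sketch of that math, with hypothetical scale/zero-point values (not taken from this change):

```python
def requantize(q, scale_in, zp_in, scale_out, zp_out, qmin, qmax):
    """Dequantize a value, then requantize into the target quant space.

    Roughly what a Convert op does when bridging a uint16-quantized
    tensor to an int8-quantized one. All parameters are illustrative.
    """
    real = (q - zp_in) * scale_in           # back to real-valued space
    q_new = round(real / scale_out) + zp_out
    return max(qmin, min(qmax, q_new))      # clamp to the target range

# Hypothetical uint16 -> int8 example
print(requantize(40000, 0.001, 32768, 0.06, 0, -128, 127))  # -> 121
```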
### Description
<!-- Describe your changes. -->
Update the usability checker and related infrastructure to support checking models > 2GB.
- Add the ability to set a flag to keep initializers as external data
  - we optimize the model as part of the checking, so we need to write out a new copy
- Handle an issue with ONNX shape inferencing silently failing
  - use an API that supports large models but requires writing the model to a new file
  - automate cleanup of that copy of the model
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Allow analysis of LLMs to determine gaps for mobile usage.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Allow empty shapes and do not validate them for inputs/outputs at the
InferenceSession::ValidateInputsOutputs().
### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/17301 disallowed empty
shapes.
However, many models depend on them as a way to pass shapes of different
ranks.
### Description
Support uniforms in Slice op
### Motivation and Context
Improve performance.
In LoRA code, conv1d is used to do the projection for QKV, while the conv1d calculation is mathematically equivalent to matmul, and matmul is much faster than conv1d.
The substitution performed by the graph optimizer is: 1 conv1d >> 2 split + 1 squeeze + group_num matmul + 1 concat.
With this optimizer, we see a 10%+ improvement in one 1P model.
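The equivalence is easy to check: a conv1d with kernel size 1 applies the same `[out_ch, in_ch]` weight at every time step, which is exactly a matmul. A small pure-Python sketch (shapes and values are illustrative):

```python
def conv1d_k1(weight, x):
    """conv1d with kernel size 1. weight: [out_ch][in_ch][1], x: [in_ch][seq]."""
    out_ch, in_ch, seq = len(weight), len(x), len(x[0])
    y = [[0.0] * seq for _ in range(out_ch)]
    for o in range(out_ch):
        for i in range(in_ch):
            w = weight[o][i][0]           # the single kernel tap
            for t in range(seq):
                y[o][t] += w * x[i][t]
    return y

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

w = [[[2.0], [1.0]], [[0.5], [-1.0]]]     # [2][2][1]
x = [[1.0, 2.0], [3.0, 4.0]]              # [2][2]
w2d = [[w[o][i][0] for i in range(2)] for o in range(2)]  # squeeze kernel dim
assert conv1d_k1(w, x) == matmul(w2d, x)  # same result, cheaper op
```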
TRT builder instantiation is slow (see
[here](https://github.com/microsoft/onnxruntime/issues/18071)).
In the current TRT EP, we instantiate a builder object every time we need it. Multiple places need the TRT builder, so this causes significant performance overhead.
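The fix amounts to constructing the builder once and reusing it. The actual change is in the C++ TRT EP; this Python sketch with hypothetical names only illustrates the caching pattern:

```python
import functools

class Builder:
    """Stand-in for the expensive TensorRT builder object."""
    constructions = 0

    def __init__(self):
        Builder.constructions += 1  # count how often the setup cost is paid

@functools.lru_cache(maxsize=1)
def get_builder():
    # Construct once; every later call site reuses the same instance.
    return Builder()

b1, b2 = get_builder(), get_builder()
assert b1 is b2 and Builder.constructions == 1
```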
### Description
Our function inliner converts call nodes to a proto. The `Node::ToProto()` function recreates optional NodeArgs in a `NodeProto`. While handling missing input parameters, our inliner simply renames them to empty strings.
`Graph::InlineFunctionProto()` recreates missing NodeArgs even though the original call node did not have them.
This results in the issue mentioned below. The inlined model has the following entries; notice the second argument is present but has no value in the `ReduceSum` call (from a Dynamo-exported model).
> InsertedPrecisionFreeCast__inlfunc__aten_linalg_vector_norm_no_dim_onnx_result_12
> = ReduceSum <keepdims: int = 0, noop_with_empty_axes: int = 0>
> (InsertedPrecisionFreeCast__inlfunc_ReduceL1_data_abs, )
We now allow the second input to ReduceSum to be nullptr and ignore it, as it is optional.
### Motivation and Context
This seeks to address
https://github.com/microsoft/onnxruntime/issues/18338
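In ONNX, a missing optional input is serialized as an empty-string name in the node's input list. A toy sketch of the skip logic, with plain lists standing in for the actual proto types:

```python
# Mirrors the ReduceSum call above: the optional `axes` input slot is
# present in the list but has an empty name.
node_inputs = [
    "InsertedPrecisionFreeCast__inlfunc_ReduceL1_data_abs",
    "",  # missing optional input, serialized as an empty string
]

def supplied_inputs(inputs):
    # Ignore optional inputs that were not supplied.
    return [name for name in inputs if name != ""]

assert supplied_inputs(node_inputs) == [node_inputs[0]]
```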
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.
- Connect MatMulNBits contrib op to MLAS function.
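A rough sketch of the float-A times quantized-4-bit-B Gemm for the M=1 path. The nibble packing order, scale, and zero point here are illustrative assumptions, not the actual MLAS layout:

```python
def unpack_uint4(packed):
    """Two unsigned 4-bit values per byte, low nibble first (assumed layout)."""
    vals = []
    for byte in packed:
        vals.append(byte & 0x0F)
        vals.append((byte >> 4) & 0x0F)
    return vals

def gemm_m1_q4(a_row, packed_b_col, scale, zero_point):
    """C = A x dequant(B) for M=1: one float row of A, one packed column of B."""
    b_col = [(q - zero_point) * scale for q in unpack_uint4(packed_b_col)]
    return sum(a * b for a, b in zip(a_row, b_col))

# 0x21 packs the values 1 (low nibble) and 2 (high nibble)
print(gemm_m1_q4([2.0, 1.0], [0x21], 0.5, 8))  # -> -10.0
```

In the real kernel the dequantization is fused into the dot product and vectorized with NEON; the sketch only shows the arithmetic being computed.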
### Description
- set tsconfig "noUnusedParameters" to `true` and fix a few bugs
discovered by TypeScript.
How unused parameters are fixed:
- for most code (webgl), add an underscore prefix, which is the standard
ignore pattern for the TypeScript check.
- remove unused parameters from functions and modify the corresponding
function calls (jsep)
- fix a bug in ArgMinMax: these two operators do not have more than one
input, so `createArgMinMaxAttributesFromInputs()` is removed.
- add proxy main.ts into the TypeScript check and fix a bug in parameter
passing
- fix the `run()` function call and add a typecheck fix (hack)
Support graph input and initializer for GatherToSplit fusion. Previously the fusion required the Gather nodes to consume some other node, which could not be a graph input or initializer.
This helps some model training cases with such a pattern so that we will not have GatherGrad in the final graph. GatherGrad has a very inefficient kernel implementation.
### Description
A few refinements:
(1) Use fixed optimized shape for dynamic engine of TRT.
(2) Use same seed in base and refiner.
(3) Save metadata to png file so that it is easy to reproduce.
(4) Disable the EulerA scheduler for XL since it has an issue in the refiner with 1.16.2.
(5) Limit height and width to be divisible by 64.
(6) Update the document to add a link for downloading the optimized model.
---------
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
This also sets the Path variable for the downloaded libraries.
### Description
1. Introduce the MoE CUDA op to ORT based on the FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove the patch file for cutlass 3.0.0.
3. The sharded MoE implementation will come in another PR.
Limitation: `__CUDA_ARCH__` >= 700
### Description
Add CI steps to log info for investigating test failures.
Currently the Web CI is marked as 'optional'. This change adds scripts to dump debug info for investigating random test failures.
### Description
Renames `OnnxInputInfo` struct to `TensorInfo` because this struct can
be used for both input and output tensors.
### Motivation and Context
Clean up TODO item
Fixes issues found in yolov3, tiny-yolov3, etc., which have control flow ops.
Two modifications:
1. In GetCapability/GetSupportedList, the newly built graph needs to handle outer scope values before calling graph.Resolve() only if it has a control flow op and also has a parent node.
2. Two graphs/subgraphs may end up with the same graph->Name(). Add a function to get a unique graph name.
### Description
Make all build_wasm tasks (NPM packaging and post merge) run on Linux.
Also enable the WebGPU test in the NPM package pipeline.
### Motivation and Context
Even on Windows, build_wasm runs in Cygwin, so running it on Linux saves a lot of time.
### Description
Set the DML package name correctly so the build doesn't try to include mobile targets.
### Motivation and Context
Fix packaging pipeline.
### Description
Fix bad delegates.
Add a script to detect mismatches, and run it in CI and when creating the nuget package.
Ignore whitespace when looking at the diff to the .cs file, as clang-format ran on it.
### Motivation and Context
#18363
### Description
When an `If` condition proves to be a constant value, inline the corresponding subgraph, yielding more constant folding and optimization.
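A toy sketch of the decision. The real inlining operates on Graph/NodeProto structures; the names and list-of-ops representation here are illustrative only:

```python
def try_fold_if(cond_is_constant, cond_value, then_branch, else_branch):
    """Return the subgraph to inline, or None if the condition is dynamic."""
    if not cond_is_constant:
        return None  # leave the If node in place
    # A known condition means only one branch can ever execute,
    # so the If node collapses to that branch's ops.
    return then_branch if cond_value else else_branch

assert try_fold_if(True, True, ["Add", "Relu"], ["Mul"]) == ["Add", "Relu"]
assert try_fold_if(False, None, ["Add"], ["Mul"]) is None
```

Once a branch is inlined, its nodes become visible to the regular constant-folding passes, which is where the cascading size and runtime wins below come from.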
### Motivation and Context
Newly converted models feature lots of nested `If` nodes that can be
inlined and collapsed.
In particular, for the sample models we are gaining on TorchScript-exported models.
For `HF Mobile Bert Dynamo`, runtime went down from 0.069 to 0.046. In total, AOT inlining + `If` constant folding yields an improvement of about 50% (0.102 -> 0.046), bringing us very close to TorchScript-exported models.
`HF Bart Dynamo` further improves from 0.668 to 0.45; AOT + `If` constant folding improves 0.98 -> 0.45.
Earlier, HF Mobile Bert was **161 MB+**; now it is **98 MB**.
The HF Bart Dynamo pre-optimized model was about **1.2 GB**; it is now **710 MB**.

### Description
Only one of "--cuda_version" and "--cuda_home" is needed. If both are specified, the first takes precedence. Since we now download CUDA SDKs on the fly, the machines will not need a preinstalled CUDA SDK and therefore will not have the VS-CUDA integration extension. Therefore the "--cuda_version" flag will not work. This PR deletes such usages.
Related PR: #15915
### Description
Updates QNN EP to force Split operators to use the same quant params for
all input/outputs (only if they were already nearly equal). This can be
necessary for the sequence Sigmoid -> Split because QNN requires Sigmoid
ops to override output quant params to specific values.
Also did the same for the following operators that do not change input
data:
- Expand
- Gather
- MaxPool
- Reshape/Flatten/Squeeze/Unsqueeze
- Resize
- Split
- Tile
### Motivation and Context
The QNN HTP backend employs certain optimizations when all the
quantization parameters for the Split operator are equal. We need to
ensure they are equal to get better inference latency performance.
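A sketch of the unification step in Python. The tolerance values, the `(scale, zero_point)` tuple representation, and the function name are illustrative assumptions, not the EP's actual API:

```python
import math

def unify_quant_params(io_params, rel_tol=1e-4, abs_tol=1e-5):
    """io_params: [(scale, zero_point), ...] for an op's inputs/outputs.

    If all entries are already nearly equal, snap them to the first
    entry's exact values so QNN sees identical params and can apply its
    optimization; otherwise leave them untouched.
    """
    scale0, zp0 = io_params[0]
    if all(math.isclose(s, scale0, rel_tol=rel_tol, abs_tol=abs_tol) and z == zp0
           for s, z in io_params):
        return [(scale0, zp0)] * len(io_params)
    return io_params  # genuinely different params are left alone

# Nearly-equal scales get unified; clearly different ones do not.
print(unify_quant_params([(0.1, 0), (0.100001, 0)]))  # -> [(0.1, 0), (0.1, 0)]
```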
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
Update SDXL demo to test more configurations (including every scheduler).
Update documents to add instructions for running demo in docker.
Update package version in requirements.
Enable custom fp16 VAE in TensorRT for fair comparison.
### Description
Use different march flag to workaround what appears to be a clang issue.
See https://github.com/tensorflow/tensorflow/issues/59970 for links to
various relevant pieces of info/discussions.
### Motivation and Context
1. Fix the dist setting bug for LLaMA2-70b distributed conversion and benchmarking.
2. Add instructions in the README for how to benchmark LLaMA2-70b distributed inference.
### Description
This PR fixes the TypeScript type check.
Previously, when I used esbuild to replace webpack (#17745), the TypeScript type check was disabled. This caused a few TypeScript type errors to be checked in to the code base. This PR fixes the following:
- Use "Node16" as the default "module" value in tsconfig.json, because in TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is an invalid combination.
- Set `noUnusedParameters` to true by default. In web, override it to false because multiple pieces of code need to be updated (a follow-up PR will do this).
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
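A minimal tsconfig fragment illustrating the valid module/moduleResolution pairing described above (a sketch only; the repo's actual tsconfig contains many more options):

```json
{
  "compilerOptions": {
    "module": "Node16",
    "moduleResolution": "Node16",
    "noUnusedParameters": true
  }
}
```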
### Description
Add a 32-bit patch binary and infra to fall back to it. The Azure DevOps Windows CIs are missing patch.exe from their git install for some reason, so the default `find_package(Patch)` fails, as that is where it expects to find it.
Remove the Eigen patch. The underlying issue was fixed in the source 3 years ago by c6c84ed961, and the patch command is invalid (the args are for git apply, not patch).
### Motivation and Context
Make usage of patch consistent across all CIs
Fix https://github.com/microsoft/onnxruntime/issues/15248