### Description
Renames `OnnxInputInfo` struct to `TensorInfo` because this struct can
be used for both input and output tensors.
### Motivation and Context
Clean up TODO item
Fixes issues found in yolov3, tiny-yolov3, and other models that contain
control flow ops.
Two modifications:
1. In GetCapability/GetSupportedList, outer scope values need to be
handled before calling graph.Resolve() only if the newly built graph
contains a control flow op and also has a parent node.
2. Two graphs/subgraphs can share the same graph->Name(), so add a
function that returns a unique graph name.
### Description
Make all build_wasm tasks (NPM packaging and post merge) run on Linux.
Enable the WebGPU test in the NPM packaging pipeline too.
### Motivation and Context
Even on Windows, build_wasm runs in Cygwin, so running it on Linux
could save a lot of time.
### Description
Set DML package name correctly so the build doesn't try and include mobile targets.
### Motivation and Context
Fix packaging pipeline.
### Description
Fix bad delegates.
Add script to detect mismatch, and run in CI and when creating nuget
package.
Ignore whitespace when looking at the diff to the .cs file as
clang-format ran.
### Motivation and Context
#18363
### Description
When the condition of an `If` node proves to be a constant value, inline
the corresponding subgraph, which enables more constant folding and
optimization.
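The idea can be illustrated with a minimal sketch on a toy graph representation. The `Node` class and `inline_constant_ifs` function below are hypothetical and are not the ONNX Runtime implementation; a real inliner would also need to rename subgraph outputs and handle outer scope values, which this sketch skips.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Toy graph node (hypothetical; not the ONNX Runtime node type)."""
    op_type: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    # Only used by `If` nodes: the two candidate subgraphs.
    then_branch: Optional[List["Node"]] = None
    else_branch: Optional[List["Node"]] = None


def inline_constant_ifs(nodes: List[Node], constants: Dict[str, bool]) -> List[Node]:
    """Replace `If` nodes whose condition is a known constant with the chosen branch."""
    result: List[Node] = []
    for node in nodes:
        cond = node.inputs[0] if node.inputs else None
        if node.op_type == "If" and cond in constants:
            branch = node.then_branch if constants[cond] else node.else_branch
            # Recurse so nested constant `If` nodes in the chosen branch collapse too.
            result.extend(inline_constant_ifs(branch or [], constants))
        else:
            result.append(node)
    return result


# Two nested `If` nodes whose conditions are both known constants.
inner = Node("If", ["c2"], ["y"],
             then_branch=[Node("Add", ["a", "b"], ["y"])],
             else_branch=[Node("Sub", ["a", "b"], ["y"])])
outer = Node("If", ["c1"], ["y"],
             then_branch=[inner],
             else_branch=[Node("Mul", ["a", "b"], ["y"])])

flat = inline_constant_ifs([outer], {"c1": True, "c2": False})
print([n.op_type for n in flat])  # ['Sub']
```

Once the nested `If` nodes are gone, the surviving nodes sit in the main graph where ordinary constant folding can reach them.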
### Motivation and Context
Newly converted models feature lots of nested `If` nodes that can be
inlined and collapsed.
In particular, for the sample models we are closing the gap with
TorchScript exported models.
For `HF Mobile Bert Dynamo`, runtime went down from 0.069 -> 0.046. In
total, AOT inlining + `If` constant folding yields an improvement of
about 50% (0.102 -> 0.046), bringing us very close to TorchScript
exported models.
`HF Bart Dynamo` further improves from 0.668 -> 0.45. AOT + `If`
constant folding improves it from 0.98 -> 0.45.
Model sizes also shrank:
- HF Mobile Bert: previously **161MB+**, now **98MB**
- HF Bart Dynamo: the pre-optimized model was about **1.2GB**, now **710MB**

### Description
Only one of "--cuda_version" and "--cuda_home" is needed. If both are
specified, the first one takes precedence. Since we now download CUDA
SDKs on the fly, the machines no longer need a preinstalled CUDA SDK and
therefore will not have the VS-CUDA integration extension, so the
"--cuda_version" flag will not work. This PR deletes such usages.
Related PR: #15915
### Description
Updates QNN EP to force Split operators to use the same quant params for
all inputs/outputs (only if they were already nearly equal). This can be
necessary for the sequence Sigmoid -> Split because QNN requires Sigmoid
ops to override output quant params to specific values.
Also did the same for the following operators that do not change input
data:
- Expand
- Gather
- MaxPool
- Reshape/Flatten/Squeeze/Unsqueeze
- Resize
- Split
- Tile
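A minimal sketch of the "unify only if already nearly equal" rule described above. The `QuantParams` type, the function name, the tolerance, and the choice of the first entry as the reference are all assumptions made for illustration; the QNN EP's actual check may differ.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuantParams:
    """Hypothetical container for one tensor's quantization parameters."""
    scale: float
    zero_point: int


def unify_if_nearly_equal(params: List[QuantParams], rel_tol: float = 1e-4) -> List[QuantParams]:
    """Replace every entry with the first one when all entries are nearly equal.

    If any entry differs beyond the (assumed) tolerance, a copy of the
    original list is returned unchanged.
    """
    if not params:
        return []
    ref = params[0]
    nearly_equal = all(
        p.zero_point == ref.zero_point and math.isclose(p.scale, ref.scale, rel_tol=rel_tol)
        for p in params
    )
    return [ref] * len(params) if nearly_equal else list(params)


# Input and output scales that differ only by rounding noise get unified.
close = [QuantParams(0.00390625, 0), QuantParams(0.00390626, 0)]
print(unify_if_nearly_equal(close))

# Clearly different params are left alone.
far = [QuantParams(0.0039, 0), QuantParams(0.01, 0)]
print(unify_if_nearly_equal(far))
```

Treating the first entry as the reference fits the Sigmoid -> Split case, where the Split input carries the quant params that Sigmoid is required to produce.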
### Motivation and Context
The QNN HTP backend employs certain optimizations when all the
quantization parameters for the Split operator are equal. We need to
ensure they are equal to get better inference latency performance.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
Update SDXL demo to test more configurations (including every scheduler).
Update documents to add instructions for running demo in docker.
Update package version in requirements.
Enable custom fp16 VAE in TensorRT for fair comparison.
### Description
Use a different march flag to work around what appears to be a clang issue.
See https://github.com/tensorflow/tensorflow/issues/59970 for links to
various relevant pieces of info/discussions.
### Motivation and Context
1. Fix dist setting bug for LLaMA2-70b distributed convert and benchmark.
2. Add instructions in README for how to benchmark LLaMA2-70b distributed
inference.
### Description
This PR fixes the TypeScript type check.
Previously, when I used esbuild to replace webpack (#17745), the
TypeScript type check was disabled. This allowed a few TypeScript type
errors to be checked in to the code base. This PR fixes the following:
- Use "Node16" as default "module" value in tsconfig.json, because in
TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is
an invalid combination.
- Set `noUnusedParameters` to true by default. In web, override it to
false because code in multiple places needs to be updated (a follow-up
PR will do this)
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
### Description
Add 32-bit patch binary and infra to fall back to it. The Azure DevOps
Windows CIs are missing patch.exe from their git install for some reason
so the default `find_package(Patch)` fails as that is where it expects
to find it.
Remove Eigen patch. The underlying issue was fixed in source 3 years ago
by c6c84ed961, and the patch command is invalid (args are for git apply
not patch).
### Motivation and Context
Make usage of patch consistent across all CIs
Fix https://github.com/microsoft/onnxruntime/issues/15248
### Description
Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the
GSL package that is found on the system is used. However, when cuda is
enabled, it is downloaded and patched. Most CMake scripts rely on the
first case and forget about the second. This patch makes the second case
behave like the first case.
### Motivation and Context
This is an issue that occurs 'in the wild'. For example, I had to patch
this to be able to enable the CUDA provider for the onnxruntime conan
package (see https://github.com/conan-io/conan-center-index/pull/20392).
### Description
Adds the extra option `MinimumRealRange` to the quantization script:
```python3
"""
MinimumRealRange= float|None :
Default is None. If set to a floating-point value, the calculation of the quantization parameters
(i.e., scale and zero point) will enforce a minimum range between rmin and rmax. If (rmax - rmin)
is less than the specified minimum range, rmax will be set to rmin + MinimumRealRange. This is
necessary for EPs like QNN that require a minimum floating-point range when determining
quantization parameters.
"""
```
### Motivation and Context
QNN requires a minimum floating-point range of 0.0001.
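The adjustment described in the docstring can be sketched as follows. The function name `apply_minimum_real_range` is hypothetical and is not taken from the quantization script; only the rule itself (widen `rmax` when the range is too small) comes from the description above.

```python
from typing import Optional, Tuple


def apply_minimum_real_range(
    rmin: float, rmax: float, min_real_range: Optional[float]
) -> Tuple[float, float]:
    """Widen [rmin, rmax] so that it spans at least `min_real_range`.

    With `min_real_range=None` (the default of the option), the range is
    returned unchanged.
    """
    if min_real_range is not None and (rmax - rmin) < min_real_range:
        rmax = rmin + min_real_range
    return rmin, rmax


# A nearly constant tensor: the range is narrower than the QNN minimum of 0.0001.
rmin, rmax = apply_minimum_real_range(0.0, 0.00001, 0.0001)
print(rmin, rmax)  # 0.0 0.0001

# The scale for 8-bit quantization is then derived from the widened range.
scale = (rmax - rmin) / 255
print(scale)
```

Only `rmax` moves, so `rmin` and the quantized representation of the lower bound stay where the calibration data put them.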
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
This PR gets the onnxruntime Rust bindings to a foundation where they
can be extended and validated as the onnxruntime progresses.
Specifically, the PR does the following.
- fixes some of the existing compilation issues caused by missing enum
values for output tensor data types.
- introduces a `just vendor` task that will vendor the source code from
the onnxruntime to enable a common base directory within the crate
directory rather than using a relative parent path. This enables `cargo
package` to archive the onnxruntime native code, which will enable
consumers of the onnxruntime-sys crate to compile on their target.
- introduces a GH action to lint the Rust code (rustfmt, clippy), build
the library, validate through tests, and validate that the crate can be
packaged correctly.
TODOs:
- [x] This PR is based on #18200 and will need to be rebased once that
PR is merged.
### Motivation and Context
This is the first step to getting new onnxruntime Rust crates published
through this project, which will unblock community Rust projects which
would like to take a dependency on onnxruntime Rust.
Follow up work to enable publication of onnxruntime Rust crates:
- change name of the crates to be published (onnxruntime-rs and
onnxruntime-sys are already taken and we'll need new names)
- update authors / license to reflect contributions from previous
maintainer(s) and new maintainers
- introduce a crate publish GH action or ADO pipeline
---------
Signed-off-by: David Justice <david@devigned.com>
### Description
Adds static int8 quantization support for the MIGraphX Execution Provider
- Allows parsing calibration tables generated by Onnxruntime or
TensorRT's toolsets
- Adds proper environment variables to the MIGraphX EP
- Updates the Python API to include updating execution provider flags,
which was missing on the Python side
- Hooks into MIGraphX's int8 quantization and optimization of models
### Motivation and Context
Required so that we can get onnxruntime to pass in models while
leveraging the existing tooling for int8 static QDQ quantization.
First step in a series of PRs which will add further static quantization
on the operator level as MIGraphX releases further support.
These changes draw heavily on the TensorRT EP and should allow for similar
functionality for GPU based (versus CPU) quantization of models before
an inference is performed.
---------
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
Enable option qnn_context_priority to set QNN context priority, options:
"low", "normal", "normal_high", "high".
This feature lets a model run inference with higher priority. Tested
with the onnxruntime_perf_test tool using the same model.
1. Run the model on the NPU with a single instance: the latency is 300ms.
2. Run the same model on the NPU with 2 instances at the same time.
- Case 1: both with the same priority (high): latency is 600ms
- Case 2: one with low priority: latency is 30,000ms; one with high
priority: latency is 300ms
- Case 3: one with normal priority: latency is 15,000ms; one with high
priority: latency is 300ms
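As a rough illustration, the option could be assembled and validated on the Python side before being handed to the session. The helper `qnn_provider_options` is hypothetical; only the option name and its four values come from the description above, and the backend library path in the usage lines is an assumption about the local setup.

```python
from typing import Dict

# The four values listed in the description above.
VALID_PRIORITIES = ("low", "normal", "normal_high", "high")


def qnn_provider_options(priority: str, **extra: str) -> Dict[str, str]:
    """Build a QNN EP provider-options dict with a validated context priority."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(
            f"qnn_context_priority must be one of {VALID_PRIORITIES}, got {priority!r}"
        )
    options = dict(extra)
    options["qnn_context_priority"] = priority
    return options


# "QnnHtp.dll" is an assumed backend library path for this example.
options = qnn_provider_options("high", backend_path="QnnHtp.dll")
print(options)

# The dict would then be passed to the session, for example:
#   onnxruntime.InferenceSession(
#       "model.onnx",
#       providers=["QNNExecutionProvider"],
#       provider_options=[options],
#   )
```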
### Description
Registers BFloat16 datatype as valid input type for CUDA Neg Kernel.
### Motivation and Context
Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Add an op named `FlattenAndUnpad`.
This op does the following:
1. Flattens the first two dims of the input tensor.
2. Gathers the valid values from the input tensor using an index tensor.
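The two steps can be sketched in plain Python on nested lists. This is only an illustration of the semantics described above, not the CUDA kernel; the function name and signature are hypothetical, and the real op may produce additional outputs.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def flatten_and_unpad(batch: Sequence[Sequence[T]], indices: Sequence[int]) -> List[T]:
    """Flatten the first two dims, then gather the entries listed in `indices`."""
    # Step 1: [batch, seq, ...] -> [batch * seq, ...]
    flat = [item for row in batch for item in row]
    # Step 2: keep only the valid (non-padding) positions.
    return [flat[i] for i in indices]


# Two sequences padded with 0 to length 3; indices point at the real tokens.
padded = [[1, 2, 0],
          [3, 0, 0]]
valid_indices = [0, 1, 3]
print(flatten_and_unpad(padded, valid_indices))  # [1, 2, 3]
```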
### Motivation and Context
The grad op of `PadAndUnflatten` was `GatherGrad`, which performs
poorly.
`FlattenAndUnpad` is implemented to replace `GatherGrad` as the grad of
`PadAndUnflatten`.
With this op, we can also simplify the "Reshape + ShrunkenGather"
pattern to `PadAndUnflatten` in the padding elimination optimizer, which
will also improve performance.
### Description
Helper to run clang-format on lines that are > 120 chars.
We disable clang-format enforcing 120 chars by default because its
formatting can negatively impact readability. If a developer has not
manually kept a line within the 120 char limit, this tool will fix it. It
will leave all other lines alone to honor the formatting the developer
chose.
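One way such a helper could work is sketched below: find the over-long lines, then restrict clang-format to exactly those lines with its `--lines=<start>:<end>` option. Whether the actual helper in this PR is built this way is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, List

MAX_LINE_LENGTH = 120


def long_line_numbers(lines: Iterable[str], limit: int = MAX_LINE_LENGTH) -> List[int]:
    """Return the 1-based numbers of lines longer than `limit` (newline excluded)."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.rstrip("\r\n")) > limit
    ]


def clang_format_args(path: str, line_numbers: List[int]) -> List[str]:
    """Build a clang-format command that only touches the given lines, in place."""
    args = ["clang-format", "-i"]
    # clang-format accepts several --lines ranges in one invocation.
    args += [f"--lines={n}:{n}" for n in line_numbers]
    args.append(path)
    return args


source = ["int a = 1;\n", "x" * 121 + "\n", "y" * 120 + "\n"]
numbers = long_line_numbers(source)
print(numbers)  # [2]
print(clang_format_args("example.cc", numbers))
```

The resulting argument list could be passed to `subprocess.run`; lines at or under the limit never appear in a `--lines` range, so their formatting is left as the developer wrote it.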
### Motivation and Context
Help developers fix lint errors.
The preferred approach is to use a vertical ruler/guideline in your
editor when actually writing the code.
### Description
1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch
issues listed in #18195.
2. Revert eigen's commit id back to what we had before.
### Motivation and Context
To catch cross-compile issues.
Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639
Exempt all issues w/ assignees from stale bot, increase days before
issue close, + add start date to address issue w/ GH API rate limiting
### Description
Fix the broken pieces due to the latest Abseil update.
### Motivation and Context
Make the debugging bearable.
### Description
Adds the QNN session option `htp_graph_finalization_optimization_mode`
to enable QNN graph optimizations at the expense of longer preparation
time.
### Motivation and Context
Allow enabling QNN graph optimizations per app/model.
This PR updates the replacement of MHA with GQA and updates the LLaMA
scripts for the modified GQA op. It is related to the changes in [this
PR](https://github.com/microsoft/onnxruntime/pull/18283).
### Motivation and Context
This PR allows us to run LLaMA with the GQA op end-to-end using ragged
batching (i.e. batched inputs of different lengths).
### Description
Registers BFloat16 datatype as valid input type for CUDA QuickGeluGrad
Kernel.
### Motivation and Context
Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Tune logging experience a bit
After the last update to the ORTModule logging experience, we found a
few issues:
1. The `INFO` level outputs too much, including PyTorch exporter
verbose logs (tracing graphs) on every rank. At this level, we only
want to
- Output a little more information to users than the `WARNING` level,
for example the memory recomputation recommendations or other
not-fully-ready features.
- Output a little more information for a quick diagnostic, collected
on rank-0 only.
2. The ONNX Runtime logging filter during graph build and session init
sometimes hides issues (for example a segmentation fault), leaving no
useful information in `WARNING`/`INFO` for users to report to us. This
is not good!
3. Some of our devs like using `pdb` to debug Python code, but adding
`import pdb; pdb.set_trace()` to a model's code might hang when they use
`INFO` or `WARNING`, where the export happens and all output gets
redirected due to log filtering. The only workaround is to switch to
VERBOSE, which outputs far too many logs.
The corresponding changes proposed here are:
1. For `INFO` logging,
- We only log on rank-0.
- We restrict the ORT backend logging level to WARNING in this
case, because ORT backend code outputs way too many logs that should be
under verbose, and we cannot guarantee we can get them cleaned up
immediately once they are added.
- We output the PyTorch exporter verbose log (including tracing graph),
which is useful for a quick diagnostic when an issue happens.
2. Remove all logging filtering on the ORT backend, so the segmentation
fault details will not be hidden if it happens again.
3. Introduce a `DEVINFO` logging level, which
- Logs on all ranks
- Sets the ORT backend logging level to INFO
- Turns OFF all PyTorch exporter logging filtering (to unblock pdb
debugging).
4. Currently, using Memory Optimizer requires DEVINFO (which will
output the ORT backend INFO log), so update the memory optimizer
document to reflect this.
https://github.com/microsoft/onnxruntime/pull/17481 will change the
requirement back to INFO for showing memory optimization info.
You can check
https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations
for a better view of different log levels.
This PR also extracts some changes from a bigger one,
https://github.com/microsoft/onnxruntime/pull/17481, to reduce its
complexity for review.
---------
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>