### Description
Re-implement stacktrace. The new implementation doesn't call the
Windows API directly, and therefore avoids the problems with
initializing and uninitializing the dbghelp library.
### Motivation and Context
### Description
On Windows, the NVTX headers are not duplicated into the normal CUDA
include directory. On Linux they are:
```
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
```
Is the preference to handle this via those added defines, or should the
include just be changed to `nvtx3/`?
Also, no library linking is needed on Windows, and the library is not
even present there.
### Description
Adds continuous integration and pull-request validation triggers
directly to the yaml file for the Windows x64 QNN CI Pipeline.
### Motivation and Context
There have been various unit test failures that break the
QNN_Windows_Nuget pipeline, which builds QNN EP for Windows x64. This PR
ensures that QNN EP is built and tested on a Windows x64 image for every
pull request.
Dump statistics of the input and/or output tensors of each node. This
can help to find out why a model outputs NaN.
To use this tool, add `--cmake_extra_defines
onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1` when building the onnxruntime
package. Then set these environment variables before running a model
with onnxruntime:
```
export ORT_DEBUG_NODE_IO_DUMP_INPUT_DATA=1
export ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA=1
export ORT_DEBUG_NODE_IO_DUMP_STATISTICS_DATA=1
```
The statistics are then appended after the dump of the input and output
tensors.
One possible reason an FP16 or mixed precision model outputs NaN is that
some number exceeds the FP16 range (the maximum FP16 value is 65504).
When an FP32 model has a value > 65504 in a node output, that value
becomes INF once the node is converted to FP16. In this case, you need
to keep the related nodes in FP32 to avoid the issue. You can dump
tensor statistics of the FP32 model to find such candidate nodes.
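As a rough illustration of that last step, the sketch below flags nodes whose largest absolute FP32 output exceeds the FP16 limit. The node names and values are made up, and the code assumes you have already collected one maximum absolute value per node from the dump yourself; it does not parse the dump format.

```python
from typing import Dict, List

FP16_MAX = 65504.0  # largest finite FP16 value, as quoted above


def fp16_overflow_candidates(node_abs_max: Dict[str, float]) -> List[str]:
    """Return the names of nodes whose largest absolute FP32 output exceeds FP16_MAX."""
    return sorted(name for name, abs_max in node_abs_max.items() if abs_max > FP16_MAX)


# Hypothetical statistics: node name -> max(|value|) seen in that node's output.
example_stats = {"MatMul_12": 312.5, "Exp_40": 1.2e6, "Add_41": 70000.0}
print(fp16_overflow_candidates(example_stats))  # ['Add_41', 'Exp_40']
```

Nodes returned by this check are the ones to consider keeping in FP32.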
### Description
Fix a typo. LayerNormalization takes 2 or 3 inputs. The third input,
bias, is optional.
### Motivation and Context
### Description
Bug fix for OVEP graph provider options and fallback
### Motivation and Context
Adds bug-fix logic to handle the fallback to CPU EP.
Corner-case assertions are added for ProviderOptions in OpenVINO.
---------
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
### Description
### Motivation and Context
(1) Move attention test data from code to file to avoid a prefast crash
(which blocks the python packaging pipeline)
(2) Enable some test cases that were previously disabled on Windows
(3) Fix an assertion error in
`MultiHeadAttentionTest.CrossAttention_WithPastPassedInDirectly_NoMask`
This test case is for Whisper cross attention. When memory efficient
attention was enabled, the format was converted to BNSH, which triggered
an assertion error since memory efficient attention asserts BSNH format.
Temporarily disable memory efficient attention for this case. I also
disabled the test since Whisper does not use it anymore, and ROCm fails
in the test.
### Description
1. Allows passing session options to the operator test (e.g. graph
optimization level).
2. Adds a short flag '-x' for '--wasm-number-threads', as it is
frequently used.
### Description
Since WebGPU supports only float32 and int32, placing Gather, Reshape,
Shape, Squeeze and Unsqueeze ops with other data types on WebGPU creates
additional MemCpy ops and slows down the overall execution, because all
other ops on tensors of those types are done on CPU.
Before this patch SD Unet had these numbers:
Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1141
Node(s) placed on [JsExecutionProvider]. Number of nodes: 4025
memcpy tokens: 2001
After patch:
Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1735
Node(s) placed on [JsExecutionProvider]. Number of nodes: 2243
memcpy tokens: 813
It also gives a more than 5x performance benefit: one Unet step went
from 12 sec to 2.2 sec on an RTX 3090 Ti, so we are almost getting to
native performance.
UPD: with the latest changes from the main branch and multi-threading it
went down to 1.6 sec. I will try re-exporting my model to ONNX with
maximum optimizations, like using MultiHeadAttention to decrease node
count. Maybe after that it can get under 1 sec.
### Description
The onnxruntime-CI-nightly-ort-pipeline encounters occasional failures
due to synchronization discrepancies between the ACPT nightly image and
the repository. We are addressing this by executing tests using the
commit ID associated with the ort build within the ACPT image.
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
1. Clean up cmake files. Remove some unused code
2. Remove the "Semmle" task from
tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is
deprecated and replaced by CodeQL.
### Description
- Removed one unused import
- Escaped a backslash in a path
### Motivation and Context
I see this `DeprecationWarning` when I import `onnxruntime`:
```
onnxruntime/capi/_pybind_state.py:28: DeprecationWarning: invalid escape sequence '\S'
"(other than %SystemRoot%\System32), "
```
A future version of Python (maybe 3.13?) will raise a `SyntaxError` for
invalid escape sequences.
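A minimal sketch of the issue and the two usual fixes, using the string from the warning above. The `compile` call is only there to reproduce the warning in isolation; on older Python versions it is a `DeprecationWarning`, and on 3.12 and later a `SyntaxWarning`.

```python
import warnings

# Two equivalent ways to write a literal backslash without an invalid escape.
escaped = "(other than %SystemRoot%\\System32), "
raw = r"(other than %SystemRoot%\System32), "
assert escaped == raw

# Compile source text that contains the unescaped "\S" to trigger the warning.
source = '"%SystemRoot%\\System32"'  # the source text is: "%SystemRoot%\System32"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile(source, "<example>", "eval")

print([w.category.__name__ for w in caught])
```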
### Description
1. `onnxruntime_fetchcontent_makeavailable` works around unconditional
install commands, so it can be used instead of `FetchContent_Populate`
2. This dependency is Windows-specific; mark it as such.
### Motivation and Context
1. This simplifies `cmake/external/wil.cmake`, which no longer needs to
do anything specific depending on whether WIL was fetched or found
2. Given it's specific to Windows, it might not be available on other
OSes in specific air-gapped environments such as
[conan-center-index](https://github.com/conan-io/conan-center-index).
This allows downstream builds not to require specific patches for
something not required by the build in the first place.
The notebooks are not up to date.
(1) Update BERT and GPT-2 optimization notebooks for CPU EP with latest
PyTorch and ONNX Runtime.
(2) Add links to quantization example
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/16515
### PythonOp Enhancement: Bool and Tuple[Bool] Constants, Materialize Grads, Empty Inputs, Save In Context
1. Support `bool` or `Tuple[bool]` constant type in inputs.
2. Support `ctx.set_materialize_grads(True|False)`
3. Backward op can accept empty inputs (that don't require grad)
4. Special handling for ORT tensors that are saved in context
**Scenario**: a tensor is generated by ORT and might then be saved for
backward by `ctx.save_for_backward(tensor)`. Because `tensor`'s
reference count is not increased in ORT's allocation plan, ORT may
release the tensor data before it is used in backward.
**Currently**: we copy every tensor before running
autograd.Function.forward(). This can be a problem when there are many
PythonOps (for example, ZeRO stage 3).
**Proposal**: To avoid unnecessary copies of tensors that are not saved
in context, this change introduces a `_GlobalOpKernelInfoMap`. During
the kernel's first run, we still copy all tensors generated by ORT and
pass them to torch.autograd.Function to run. We then check which inputs
need to be saved in context and record their input indices in
`_GlobalOpKernelInfoMap`. In later iterations, we copy only what is
needed.
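The proposal can be sketched in plain Python, without torch or ORT. Everything here is a hypothetical stand-in: `FakeContext` mimics only `save_for_backward`, lists play the role of tensors, and `saved_input_indices` plays the role of `_GlobalOpKernelInfoMap`. The real implementation may detect saved inputs differently.

```python
from typing import Callable, Dict, List, Sequence, Set


class FakeContext:
    """Stand-in for the autograd context; it only records what was saved."""

    def __init__(self) -> None:
        self.saved: List[list] = []

    def save_for_backward(self, *tensors: list) -> None:
        self.saved.extend(tensors)


# Hypothetical cache: kernel id -> indices of inputs that forward() saved.
saved_input_indices: Dict[str, Set[int]] = {}
stats = {"copies": 0}


def copy_tensor(tensor: list) -> list:
    stats["copies"] += 1
    return list(tensor)


def run_forward(kernel_id: str, forward: Callable, inputs: Sequence[list]):
    ctx = FakeContext()
    known = saved_input_indices.get(kernel_id)
    if known is None:
        # First run: nothing is known yet, so copy every input.
        args = [copy_tensor(t) for t in inputs]
    else:
        # Later runs: copy only the inputs that were saved on the first run.
        args = [copy_tensor(t) if i in known else t for i, t in enumerate(inputs)]
    output = forward(ctx, *args)
    if known is None:
        saved_ids = {id(s) for s in ctx.saved}
        saved_input_indices[kernel_id] = {
            i for i, a in enumerate(args) if id(a) in saved_ids
        }
    return output, ctx


def forward(ctx, a, b):
    ctx.save_for_backward(a)  # only the first input is kept for backward
    return [x + y for x, y in zip(a, b)]


inputs = ([1.0, 2.0], [3.0, 4.0])
run_forward("example_kernel", forward, inputs)  # first run: copies both inputs
run_forward("example_kernel", forward, inputs)  # later run: copies only input 0
print(stats["copies"], saved_input_indices)  # 3 {'example_kernel': {0}}
```

The trade-off is one fully copied first iteration in exchange for fewer copies in every later one.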
### Description
- Disables Resize tests that use nearest mode on QNN CPU.
- Fixes indentation problems on yaml for win x64 qnn pipeline.
### Motivation and Context
The QNN windows Nuget pipeline does not run due to failing unit tests on
Windows x64. These tests should not be enabled until we determine the
rounding behavior of QNN's ResizeNearestNeighbor operator.
On Windows, clang-format has a bug when AlignTrailingComments.Kind is
set to `Leave`
(https://clang.llvm.org/docs/ClangFormatStyleOptions.html#aligntrailingcomments),
where it keeps adding indentation to comments on each formatting run.
This PR changes the option to always align comments so we do not hit the
bug.
As a consequence of the option change, we need to reformat some of the
files. Note that this option is aligned with the rest of the repository.
### Description
### Motivation and Context
The ROCm python package pipeline failed because this PR
(https://github.com/microsoft/onnxruntime/pull/16325) changed the onnx
version to a commit, so we need to build onnx from source. A low
protobuf version causes build errors.
This PR removes `cmake` and `protobuf` from the Dockerfile; these two
will be installed by `install_os_deps.sh`.
### Description
Remove the onnxruntime-extensions submodule, since it is now used via
cmake FetchContent.
### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
### Description
The correct name should be onnxruntimecpubuildpython
### Motivation and Context
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Allow defining customized PythonOp shape inferer
For `torch.autograd.Function`, we convert it to PythonOp in MSDomain.
There are two places that do shape inference for it:
1. in SymbolicShapeInfer.
2. in the PythonOp op definition.
For a common PythonOp, since we don't know the relationship between
inputs and outputs, we only infer the output ranks and generate a
symbolic dimension for each dim. This introduces many meaningless
symbolic dimensions, which sometimes block our graph transformers from
doing op fusion.
This PR provides a way to define custom shape inference for a
`torch.autograd.Function` we defined, to propagate the original
dimensions across the PythonOp on a best-effort basis.
The 2nd place is not covered yet; we could refine that later. Fixing the
1st one is enough for ORTModule training/evaluation.
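The difference between the generic fallback and a custom inferer can be sketched as follows. This is an illustration only: `register_shape_inferer`, `infer_output_shapes` and `MyFunc` are hypothetical names, not the registration API added by this PR.

```python
from typing import Callable, Dict, List, Sequence, Union

Dim = Union[int, str]  # a concrete size or a symbolic dimension name
Shape = List[Dim]
ShapeInferer = Callable[[Sequence[Shape]], List[Shape]]

# Hypothetical registry: autograd function name -> custom shape inferer.
custom_inferers: Dict[str, ShapeInferer] = {}


def register_shape_inferer(func_name: str, inferer: ShapeInferer) -> None:
    custom_inferers[func_name] = inferer


def infer_output_shapes(
    func_name: str, input_shapes: Sequence[Shape], output_ranks: Sequence[int]
) -> List[Shape]:
    inferer = custom_inferers.get(func_name)
    if inferer is not None:
        # Custom path: the function author knows how dims map to outputs.
        return inferer(input_shapes)
    # Generic path: only ranks are known, so invent a fresh symbol per dim.
    return [
        [f"{func_name}_out{o}_dim{d}" for d in range(rank)]
        for o, rank in enumerate(output_ranks)
    ]


shapes = [["batch", "seq", 768]]
print(infer_output_shapes("MyFunc", shapes, [3]))
# [['MyFunc_out0_dim0', 'MyFunc_out0_dim1', 'MyFunc_out0_dim2']]

# An elementwise function can simply forward its first input's shape.
register_shape_inferer("MyFunc", lambda input_shapes: [list(input_shapes[0])])
print(infer_output_shapes("MyFunc", shapes, [3]))
# [['batch', 'seq', 768]]
```

With the custom inferer, the original `batch` and `seq` symbols survive across the op, which is what lets later passes match shapes.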
### Motivation and Context
### Description
The test case 'test_batchnorm_epsilon_training_mode' is failing on
webgpu. The issue needs time to investigate, so comment it out for now
and re-enable it when the root cause is fixed.
### Description
Adding int4 quantization code in python
### Motivation and Context
The Python quantization tool no longer needs to invoke a shell to call a
native exe.
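This description does not say which int4 scheme the PR implements, so the sketch below is only a generic illustration of 4-bit quantization in pure Python: one scale and zero point per block, with two 4-bit values packed into each byte. The function names and the asymmetric scheme are assumptions, not the code added here.

```python
from typing import List, Tuple


def quantize_block_int4(values: List[float]) -> Tuple[bytes, float, int]:
    """Quantize one block to unsigned 4-bit codes, packed two per byte."""
    lo = min(min(values), 0.0)  # widen the range so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 15 or 1.0  # 15 steps between codes 0 and 15
    zero_point = min(15, max(0, round(-lo / scale)))
    codes = [min(15, max(0, round(v / scale) + zero_point)) for v in values]
    if len(codes) % 2:
        codes.append(zero_point)  # pad odd-length blocks with a zero value
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, zero_point


def dequantize_block_int4(
    packed: bytes, scale: float, zero_point: int, count: int
) -> List[float]:
    """Unpack the low then high nibble of each byte and map back to floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [(c - zero_point) * scale for c in codes[:count]]


values = [-1.0, 0.0, 0.5, 2.0]
packed, scale, zero_point = quantize_block_int4(values)
restored = dequantize_block_int4(packed, scale, zero_point, len(values))
print(len(packed), restored)
```

Four floats fit in two bytes here, and the restored values approximate the originals to within the quantization step.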
### Description
This PR fixes the build break for WebAssembly introduced in
6986981482
(435ad2b1d8).
This change updates onnx.patch in the onnxruntime repo. The
corresponding PR in the onnx repo is:
https://github.com/onnx/onnx/pull/5495.
It may take a while until the next onnx version bump.