### Description
Add a single test step to the Windows GPU Reduced Ops workflow.
### Motivation and Context
The old workflow ran the build and the tests in one command.
PR #17263 removed the test step by mistake, so this PR adds it back.
How to consolidate the test step is still under consideration.
### Description
### Motivation and Context
The test is skipped in the PR:
```
2023-08-25T02:37:48.7772670Z 1: [ RUN ] ModelTests/ModelTest.Run/cuda__models_opset9_Candy_candy
2023-08-25T02:37:48.7824755Z 1: D:\a\_work\1\s\onnxruntime\test\providers\cpu\model_tests.cc(91): Skipped
2023-08-25T02:37:48.7825343Z 1: Skipping single test It's in broken_tests
```
### Description
Release OrtEnv before the main function returns. Before this change, OrtEnv
was deleted when the C/C++ runtime destructed the global variables in ONNX
Runtime's core framework.
The call stack looked like this:
```
* frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7
frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2
frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17
frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1
frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2
frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17
frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261
frame #7: 0x00007ffff7857630 libc.so.6`exit + 32
frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135
frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128
frame #10: 0x0000000000abbdee node`_start + 46
```
After this change, OrtEnv is deleted before the main function
returns, while Node.js is still alive.
### Fix layernorm and softmax axis after upstream
For a Gather whose index is a scalar, the output rank is smaller than the
input rank.
When we upstream this kind of Gather before Softmax or LayerNorm, we
must also update the axis attribute.
Otherwise, the axis may be stale and invalid for the reduced
rank.
```
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 157, in handle_exception
raise exception
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 280, in forward
self._build_graph(graph_transformer_config)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 158, in wrapper
result = func(graph_execution_manager, *args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 273, in wrapper
result = func(graph_execution_manager, *args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 361, in _build_graph
super()._build_graph(graph_transformer_config)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 184, in _build_graph
self._graph_builder.build(config)
RuntimeError: /onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:823 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Node (Softmax_2904) Op (Softmax) [ShapeInferenceError] 'axis' must be in [-3 , 2]. Its actual value is: 3
```
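The axis update can be illustrated with a minimal Python sketch. The helper name `adjust_axis_after_scalar_gather` and its parameters are hypothetical and only model the rank arithmetic; this is not ORT's graph transformer code. It assumes the Gather uses a scalar index, so exactly one dimension is dropped.

```python
def adjust_axis_after_scalar_gather(axis: int, input_rank: int, gather_axis: int) -> int:
    """Return the Softmax/LayerNorm axis that stays valid after a scalar-index
    Gather on `gather_axis` has been moved in front of the op.

    Hypothetical helper for illustration; not part of ONNX Runtime.
    The result is non-negative and refers to the reduced (rank - 1) shape.
    """
    # Normalize both axes to non-negative values against the original rank.
    norm_axis = axis + input_rank if axis < 0 else axis
    norm_gather = gather_axis + input_rank if gather_axis < 0 else gather_axis
    if not (0 <= norm_axis < input_rank and 0 <= norm_gather < input_rank):
        raise ValueError("axis is out of range for the given rank")
    if norm_gather == norm_axis:
        # The op's own axis would be removed, so the Gather cannot be moved.
        raise ValueError("cannot upstream a Gather that removes the op's axis")
    # Dropping a dimension to the left of the op's axis shifts that axis by one.
    return norm_axis - 1 if norm_gather < norm_axis else norm_axis
```

For example, with a rank-4 input, axis 3, and a Gather on dimension 1, the helper returns 2, which falls inside the `[-3, 2]` range quoted in the error above.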
### Description
[QNN EP] Support non-quantized Op on HTP
1. Remove the limitation in GetCapability that always requires a QDQ node
unit group to partition a node onto the NPU backend, so that we can
support a non-quantized Slice op with int32 data input on HTP.
2. Enable the Where QDQ node unit.
3. Separate the is_npu_backend and is_quantized_node flags to make their
roles clear.
4. Separate out QuantizeLinear and DequantizeLinear into QdqOpBuilder to
better identify quantized/non-quantized input/output tensors.
5. Separate out a TransposeOpBuilder to simplify Transpose node
processing, especially for a single Transpose node in a QDQ model. In a
case like Q->Transpose->DQ, the Transpose is not a QDQ node unit group
but a single node, yet we should treat it as a quantized node: its
output should have the same data type and quantization parameters as its
input. Another case is supporting non-quantized data for Transpose in a
QDQ model.
6. Remove the is_npu_backend flag from the OpBuilder interface. Set the
backend type in QnnBackendManager, QnnModel & QnnModelWrapper, so that
OpBuilders can always get it from QnnModelWrapper.
7. Add unit tests for quantized/non-quantized Transpose (int32, float32)
on the HTP backend.
### Fix build - redefinition of default argument for ‘long unsigned int Extent’
Building ORT in one training customer's environment hits the build error
shown below. The GCC versions are:
```
aiscuser@node-0:/tmp/onnxruntime$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
aiscuser@node-0:/tmp/onnxruntime$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
```
On our dev node, with the same GCC/G++ versions, the build succeeds. We
are not sure what the difference is, but giving an explicit type when
creating the `gsl::span` fixes the problem.
```
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:394:7: error: redefinition of default argument for ‘long unsigned int Extent’
394 | class span
| ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span_ext:46:51: note: original definition appeared here
46 | template <class ElementType, std::size_t Extent = dynamic_extent>
| ^~~~~~~~~~~~~~~
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:82:93: error: return type ‘class gsl::span<const std::byte>’ is incomplete
82 | [[nodiscard]] inline gsl::span<const std::byte> AsByteSpan(const void* data, size_t length) {
| ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h: In function ‘void onnxruntime::AsByteSpan(const void*, size_t)’:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: class template argument deduction failed:
83 | return gsl::span(reinterpret_cast<const std::byte*>(data), length);
| ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: no matching function for call to ‘span(const std::byte*, size_t&)’
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: candidate: ‘template<class Type, long unsigned int Extent> gsl::span(Type (&)[Extent])-> gsl::span<ElementType, FirstExtent>’
740 | span(Type (&)[Extent]) -> span<Type, Extent>;
| ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘Type [Extent]’ and ‘const std::byte*’
83 | return gsl::span(reinterpret_cast<const std::byte*>(data), length);
| ^
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: candidate: ‘template<class Type, long unsigned int Size> gsl::span(std::array<_Tp, _Nm>&)-> gsl::span<ElementType, FirstExtent>’
743 | span(std::array<Type, Size>&) -> span<Type, Size>;
| ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘std::array<_Tp, _Nm>’ and ‘const std::byte*’
83 | return gsl::span(reinterpret_cast<const std::byte*>(data), length);
| ^
```
### Introduce ZeROOffloadSubscriber for ORTModule
As part of the work to integrate ORTModule with DeepSpeed stage 3, this PR
mainly focuses on moving the original PyTorch-based (hook-driven) param
partition/offload implementation to an ORTModule-compatible implementation.
Changes include:
1. Refactor `SubscriberBase`/`SubscriberManager` to support
pre-forward/post-forward hooks.
2. Implement the new `ZeROOffloadSubscriber`, reusing DeepSpeed hook
functions as much as possible. All hook functions are defined in
`DeepSpeedZeRoOffload._register_hooks_recursively` and
`DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and their closures are
simple: every hook only references the owning `DeepSpeedZeRoOffload`
instance. We can therefore create new hook functions with
`FunctionType`, binding the owning `DeepSpeedZeRoOffload` instance, and
call the newly created functions in the subscriber's
`pre_forward_module_apply_impl` and `post_forward_module_apply_impl`
interfaces.
3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to
register the `ZeROOffloadSubscriber` for the model, so we don't need to
change any code in the DeepSpeed repo (at least so far).
4. Fix the ATen embedding custom symbolic exporter function to
tolerate a weight size of (0) (as changed by DeepSpeed ZeRO stage 3).
Unit tests will be added once stage 3 is fully supported.
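The `FunctionType` rebinding mentioned in item 2 can be sketched in plain Python. This is a minimal illustration, not the PR's code: `Owner`, `make_hook`, and `rebind_closure` are hypothetical names, and the sketch assumes Python 3.8+ (for `types.CellType`) and a closure whose only free variable is `self`.

```python
import types


class Owner:
    """Hypothetical stand-in for an object whose method defines hooks as closures."""

    def __init__(self, name):
        self.name = name

    def make_hook(self):
        def hook(module):
            # `self` is a free variable captured from make_hook.
            return f"{self.name} handled {module}"

        return hook


def rebind_closure(func, **free_vars):
    """Build a copy of `func` whose free variables point at new values."""
    code = func.__code__
    # Cells must be supplied in the same order as code.co_freevars.
    cells = tuple(types.CellType(free_vars[name]) for name in code.co_freevars)
    return types.FunctionType(
        code, func.__globals__, func.__name__, func.__defaults__, cells
    )


original = Owner("first").make_hook()
rebound = rebind_closure(original, self=Owner("second"))
print(original("layer"))  # first handled layer
print(rebound("layer"))   # second handled layer
```

The rebound function shares the original's code object, so the hook logic is reused unchanged and only the captured instance differs.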
### Description
Temporarily disable symbol tables.
### Motivation and Context
Local symbol tables cause unrelated shapes to be marked as reused, which
makes inference error out.
https://github.com/microsoft/onnxruntime/issues/17061
### Description
As title.
### Motivation and Context
Reduce the minimal build binary size for mobile to meet the office team's
requirement.
cc @chenfucn
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
- The allocation planner was breaking if the graph had no nodes.
- In this particular model, a branch of an If node returned an outer
scope value directly.
- If the model used non-tensor types and sparse tensors were disabled, the
call to IsSparseTensor threw an exception, which prematurely terminated
the code.
- It's perfectly fine to check whether a value is a sparse tensor when
support for them is disabled. We just can't do anything with that
OrtValue, which is what the current ifdefs after the call to
IsSparseTensor handle.
### Motivation and Context
Fix a model execution failure for a partner whose model uses sequences
in a minimal build with sparse tensors disabled.
### Description
Updated the code to pass in the missing parameter.
### Motivation and Context
Compile error. See https://github.com/microsoft/onnxruntime/issues/17139
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Prevent `GSL_SUPPRESS` arguments from being modified by clang-format and update existing usages.
clang-format was changing something like `GSL_SUPPRESS(r.11)` to `GSL_SUPPRESS(r .11)`.
For some compilers (e.g., clang), the `gsl::suppress` attribute takes a quoted string argument. We don't want to insert spaces there.
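As an illustration of the symptom, a small Python check could flag `GSL_SUPPRESS` invocations whose argument picked up whitespace. This is a hypothetical helper, not part of the PR or of clang-format; the function name and the regular expression are assumptions for the sketch.

```python
import re
from typing import List

# Matches a GSL_SUPPRESS(...) invocation and captures its argument.
_GSL_SUPPRESS = re.compile(r"GSL_SUPPRESS\(([^)]*)\)")


def find_mangled_suppressions(source: str) -> List[str]:
    """Return GSL_SUPPRESS invocations whose argument contains whitespace."""
    return [
        match.group(0)
        for match in _GSL_SUPPRESS.finditer(source)
        if re.search(r"\s", match.group(1))
    ]
```

Running it over a source string that contains both `GSL_SUPPRESS(r.11)` and `GSL_SUPPRESS(r .11)` would report only the second form.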
### Description
This PR adds benchmark scripts for Whisper. It is a follow-up to [this
PR](https://github.com/microsoft/onnxruntime/pull/17020) that adds the
LLaMA scripts.
### Motivation and Context
This PR enables benchmarking Whisper across various configurations.
### Description
Added JSEP Gemm registration for opset 13. Gemm was falling back to the
CPU provider because the CPU provider has a registration for opset 13.
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)
### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
Enable verbose logging in unit test program with environment variable.
E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.
### Description
Unify some pre-build common steps.
### Motivation and Context
In the long run, other devs should only need to focus on build options and
test commands.
Using common template steps reduces mistakes and maintenance cost.
More PRs will follow to achieve this goal.
### Description
Fix comment reference to a renamed public API.
### Motivation and Context
Avoid confusion caused by incorrect docs.
We want this in the 1.16 release.
### Description
1. Add a CUDA 12.x pipeline
2. Improve install_third_party_deps.ps1: avoid using Start-Process and
call the command directly instead.
### Motivation and Context
Since our official packages and all CI pipelines still use CUDA 11.x, we need extra pipelines to validate source-level compatibility with CUDA 12.x. Note that the prebuilt binaries on our release page are not compatible with CUDA 12.x; please do not report bugs for that.
AB#15152
### Description
This PR fixes broken hyperlinks in the documentation that should lead
users to Jupyter notebooks. The hyperlinks are updated so that they
correctly direct users to the notebooks.
### Motivation and Context
It fixes broken hyperlinks leading to the Jupyter notebooks.
### Description
When I worked on PR #17173, I didn't notice that
onnxruntime\core\platform\windows\debug_alloc.cc also needs to call
dbghelp functions like SymInitialize. If we use the VC runtime's
stacktrace functionality, the VC runtime initializes/uninitializes the
dbghelp library independently, and the VC runtime's stacktrace helper DLLs
get unloaded before our memory leak checker starts to work. When we then
call SymSetOptions, it crashes.
More details:
In the VC runtime, the C++23 stacktrace functions are implemented on top of
dbgeng.dll. In C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp,
you can see it has:
```
dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32);
```
dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize
and SymCleanup. dbgeng.dll gets unloaded before our memory leak check
starts to run. In theory, we should be able to call SymInitialize again
if the previous user who called SymInitialize has also called
SymCleanup. However, users can use
SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to
register callback functions with dbghelp.dll. These callback functions
need to be alive when SymSetOptions (and some other dbghelp APIs) get
called.
### Motivation and Context