### Description
1. Renames all references of on device training to training apis. This
is to keep the naming general. Nothing really prevents us from using the
same apis on servers\non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
- Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.
Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs
Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (Front end tools for training artifacts prep when using
trianing apis)
### Motivation and Context
Intention is to simply the options for building training enabled builds.
This is part of the larger work item to create dedicated build for
learning on the edge scenarios with just training apis enabled.
### Description
1. SkipLayerNormalization has a new output
(https://github.com/microsoft/onnxruntime/pull/13988) and the symbolic
shape inference script needs corresponding updates
2. The greedy sampling op
(https://github.com/microsoft/onnxruntime/pull/13426) shouldn't re-use
the logits buffer as its corresponding kernel doesn't seem to support it
yet.
### Motivation and Context
Fix some transformer issues
### Description
Force instance norm inputs to be 4D to better target metacommands
### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
### Description
Force layer norm inputs to be 4D to better target metacommands
### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
Implement CloudEP for hybrid inferencing.
The PR introduces zero new API, customers could configure session and
run options to do inferencing with Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
Sample configuration in python be like:
```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton');
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com');
sess_opt.add_session_config_entry('cloud.model_name', 'detection2');
sess_opt.add_session_config_entry('cloud.model_version', '7'); // optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1'); // optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1') # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
It's from the PR #14085
On multiple running msbuilds , it throws the exception of
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1.
```
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
It should make sure that the onnxruntime_test_all project depends on
dnnl project.
### Description
Fix a regression failure for cuda EP test
### Motivation and Context
CudaEP test is a special test case under EP folder, not in test folder.
when refactor the code during multi-stream work, we missed it. This PR
is to fix the test.
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
PyTorch removed upsample_nearest related backward functions with "vec"
overload name since 1.13. The functions without overload name are
available for all versions, though they are not that convienent to use.
This PR changes the gradient builder code to use functions without
overload name for ATen upsample_nearest nodes.
This PR also fixed a bug for ORTModule's corner case introduced by the
multi-stream PR. There is some code to execute the barrier step for
triggered downsteam is the barrier is out of range. But this should be
applied to triggered downstream only. If it's a normal run with start
step as a barrier step but out of range, we should not apply the logic.
For example, for ORTModule, if the barrier is the 1st step of whole CPU
plan, and the forward part is empty, then the forward normal run will
run step from start-0 to end-0 (actually nothing), and step-0 is the
barrier, then we should not execute the barrier in such case.
### Description
<!-- Describe your changes. -->
Fix an error:
`onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException:
[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during
initialization:
/onnxruntime_src/onnxruntime/core/framework/allocation_planner.cc:819
onnxruntime::common::Status
onnxruntime::PlannerImpl::ComputeValueLocation() allocator was false.`
This error happens when we run huggingface models with DDP on
multi-GPUs. In a thread with rank>0, it will attempt to obtain a CPU
memory allocator with device_id>0, which causes the error. There is a
workaround judges whether node’s output is on the CPU or not. If the
output is on CPU, we set device_id = 0.
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
Remove Abseil module placement specifications
### Motivation and Context
Allow Cmake defaults take place and possible redirection of all
submodules for sharing between the local builds.
### Description
Adds support for variadic inputs and outputs to custom operators.
### Motivation and Context
Needed for custom ops that wrap external runtimes/models and maybe TensorRT plugins.
### Description
Deprecate one step beam search since it lacks maintenance (some tests
failed) and its performance is not optimal.
For users who still need this feature, please use older version
(<=1.13.1) of onnxruntime to export one step beam search model, and the
model can run in latest onnxruntime.
It is recommend to use
[convert_generation.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py)
to generate beam search onnx model for better performance.
### Description
<!-- Describe your changes. -->
newly added test cases break the parity check. disable them temporarily
during investigation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
Sampling op for cpu and cuda
support huggingface case and custom case
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
`numpy.bool` has been removed as from 1.24.0.
It was before an alias for python's `bool`.
Fixes https://github.com/huggingface/optimum/issues/610
### Motivation and Context
Numpy 1.24.0 breaks for example IO binding helpers.
To perform QAT in onnxruntime, `FakeQuant` op was introduced in #13649.
The onnxruntime quantization tool generates a post training static
quantization onnx model with `QuantizeLinear`->`DequantizeLinear` nodes.
To perform QAT, this pattern needs to be transformed to `FakeQuant`.
This pull request introduces a graph transformer that looks for the
`Q->DQ` pattern and fuses it to a `FakeQuant` node.
### Description
Add support of ONNX conversion of GPT-2 for two stages:
* Stage 1 is the initial stage that has empty past state.
* Stage 2 has non-empty past state and sequence_length is 1.
Add a parameter --stage to specify such stage. For stage 1, we will
enable mask_index for Attention so that we can use fused attention in
CUDA.
Other changes:
(1) use int32 inputs as default (otherwise, there is error in inference)
(2) update gpt2_parity to include SkipLayerNormalization (see
https://github.com/microsoft/onnxruntime/pull/13988) and
EmbedLayerNormalization
(3) get all environment variables that might impact GPT-2 latency in
benchmark_gpt2
### Motivation and Context
To test fused attention for GPT-2 model for
https://github.com/microsoft/onnxruntime/pull/13953.
### Optimize computation orders
In `Roberta/Electra`, when `ClassificationHead` is used, there is
slicing operation on features on sequence_length dimensions, then loss
calculations only depend on this sliced data. This is a slicing at axis
1. Before slicing the shape is [batch, sequence_length, hidden], after
slicing, it becomes [batch , hidden_stage]
We had opportunities to bring this slicing earlier as much as possible,
by passing through simple elementwise ops (like Add/Div), or
Layernorm/Softmax(if their reduce axis is after the slicing axis), or
even MatMul's the left operand (if only it did not affect the last
dims).
For operators like Reshape/Transpose, it is special since they have
either data specified (after slicing we need update), or they have perm
specified, which requires the input rank remain unchanged. So for those
kinds of operators, we can remain the original rank, but just leave the
sliced dim to be 1, after the compute completed, we do a Squeeze.
```
class RobertaClassificationHead(nn.Module):
"""Head for sentence-level classification tasks."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
classifier_dropout = (
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
)
self.dropout = nn.Dropout(classifier_dropout)
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, features, **kwargs):
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
return x
```
src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py
#### Benchmark
A simple benchmark shows Robeta training latency dropped from 208ms ~
199ms. 4.5+% reduction.
More comprehensive tests are on the way.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Allows the user to select from supported backends for gpt2/convert_to_onnx.py. Default behavior is preserved if no provider is selected. This allows the ROCm EP to be selected.
Update Dockerfiles of ROCm and MIGraphX to ROCm5.4
Update README.md
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset
12+ is not supported yet.
With the way these kernels are currently registered, the documentation
shows support for opset 11+. This is not accurate.
### Motivation and Context
Fix#13781
This PR Registers the following operators for opset 16 to the DML EP:
- LeakyRelu-16
- PRelu-16
- Where-16
- GreaterOrEqual-16
- LessOrEqual-16
Identity-16 was not added in this PR due to pipeline failures
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Numpy1.24.0 removed the np.float.
```
/opt/hostedtoolcache/Python/3.8.15/x64/bin/python onnxruntime_test_python_sparse_matmul.py
EE.
======================================================================
ERROR: testRunContribSparseMatMul (__main__.TestSparseToDenseMatmul)
Mutliple sparse COO tensor to dense
----------------------------------------------------------------------
Traceback (most recent call last):
File "onnxruntime_test_python_sparse_matmul.py", line 407, in testRunContribSparseMatMul
np.float,
File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'
======================================================================
ERROR: testRunSparseOutputOnly (__main__.TestSparseToDenseMatmul)
Try running models using the new run_with_ort_values
----------------------------------------------------------------------
Traceback (most recent call last):
File "onnxruntime_test_python_sparse_matmul.py", line 39, in testRunSparseOutputOnly
values = np.array([1.764052391052246, 0.40015721321105957, 0.978738009929657], np.float)
File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Use target name for flatbuffers.
Add version range for flatbuffers. It is similar to #13870
### Motivation and Context
To fix a build error:
```
CMake Error at onnxruntime_graph.cmake:88 (add_dependencies):
The dependency target "flatbuffers" of target "onnxruntime_graph" does not
exist.
Call Stack (most recent call first):
CMakeLists.txt:1490 (include)
```
It happens when flatbuffers library is already installed. For example,
on Ubuntu people may get it from apt-get. But, the one provided by
Ubuntu 20.04 is not compatible with our code. The one in Ubuntu 22.04
works fine.
### Description
Update absl to a new version
### Motivation and Context
The new version contains fixes that are needed for Nvidia GPU build.
Once we update it to that version, we don't need to maintain our private
patches for Nvidia GPU build.
### Description
Explore the possible re-use of the logits buffer in `GreedySearch` for
cases where sequence length == 1 (Post the first decoding run, the
sequence length is guaranteed to be 1). This re-use will ensure that we
do not have to make copies of the logits before processing them.
Currently, we make a copy of the logits even if the sequence length == 1
which is not necessary as we can directly re-use the logits buffer for
the token generation step. A similar optimization exists in
`BeamSearch`, but seems lacking in `GreedySearch`. Since, the logits
buffer may contain padded data, we need to adjust the pieces consuming
the logits buffer directly to account for any padding.
A more invasive change (needs changes in a few places) will be to adjust
the interfaces of `ProcessLogits()` such that it takes a reference to
the logits and not a const reference as (based on my understanding) this
is the only place where the logits from the decoder subgraph will ever
be used and giving the `ProcessLogits()` method license to
mutate/process the underlying buffer of the logits OrtValue seems
reasonable (instead of making a copy and then mutating/processing them).
The will also remove the ugly `const_cast`(s) seen in this change.
### Description
MLAS QGEMM kernel need memory buffer for packing of source tensors. This
change moves these buffers from stack to heap
### Motivation and Context
MLAS QGEMM kernels have packing buffers on the stack since the beginning
of time. Emerging hardware demands larger and larger buffers, causing
potential stack overflow problems down the road. This change moves these
buffers from stack to the heap.
This change also introduces a thread initializer per kernel. For
instance, in the new AMX instruction set (support coming), we need to
initialize the tile registers per thread. This requirement can be easily
satisfied by tapping into this change.
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Implement reuse kv_cache past and present tensor in Attention Ops.
Unit test for abover feature.
Utilize the reuse kv_cache for past and present tensor in Greedy Search.
Correctness test for it.
Co-authored-by: Zhang Lei <phill.zhang@gmail.com>
Fix error: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Werror,-Wdeprecated-builtins]
This is not a clean fix as in 13783, users will need to manually set `CMAKE_HIP_FLAGS="-Wno-deprecated-builtins"` if they want to use self-built hipclang combining with ROCm 5.3.* or older.
### Description
update versions of a few build dependencies for onnxruntime NPM
packages.
update nodejs version to v16.x in linux CI. v12 is too out-of-dated. see
[nodejs release
schedule](https://github.com/nodejs/release#release-schedule)
### Motivation and Context
- upgrade to latest webpack allows using of latest Node.js LTS version.
previous version of webpack does not work on Node.js v18 and it is fixed
in latest version
- upgrade to latest typescript, ts-loader and other dev deps to
accelerate the build and bundling.
- upgrade also helps to resolve security warnings that may be vulnerable
in out-of-dated version
### Description
Add the ability to run graph
### Motivation and Context
A brief description is as follows:
1) If the whole graph is supported, then will be processed by the graph
engine, directly.
2) If the whole graph is not supported, the whole graph will be divided
into subgraphs and single operators; The sub-graphs will be run on graph
engine, and the single operators will fallback to the traditional mode.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add ability to set RunOptions config entries. Largely a cut-and-paste of
the existing code for setting SessionOptions config entries.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#13936