**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in
onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to
support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EPs to support the stream mechanism and to
support running different requests in different CUDA streams.
**Motivation and Context**
- Why is this change required?
Currently, the execution plan is just a linear list of primitives, and
ORT executes them step by step. For any given graph, ORT serializes it
to a fixed execution order. This sequential execution design simplifies
most scenarios, but it has the following limitations:
1. It is difficult to enable inter-node parallelization. We have a
half-baked parallel executor, but it is very difficult to make it work
with GPU.
2. The fence mechanism works for the single GPU stream + CPU thread
case, but when extended to multiple streams, it is difficult to manage
the cross-GPU-stream synchronization.
3. Our CUDA EP relies on the BFCArena to make memory management work
with the GPU async kernels, but the current BFCArena is not aware of
streams, so it doesn't behave correctly when run with multiple
streams.
This PR enhances our existing execution plan and executor to support
multiple-stream execution. We use a unified algorithm to manage both
single-stream and multiple-stream scenarios.
This PR mainly focuses on the infrastructure support for multiple-stream
execution; that is, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be addressed in a future PR.
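
To illustrate what "executing a valid stream assignment correctly" means, here is a toy sketch in plain Python. It is not ORT's executor: each logical stream is modelled as a thread, and cross-stream synchronization is modelled with `threading.Event`. The names `run_streams`, `stream_plans`, and `deps` are hypothetical and exist only for this illustration.

```python
import threading

def run_streams(stream_plans, deps):
    """Run each plan (a list of op names) on its own thread.

    deps maps an op to the set of ops, possibly on other streams,
    that must finish before it may start.
    """
    done = {op: threading.Event() for plan in stream_plans for op in plan}
    order = []                       # observed completion order
    lock = threading.Lock()

    def worker(plan):
        for op in plan:
            for upstream in deps.get(op, ()):
                done[upstream].wait()    # wait for the producer to signal
            with lock:
                order.append(op)         # stand-in for launching the kernel
            done[op].set()               # signal consumers on other streams

    threads = [threading.Thread(target=worker, args=(plan,)) for plan in stream_plans]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

# Two streams; C (stream 0) consumes B (stream 1), D (stream 1) consumes A (stream 0).
plans = [["A", "C"], ["B", "D"]]
deps = {"C": {"B"}, "D": {"A"}}
order = run_streams(plans, deps)
```

Whatever interleaving the threads take, `B` always completes before `C` and `A` before `D`, which is the property a multi-stream executor has to preserve.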
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
Fix usage of enable_training_ops and reduce ifdef complexity for
training builds.
### Motivation and Context
This is the second refactoring PR towards creating a dedicated build for
on-device training. This PR aims to reduce some complexity. We can set
ENABLE_TRAINING_OPS in cmake when either ENABLE_TRAINING or
ENABLE_TRAINING_ON_DEVICE is selected, so that we don't have to use
`#if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE)`
everywhere in the code.
### Description
This PR fixes some typos in the training APIs.
We need to add more tests and make sure they all run on the CIs to
catch such issues. These changes are out of scope for this PR.
### Motivation and Context
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
Add cuda support to the on device training python bindings.
### Motivation and Context
Now users can set the execution provider (cpu or cuda) when using python
bindings for on device training apis.
### Description
Decouple strided tensor support from ENABLE_TRAINING
### Motivation and Context
This is step 1 towards creating a dedicated build for on-device training.
The intention is:
1. We can set ENABLE_STRIDED_TENSORS in cmake when either
ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, so that we
don't have to use `#if defined(ENABLE_TRAINING) ||
defined(ENABLE_TRAINING_ON_DEVICE)` everywhere in the code.
2. This also paves the way to easily enable strided tensor support for
inference in the future (if required).
### Description
The proposed change is useful for ORTModule when the output graph has
multiple outputs.
### Motivation and Context
performance
Signed-off-by: xadupre <xadupre@microsoft.com>
The PR optimizes the Slice CUDA kernel in two ways:
- Coalesce dimensions so that fewer divmod operations are needed during kernel compute
- Split data load and write for better memory throughput
The perf results below (cycle counts from Nsight Compute) were measured
on a V100 using real cases from Huggingface's XLNet model:
Input shape and slice | Old | New
-- | -- | --
[8,12,2048,1024], axis=2, start=1, end=2048 | 1838687 | 1539846
[8,12,1024,2047], axis=3, start=0, end=1024 | 951383 | 722203
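
One way to read "coalesce dimensions" is sketched below in Python: adjacent dimensions that are not sliced are merged into one, so the kernel needs fewer index decompositions (divmod) per element. This is an assumption about the idea, not the kernel's actual code, and `coalesce_dims` is a hypothetical helper.

```python
def coalesce_dims(shape, sliced_axes):
    """Merge runs of adjacent non-sliced dims; return (new_shape, new_sliced_axes)."""
    new_shape, new_axes = [], []
    prev_mergeable = False           # True if the last emitted dim is a non-sliced run
    for axis, dim in enumerate(shape):
        if axis in sliced_axes:
            new_axes.append(len(new_shape))
            new_shape.append(dim)    # sliced dims are kept as they are
            prev_mergeable = False
        elif prev_mergeable:
            new_shape[-1] *= dim     # fold into the previous non-sliced run
        else:
            new_shape.append(dim)
            prev_mergeable = True
    return new_shape, new_axes

# The two cases from the table above.
case_a = coalesce_dims([8, 12, 2048, 1024], {2})   # ([96, 2048, 1024], [1])
case_b = coalesce_dims([8, 12, 1024, 2047], {3})   # ([98304, 2047], [1])
```

With this reading, the 4-D inputs become 3-D and 2-D problems respectively, which reduces the per-element index arithmetic.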
Right now we fix the warnings in an ad-hoc way. We run static analysis
in nightly builds, then create work items for the findings. Our
CI build pipelines run the same scan but do not break the build. So,
this PR fixes the remaining findings in the CPU EP (including the
training part) and enforces the check. Later on we can continue to expand
the scope.
We still have some warnings left in the JNI part. I will try to address
them later next month.
Motivation:
PythonOp saves its inputs for backward. This is risky because the ONNX
Runtime backend is not aware of it: the tensor buffer may be "released"
by ORT and then potentially modified by other operators before the
backward function executes.
Fix:
This PR simply clones all inputs of PythonOp before forward is invoked.
This may have high overhead; it is just a workaround until a better fix
is available.
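
The hazard and the workaround can be shown with a toy example in plain Python. Lists stand in for tensor buffers; the `Square` class and the `run` helper are hypothetical and only mimic an op that saves its input for backward.

```python
class Square:
    """Toy op y = x * x that saves its input for the backward pass."""

    def forward(self, x):
        self.saved = x                      # keeps a reference, not a copy
        return [v * v for v in x]

    def backward(self, grad_out):
        return [2.0 * v * g for v, g in zip(self.saved, grad_out)]

def run(clone_inputs):
    buf = [1.0, 2.0, 3.0]                   # buffer owned by the "runtime"
    op = Square()
    op.forward(list(buf) if clone_inputs else buf)
    buf[:] = [0.0, 0.0, 0.0]                # runtime reuses the buffer for another op
    return op.backward([1.0, 1.0, 1.0])

grads_without_clone = run(clone_inputs=False)   # computed from overwritten data
grads_with_clone = run(clone_inputs=True)       # computed from the original input
```

Without the clone the gradient is computed from overwritten data; with the clone it is correct, at the cost of one extra copy per input.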
### Fix training convergence issues
#### Problem:
Huggingface Transformers: 4.22.0
PyTorch Lightning: 1.6.3
PyTorch: v1.12.1, cuda 11.6
ORT: main branch, cuda 11.6
Model: RobertaForSequenceClassification @
models/roberta/modeling_roberta.py
Mixed Precision training with `torch.autocast`:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)
Under this amp autocast context, forward + loss computation run. Here is
a snippet of loss computation.
```
if labels is not None:
    ...
    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            ...
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))  # <-- the line of interest
    elif self.config.problem_type == "multi_label_classification":
        ...
return SequenceClassifierOutput(
    loss=loss,
    logits=logits,
    hidden_states=outputs.hidden_states,
    attentions=outputs.attentions,
)
```
After the forward run, the loss is 1.0850 in float16, which looks good.
It is then scaled up here:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62),
where the scaler is 65536. We get a scaled loss of 71104 in float32
(the float16 loss is multiplied by the fp32 scaler, so the type is
promoted to fp32).
Backward then starts with an initial gradient of 1. The first backward
step computes 1 (float32) * 65536 (float32) and generates a float16
gradient. Since 65536 exceeds the largest finite float16 value (65504),
we get `inf`. The backward pass feeds this `inf` into the
crossentropygradient op, which generates `nan`s, and then all gradients
become `nan` during back propagation.
As a result, training with ORTModule almost always overflows, and the
loss does not drop as much as it does with PyTorch.
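
The arithmetic above can be reproduced with the Python standard library alone. In this sketch `to_fp16` is a hypothetical helper that rounds through the half-precision `struct` format, Python floats stand in for fp32, and the final `inf * 0.0` is only one illustrative way an `inf` turns into `nan`, not the exact computation inside the gradient op.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:                 # value is outside the float16 range
        return math.copysign(math.inf, x)

scale = 65536.0
loss_fp16 = to_fp16(1.0850)               # nearest float16 to 1.0850
scaled_loss = loss_fp16 * scale           # promoted type keeps this finite
grad_fp16 = to_fp16(1.0 * scale)          # casting 65536 to float16 overflows
poisoned = grad_fp16 * 0.0                # inf * 0 is nan, which then propagates
```

Here `scaled_loss` is 71104.0, matching the value observed above, while `grad_fp16` is already `inf` before any gradient op runs.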
#### Analysis for the UT (when autocast enabled)
PyTorch trace graph looks like this :
```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
%9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%12 : NoneType = prim::Constant()
%13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%17 : NoneType = prim::Constant()
%18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%19 : NoneType = prim::Constant()
%input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%21 : NoneType = prim::Constant()
%22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
%23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
%24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
%data : Float(requires_grad=0, device=cuda:0) = **aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0**
%27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
return (%27)
```
The most important lines
%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%input : **_Half_**(16, 3, strides=[3, 1], requires_grad=0,
device=cuda:0) = aten::linear(%18, %13, %19) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
**_Float_**(requires_grad=0, device=cuda:0) =
aten::cross_entropy_loss(**%_input_**, %target, %21, %22, %23, %24) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
`aten::cross_entropy_loss` takes Half input and returns Float output. As
stated in the doc
https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32,
`cross_entropy` in autocast mode runs in fp32, i.e. it converts its
input to fp32 (if it is not already), does the compute, and returns an
fp32 result. On the other hand, ORT's `SoftmaxCrossEntropyLossInternal`
takes the same type for input and output, and our code
31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)
assumed this when exporting `aten::cross_entropy_loss` and set the
output to fp16 as well. This is the reason we have the problem.
#### Possible Fixes
1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types
of input and output.
2. Check the input and output types when exporting, and add the input
cast explicitly if there is type promotion from input to output.
This PR used the 2nd approach. We can start 1st approach when needed
later.
TODO: revisit all other exporter functions, add the checks, etc.
### Motivation and Context
`aten::_to_copy` is not exportable to ONNX, so in DORT it is replaced in
`_replace_to_copy_with_to`. This replacement logic became incorrect with
the latest PyTorch commit, and this PR is a fix.
Basically, we examine more key-word attributes passed to
`aten::_to_copy` and if they lead to a type casting operator (i.e.,
mapped to ONNX's Cast), we replace that `aten::_to_copy` with
`aten::to`. Unsupported attributes are removed (with a low risk of
breaking FX graph's assumptions).
### Add guidelines for ORTModule
As title.
Feel free to let me know if I missed something.
### Motivation and Context
A customer reported this issue: they see many warnings when doing
evaluation using ORTModule.

After investigation, we found that `training_mode` is exported with the
wrong value in evaluation mode: its value should be 0, but we found it is 1.
Fix:
Fix the PythonOp training mode. If the type of `training_mode` is
`torch._C._onnx.TrainingMode`, then no matter whether it is EVAL or
TRAINING, `if training_mode` is always true.
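
The pitfall can be reproduced with a stand-in enum from the Python standard library. The `TrainingMode` class below is an assumption that mimics `torch._C._onnx.TrainingMode` (it is not the real type), and both helper functions are hypothetical; the comparison in `export_flag_fixed` is one possible remedy, not necessarily the exact change made in this PR.

```python
import enum

class TrainingMode(enum.Enum):
    """Stand-in for torch._C._onnx.TrainingMode (assumed values)."""
    EVAL = 0
    PRESERVE = 1
    TRAINING = 2

def export_flag_buggy(training_mode):
    # Plain enum members are always truthy, even when their value is 0.
    return 1 if training_mode else 0

def export_flag_fixed(training_mode):
    # Compare against the member explicitly instead of relying on truthiness.
    return 1 if training_mode == TrainingMode.TRAINING else 0

buggy_eval = export_flag_buggy(TrainingMode.EVAL)    # 1, the wrong value
fixed_eval = export_flag_fixed(TrainingMode.EVAL)    # 0, as expected
```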
**Description**: Subgraph-level recompute
This PR adds an optional capability that trades additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
is used to iterate over the graph and find subgraphs to recompute, which
reduces the stashed activations whose lifetime spans the forward and
backward passes.
When training with ORTModule, by default, the graph transformer scans
the execution graph to find all subgraphs eligible for recompute,
along with the sizes that can be saved. An example looks like below.
If we want to enable recompute for some of them, we can define the env
variable this way:
`export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```
[1,0]<stderr>:2022-10-12 14:47:39.302954530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>: User config:
[1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>: =================================
[1,0]<stderr>: Subgraph: BitmaskDropout+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 1024 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: BiasGelu+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Reshape+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Add+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Sub+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:1024 x 1024 x Frequency:97
[1,0]<stderr>: PatternShape:3 x 1024 x Frequency:1
[1,0]<stderr>: PatternShape:8 x 64 x Frequency:24
[1,0]<stderr>: PatternShape:1024 x 4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x 1024 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: =================================
```
"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.
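
For readers who want to inspect a value of `ORTMODULE_ENABLE_MEMORY_ALLEVIATION` before exporting it, the sketch below splits it into per-subgraph entries. It assumes the format seen in the example above (comma-separated entries, each a subgraph signature followed by colon-separated fields, the first being the 0/1 enable flag). The meaning of any further field is not described here, so it is kept as raw text. `parse_alleviation_config` is a hypothetical helper, not part of ORT.

```python
def parse_alleviation_config(value):
    """Split 'Sig+:1:-1,Other+:0' into {signature: {'enabled': bool, 'extra': [...]}}."""
    config = {}
    for entry in value.split(","):
        if not entry:
            continue                          # tolerate trailing commas
        signature, *fields = entry.split(":")
        enabled = bool(int(fields[0])) if fields else False
        config[signature] = {"enabled": enabled, "extra": fields[1:]}
    return config

parsed = parse_alleviation_config("BiasGelu+:1:-1,Cast+:0")
```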
**Baseline**
On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With the latest main branch, we can run batch size 16, and the
maximum batch size is < 32, so 16 is usually chosen by data scientists.
65% of the 40GB memory is used during training.
SamplesPerSec=479.2543353561354.

**With this PR**
With Gelu recomputed to reduce the memory peak, batch size 32 can be
run. 97% of the 40GB A100 memory is used, and
SamplesPerSec=562.041593991271 (**1.17X** of baseline).

**Motivation and Context**
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.
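
As a reference for the math being fused, here is a scalar sketch in plain Python. It follows the formula QuickGelu(x)=x*sigmoid(1.702x) from the text, and the gradient is derived by the product rule. The function names are illustrative and say nothing about how the ORT kernels are written.

```python
import math

ALPHA = 1.702

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quick_gelu(x):
    """Forward: x * sigmoid(ALPHA * x)."""
    return x * sigmoid(ALPHA * x)

def quick_gelu_grad(dy, x):
    """Backward: dy * d/dx [x * sigmoid(ALPHA * x)]."""
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))

# Check the analytic gradient against a central finite difference.
x0, eps = 0.7, 1e-6
numeric = (quick_gelu(x0 + eps) - quick_gelu(x0 - eps)) / (2.0 * eps)
analytic = quick_gelu_grad(1.0, x0)
```

Computing the whole expression in one pass is what lets a fused kernel avoid writing and re-reading the intermediate tensors that the separate Mul and Sigmoid ops produce.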
For CUDA, tested on a V100 using an input tensor with shape [64,128,2048]
and float16 type:
Before, FW takes 335us, BW takes 614us

After, FW takes 115us, BW takes 139us, which is much faster.

For the CPU kernel, using the same shape and float type:
Before, FW takes 10ms, BW takes 49ms
- Mul: 3480[µs]
- Sigmoid: 1996[µs]
- Mul: 4789[µs]
- Mul: 4642[µs]
- Mul: 4195[µs]
- SigmoidGrad: 18328[µs]
- Mul: 2988[µs]
- Sum: 18576[µs]

After, FW takes 4ms, BW takes 5ms, which is also much faster.
- QuickGelu: 3939[µs]
- QuickGeluGrad: 5089[µs]
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Add an env variable to control disabling custom autograd function support.
When using ORTModule, if the torch model has a torch.autograd.Function
and the user confirms that it can be exported to ONNX (for example, by
inlining PythonOp) and that the backward implementation matches the
forward implementation, the user can export
"ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to disable the custom
autograd support, so that ORT's PythonOp is not used to fall back to
PyTorch. Exporting to ONNX can sometimes leverage graph optimizations in
ORT so that perf is better.
### Description
This is a fix for the on-device training wheel build.
### Motivation and Context
When building the Linux wheel, PathString is treated the same as
std::string, but building the wheel on Windows fails because we need to
cast the std::string to a PathString.
This error was found manually because no pipeline uses
--enable_training_on_device on Windows.
Co-authored-by: Adam Louly <adamlouly@microsoft.com>
### Description
Use a SAS token to fix the error `failed to perform copy command due to
error: no SAS token or OAuth token is present and the resource is not
public`.
Generate a SAS token for the target data, add it to Key Vault, and use
it as a pipeline variable.
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
The PR applies some fixes to Hierarchical ORTModule and ORTModule
PythonOp.
For Hierarchical ORTModule:
- Don't wrap the module if the caller calls a function other than
forward()
- Support a single module instance being called multiple times with
different types of inputs
- Check whether a module can be wrapped from top to bottom instead of
from bottom to top
For ORTModule PythonOp:
- Add an env variable to allow using
torch.utils.checkpoint.CheckpointFunction
- Add an env variable to skip registering some autograd functions so
that there is no conflict for some models.
### Description
Update the ptca image used in the nightly pipeline.
Co-authored-by: Adam Louly <adamlouly@microsoft.com>
**Description**: utils for federated learning.
**Motivation and Context**
- This PR includes utils that will be used in federated learning
scenarios.
- It exposes Python bindings for some utils and adds a util to calculate
the difference between two buffers.
Co-authored-by: Adam Louly <adamlouly@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
`python setup.py develop` doesn't install PyTorch as a normal package in
site-packages anymore, and the user must stay in PyTorch's root
directory to call `import torch`. This breaks LORT tests because
they contain `import torch` and are called from outside PyTorch's root
directory. To make PyTorch a normal package again, this PR builds PyTorch
with `python setup.py install`.
This PR adds support for using an env variable to set the provider option
cudnn_conv_algo_search, so that the user can choose a better conv algo
search method when running a model. This is a quick fix to unblock the
test of the MoE model. Another PR will design and implement the ORTModule
config so that we can configure ORTModule using a Python script or config
file instead of env variables.
Models in [huggingface's diffusers
library](https://github.com/huggingface/diffusers) use
torch.nn.GroupNorm, which is exported to a sub-graph containing ONNX's
InstanceNormalization, which lacks a gradient. In addition, ORT's
InstanceNormalization implementation calls cuDNN's BatchNorm for part of
the computation, which is not efficient compared to PyTorch's
implementation. This PR uses ATen fallback to support this torch
module, including its forward and backward.
### Description
Unit tests with ROCm5.3 are slower than with ROCm5.2.3. Revert to ROCm5.2.3.
We will update to ROCm5.3 when the issue is resolved by AMD.
### Motivation and Context