Commit graph

1156 commits

Author SHA1 Message Date
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in
onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to
support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EP to support the stream mechanism and
support running different requests in different CUDA streams.

**Motivation and Context**
- Why is this change required?
Currently, the execution plan is just a linear list of primitives, and
ORT executes them step by step. For any given graph, ORT serializes it
to a fixed execution order. This sequential execution design simplifies
most scenarios, but it has the following limitations:
1. It is difficult to enable inter-node parallelization. We have a
half-baked parallel executor, but it is very difficult to make it work
with GPU.
2. The fence mechanism works for the single GPU stream + CPU thread
case, but when extended to multiple streams, it is difficult to manage
the cross-GPU-stream synchronization.
3. Our CUDA EP relies on the BFCArena to make memory management work
with asynchronous GPU kernels, but the current BFCArena is not aware of
streams, so it doesn't behave correctly when run with multiple
streams.

This PR enhances our existing execution plan and executor to support
multi-stream execution. We use a unified algorithm to manage both
single-stream and multi-stream scenarios.
This PR mainly focuses on the infrastructure support for multi-stream
execution; that is to say, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be addressed in a future PR.

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
Baiju Meswani
1fd63487fd
ORTModule support for kwargs input that is a dict (#13910) 2022-12-14 16:23:48 -08:00
Baiju Meswani
5a55fac402
Miscellaneous updates to training apis (#13929) 2022-12-14 13:33:07 -08:00
Baiju Meswani
8c249cc8f7
[QAT] FakeQuantGrad and gradient building for FakeQuant (#13825) 2022-12-14 11:54:02 -08:00
Ashwini Khade
6090d8cd6e
Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888)
### Description
Fix usage of enable_training_ops and reduce ifdef complexity for
training builds.




### Motivation and Context
This is the second refactoring PR towards creating a dedicated build for
on-device training. This PR aims to reduce some complexity. We can set
ENABLE_TRAINING_OPS in CMake when either ENABLE_TRAINING or
ENABLE_TRAINING_ON_DEVICE is selected; this way we don't have to use
`#if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE)`
everywhere in the code.

2022-12-14 08:32:46 -08:00
PeixuanZuo
80a046b36f
[ROCm] update amd CI huggingface model performance number (#13961)
Fix CI test failure.
Test distilbert-base model performance number on gcramdrr1-mi100-08x and
update.
2022-12-14 16:30:25 +08:00
Ashwini Khade
a7bc927b4b
fix typos in training apis (#13908)
### Description
This PR fixes some typos in the training apis.

We need to add more tests and make sure they are all run on the CIs to
capture such issues. These changes are out of scope of this PR.




Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-12-09 16:01:11 -08:00
Adam Louly
fb4707f76d
add cuda support to python bindings (#13700)
### Description
Add cuda support to the on device training python bindings.



### Motivation and Context
Now users can set the execution provider (cpu or cuda) when using python
bindings for on device training apis.
2022-12-08 16:03:53 -08:00
Adam Louly
f453d2845e
adding get and set lr for optimizer (#13661)
### Description
Exposing get and set Learning rate for optimizer


### Motivation and Context
Users can now get and set the learning rate for the optimizer.
2022-12-07 11:59:11 -08:00
Ashwini Khade
983877c712
Decouple strided tensor support from ENABLE_TRAINING (#13829)
### Description
Decouple strided tensor support from ENABLE_TRAINING

### Motivation and Context
This is step 1 for creating a dedicated build for on-device training.
The intention is:

1. We can set ENABLE_STRIDED_TENSORS in CMake when either
ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected; this way we
don't have to use `#if defined(ENABLE_TRAINING) ||
defined(ENABLE_TRAINING_ON_DEVICE)` everywhere in the code.

2. This also paves the way to easily enable strided tensor support for
inference in future (if required).
2022-12-07 09:22:21 -08:00
Wei-Sheng Chin
7df8f84228
Improve DORT document (#13790)
1. Refine words based on PyTorch changes.
2. Make the need of inference mode clearer. A test is added.
2022-11-30 16:55:25 -08:00
Wei-Sheng Chin
639d285670
[DORT] Catch up with yesterday's PyTorch change (#13779)
Fix recent CI failures.
2022-11-30 09:23:44 -08:00
Xavier Dupré
441b30b2d2
Move a function call outside a loop in ORTModule (#13771)
### Description
The proposed change is useful for ORTModule when the output graph has
multiple outputs.



### Motivation and Context
performance

Signed-off-by: xadupre <xadupre@microsoft.com>
2022-11-30 12:49:41 +01:00
Baiju Meswani
2c29938846
[QAT] Introduce FakeQuant op (#13649) 2022-11-29 08:43:37 -08:00
pengwa
7c53b6eee8
Skip the tests of saving tensor in backward (#13767)
### skip the tests of saving tensor in backward

The test fails randomly. Let's skip it until the issue is fixed, to
unblock the CIs.
2022-11-29 13:02:26 +08:00
Vincent Wang
3c258c878c
[CUDA] Optimize Slice Kernel (#13641)
The PR optimizes the Slice CUDA kernel in two ways:
- Coalesce dimensions so there are fewer divmod operations during the kernel compute
- Split data load and write for better memory throughput

Below are some perf results (cycle counts from Nsight Compute) on a
V100, using real cases from Huggingface's XLNet model:

| Slice case | Old (cycles) | New (cycles) |
| -- | -- | -- |
| [8,12,2048,1024], axis=2, start=1, end=2048 | 1838687 | 1539846 |
| [8,12,1024,2047], axis=3, start=0, end=1024 | 951383 | 722203 |
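
The dimension-coalescing idea can be illustrated with a small Python sketch. This is not the CUDA kernel from the PR: it assumes a contiguous row-major tensor sliced along a single axis, and the helper names `coalesce_for_slice` and `flat_offset` are hypothetical. All dimensions before the sliced axis are merged into one and all dimensions after it into another, so per-element index arithmetic needs fewer divmod steps.

```python
from math import prod


def coalesce_for_slice(shape, axis):
    """Merge all dims before and after the sliced axis (hypothetical helper)."""
    outer = prod(shape[:axis])        # dims before the sliced axis
    inner = prod(shape[axis + 1:])    # dims after the sliced axis
    return (outer, shape[axis], inner)


def flat_offset(index, shape):
    """Row-major flat offset of a multi-dimensional index."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


shape = (8, 12, 2048, 1024)           # first case from the results above
coalesced = coalesce_for_slice(shape, axis=2)

# The same element maps to the same flat offset in both views.
idx4 = (3, 5, 7, 9)
idx3 = (3 * 12 + 5, 7, 9)
same = flat_offset(idx4, shape) == flat_offset(idx3, coalesced)
```

With the 4-D shape, decomposing a flat offset takes three divmod steps; with the coalesced 3-D shape it takes two.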
2022-11-29 09:18:03 +08:00
Changming Sun
87e6a26c5d
Enforce Prefast check in Windows CPU CI pipeline (#13735)
Right now we fix the warnings in an ad-hoc way. We run static analysis
in nightly builds, then create work items for the findings. Our
CI build pipelines run the same scan but do not break the build. So
this PR fixes the remaining findings in the CPU EP (including the
training part) and enforces the check. Later on we can continue to expand
the scope.

We still have some warnings left in the JNI part. I will try to address
them next month.
2022-11-23 09:25:02 -08:00
guyang3532
ba9a585fcc
Fix the tensor save for backward release problem (#13679)
Motivation:
PythonOp saves inputs for backward. This is risky because the ONNX Runtime
backend is not aware of it: the tensor buffer may be "released" by
ORT and then potentially modified by other operators before the backward
function executes.

Fix:
This PR simply clones all inputs of PythonOp before forward is invoked. This
may have high overhead; it is just a workaround until a better fix is available.
2022-11-22 17:32:19 +08:00
pengwa
947aab0ae0
Make HF converge with lighting native amp (#13616)
### Fix training convergence issues 

#### Problem:

Huggingface Transformers: 4.22.0
PyTorch Lightning: 1.6.3 
PyTorch: v1.12.1, cuda 11.6
ORT: main branch, cuda 11.6

Model: RobertaForSequenceClassification @
models/roberta/modeling_roberta.py
Mixed Precision training with `torch.autocast`:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)

Forward and loss computation run under this amp autocast context. Here is
a snippet of the loss computation.

```
        if labels is not None:
                ...
            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                   ...
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                **loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))**
            elif self.config.problem_type == "multi_label_classification":
                ...

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
```

After the forward run, the loss is 1.0850 in float16, which looks good.
It is then scaled up here:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62).
The scaler is 65536, so we get a scaled loss of 71104 in float32
(because the float16 loss is multiplied by the fp32 scaler, the type is
promoted to fp32). Backward then starts with initial grads of 1, and
1 (float32) * 65536 (float32) is computed as the backward step. The
result is emitted as a float16 gradient, and since 65536 exceeds the
float16 maximum of 65504, we get `inf`. This is where the problem
occurs: backward feeds the `inf` into the cross-entropy gradient op,
generating `nan`s, and then all gradients become `nan` during back
propagation.

As a result, training with ORTModule almost always reports `overflow`,
and the loss does not drop as much as it does with PyTorch.
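
The arithmetic above can be reproduced with a short standard-library Python sketch. It only emulates the float16 rounding with `struct` (the helper `to_fp16` is hypothetical) and does not call PyTorch or ORT; the final `inf - inf` step is just one illustrative way an `inf` turns into `nan`, not the exact computation inside the gradient op.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the float16 range (max 65504);
        # a device-side cast would produce inf instead.
        return math.copysign(math.inf, x)


scale = 65536.0                      # loss scaler from the description
loss_fp16 = to_fp16(1.0850)          # loss as stored in float16
scaled_loss = loss_fp16 * scale      # promoted to a wider type: still finite
grad_fp16 = to_fp16(1.0 * scale)     # initial grad * scale cast to float16
poisoned = grad_fp16 - grad_fp16     # inf - inf, one way to obtain nan
```

Here `scaled_loss` stays finite because it is held in a wider type, while `grad_fp16` overflows because 65536 is not representable in float16.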

#### Analysis for the UT (when autocast enabled)

PyTorch trace graph looks like this :

```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
      %target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
      %2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
  %9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %12 : NoneType = prim::Constant()
  %13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %17 : NoneType = prim::Constant()
  %18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %19 : NoneType = prim::Constant()
  %input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %21 : NoneType = prim::Constant()
  %22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %data : Float(requires_grad=0, device=cuda:0) = **aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0**
  %27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
  return (%27)
```

The most important lines 

%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%input : **_Half_**(16, 3, strides=[3, 1], requires_grad=0,
device=cuda:0) = aten::linear(%18, %13, %19) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
**_Float_**(requires_grad=0, device=cuda:0) =
aten::cross_entropy_loss(**%_input_**, %target, %21, %22, %23, %24) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0


`aten::cross_entropy_loss` takes Half input and returns Float output. As
stated in the doc
https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32,
`cross_entropy` in autocast mode runs in fp32, i.e. it converts its
input to fp32 (if it is not already), does the compute, and returns an
fp32 result. ORT's `SoftmaxCrossEntropyLossInternal`, on the other hand,
uses the same type for input and output, and our code
31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)
assumed this when exporting `aten::cross_entropy_loss` and set the
output to fp16 as well. This is the cause of the problem.

#### Possible Fixes
1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types
of input and output.
2. Check the input and output types when exporting, and add the input
cast explicitly if there is type promotion from input to output.

This PR uses the 2nd approach. We can pursue the 1st approach later if
needed.

TODO: revisit all other exporter functions, add the checks, etc. 


2022-11-22 15:08:30 +08:00
Changming Sun
a5c2047dd1
Fix the remaining Prefast warnings in CPU EP (#13707)
### Description

Fix the remaining Prefast warnings in CPU EP.
2022-11-21 10:21:38 -08:00
Wei-Sheng Chin
6160ba0692
Fix aten::_to_copy in DORT (#13682)
`aten::_to_copy` is not exportable to ONNX, so in DORT it's replaced in
`_replace_to_copy_with_to`. This replacement logic became incorrect with the latest PyTorch
commit, and this PR is a fix.

Basically, we examine more key-word attributes passed to
`aten::_to_copy` and if they lead to a type casting operator (i.e.,
mapped to ONNX's Cast), we replace that `aten::_to_copy` with
`aten::to`. Unsupported attributes are removed (with a low risk of
breaking FX graph's assumptions).
2022-11-18 09:31:18 -08:00
Vincent Wang
07812a2fa6
Fix UT Failure on AMD for ORTModule's Conv Test (#13688)
Currently the provider option conv_algo_search is for CUDA only, so remove
the check for the ROCm EP.
2022-11-18 17:52:22 +08:00
cloudhan
9e649d1ac4
Allow CUDA EP enable or disable TunableOp via session options and environment variable (#13601)
This ports #13116 from ROCm EP to CUDA EP
2022-11-15 14:43:54 +08:00
Vincent Wang
2bda3fd341
Gather to Slice Fusion (#13599)
This PR optimizes the execution of the code below, from Huggingface's
XLNet model.
```
x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))
```

The code is exported to Range->Gather, which can be fused into a
Slice op. The Slice kernel is much faster than Gather, especially for
the backward run. The main reason is that, for Gather, the values in indices can be
duplicated, so a sum is needed during backward, whereas a Slice node cannot
have such a case.
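
The equivalence and the backward-cost difference can be sketched in plain Python. This is a 1-D illustration only (the original code gathers along axis 3 of a tensor), and the function names are hypothetical rather than ORT kernels.

```python
def gather(data, indices):
    """Forward of a 1-D Gather."""
    return [data[i] for i in indices]


def gather_backward(grad_out, indices, size):
    """Backward of Gather: indices may repeat, so gradients must be summed."""
    grad_in = [0.0] * size
    for g, i in zip(grad_out, indices):
        grad_in[i] += g
    return grad_in


def slice_backward(grad_out, start, end, size):
    """Backward of Slice: every position is written at most once, so a copy suffices."""
    grad_in = [0.0] * size
    grad_in[start:end] = grad_out
    return grad_in


data = [10.0, 11.0, 12.0, 13.0, 14.0]
klen = 3
indices = list(range(klen))           # what the Range node produces

forward_equal = gather(data, indices) == data[0:klen]
grad = [1.0, 2.0, 3.0]
backward_equal = (gather_backward(grad, indices, len(data))
                  == slice_backward(grad, 0, klen, len(data)))

# With duplicated indices, Gather's backward has to accumulate.
dup = gather_backward([1.0, 2.0], [1, 1], len(data))
```

When the indices come from a Range, both forward and backward match a Slice, which is what makes the fusion valid; the `dup` case shows the accumulation that a general Gather backward has to support.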

Use Huggingface's XLNet model for profiling.
- Before the fuse
forward, ~753us

![image](https://user-images.githubusercontent.com/11661208/200758439-63f2f9b5-9610-4df8-98c8-a1ad4dc62f4e.png)
backward, ~46101us

![image](https://user-images.githubusercontent.com/11661208/200758530-fe16a8ec-ea8f-4b79-b3ac-386b72ba1670.png)

- After the fuse
forward, ~627us

![image](https://user-images.githubusercontent.com/11661208/200758654-ab9a6068-c45d-40f4-9c71-3862a56732f8.png)
backward, ~677us

![image](https://user-images.githubusercontent.com/11661208/200758833-aab1b8e1-1b5d-4e55-88cf-03c2a1d9d42b.png)
2022-11-10 13:03:30 +08:00
Edward Chen
9e65f3bfdb
Replace deprecated Python dependency sklearn with scikit-learn. (#13585) 2022-11-08 09:08:29 -08:00
pengwa
ab9ac2acc4
Add guidelines for ORTModule (#13553)
### Add guidelines for ORTModule

As title.

Feel free to let me know if I missed something. 

2022-11-04 19:42:10 +08:00
zhijiang
1977b7ed6a
Fix pythonop training_mode in evaluation mode (#13514)
A customer reported this issue: they see many warnings when doing
evaluation using ORTModule.


![image](https://user-images.githubusercontent.com/10530022/199371757-5fed7d05-a951-4f1b-8f88-049c5ab89886.png)

After investigation, we found that `training_mode` is exported with the
wrong value in evaluation mode: its value should be 0, but it is 1.

Fix:
Fix the PythonOp training mode.

If training_mode's type is torch._C._onnx.TrainingMode, then no matter
whether it is EVAL or TRAINING, "if training_mode" will always be true.
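
The pitfall can be reproduced with a standard-library stand-in. The `TrainingMode` class below is a plain `enum.Enum` used only for illustration; it is not the real `torch._C._onnx.TrainingMode`, its member values are assumed, and the two `export_flag_*` functions are hypothetical rather than the actual exporter code.

```python
from enum import Enum


class TrainingMode(Enum):
    """Stand-in enum; plain Enum members are truthy regardless of value."""
    EVAL = 0
    TRAINING = 1


def export_flag_buggy(training_mode):
    # Truthiness test: true for every member, including EVAL.
    return 1 if training_mode else 0


def export_flag_fixed(training_mode):
    # Explicit comparison against the member that means "training".
    return 1 if training_mode == TrainingMode.TRAINING else 0


buggy_eval = export_flag_buggy(TrainingMode.EVAL)
fixed_eval = export_flag_fixed(TrainingMode.EVAL)
fixed_train = export_flag_fixed(TrainingMode.TRAINING)
```

Comparing against a specific member, rather than relying on truthiness, keeps the exported value correct for both modes.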
2022-11-04 08:47:01 +08:00
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability that trades additional recomputation
for better memory efficiency. Specifically, a pre-defined operator list
is used to iterate over the graph and find subgraphs to recompute, which
reduces the stashed activations whose lifetime spans the forward and
backward passes.

When training with ORTModule, by default, the graph transformer scans
the execution graph to find all subgraphs eligible for recompute, along
with the sizes that can be saved. An example summary is shown below.
To enable recompute for some of them, define the environment variable
this way:
`export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```

[1,0]<stderr>:2022-10-12 14:47:39.302954530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1024 x 1024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1024 x 4096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4096 x 1024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.

**Baseline**

On a 1P model (DEBERTA V2) with sequence length 256, trained on 16 A100
GPUs: with the latest main branch, we can run batch size 16, and the
maximum batch size is < 32, so 16 is usually chosen by data scientists.
65% of the 40GB memory is used during training.
SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed to lower the memory peak, so batch size 32 can be run.
97% of the 40GB A100 memory is used, and SamplesPerSec=562.041593991271
(**1.17X** of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


2022-11-03 13:49:41 +08:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
PeixuanZuo
6740528b98
[ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525) 2022-11-01 13:05:55 +08:00
Baiju Meswani
c557a55816
Fix on-device training ExportModelForInferencing api (#13510) 2022-10-31 21:29:06 -07:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Vincent Wang
8b0669bf63
QuickGelu Fusion (#12417)
Some models use QuickGelu(x)=x*sigmoid(1.702x), which takes 3 ops for
forward and 5 ops for backward. This PR fuses it into a single op
named QuickGelu, with its gradient op QuickGeluGrad.
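
A scalar Python sketch of the fused forward and backward math follows. It uses the formula from the description with the constant 1.702; the function names and the `(dy, x)` argument order of `quick_gelu_grad` are assumptions for illustration and are not taken from the ORT kernels. The gradient follows from the product rule: d/dx [x * s(ax)] = s(ax) + a * x * s(ax) * (1 - s(ax)).

```python
import math

ALPHA = 1.702  # constant from QuickGelu(x) = x * sigmoid(1.702 * x)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quick_gelu(x):
    """Fused forward: one pass instead of Mul -> Sigmoid -> Mul."""
    return x * sigmoid(ALPHA * x)


def quick_gelu_grad(dy, x):
    """Fused backward: dy * d/dx [x * sigmoid(ALPHA * x)]."""
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))


# Check the analytic gradient against a central finite difference.
x0, eps = 0.5, 1e-6
numeric = (quick_gelu(x0 + eps) - quick_gelu(x0 - eps)) / (2 * eps)
analytic = quick_gelu_grad(1.0, x0)
grad_error = abs(numeric - analytic)
```

Computing the sigmoid once and reusing it in the gradient is what lets a single fused op replace the chain of element-wise ops.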

For CUDA, tested on a V100 using an input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, which is much faster.

![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For the CPU kernel, using the same shape and float type:
Before, FW takes ~10ms, BW takes ~49ms
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes ~4ms, BW takes ~5ms, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-10-28 18:12:07 +08:00
Baiju Meswani
a46c599a40
Training API to export the eval model to an inference model (#13345) 2022-10-27 09:34:01 -07:00
Vincent Wang
805ec459a0
Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462) 2022-10-26 15:45:18 -07:00
Vincent Wang
b6a3562ffb
[ORTModule] Add Env Variable to Control Disabling Custom AutoGrad Function Support (#13430)
Add an env variable to control disabling custom autograd function support.
When using ORTModule, if the torch model has a torch.autograd.Function and
the user confirms that it can be exported to ONNX (for example, by inlining
PythonOp) and that the backward implementation matches the forward
implementation, the user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to
disable the custom autograd support so that ORT's PythonOp is not used
to fall back to PyTorch. Exporting to ONNX can sometimes leverage
graph optimizations in ORT so that perf is better.
2022-10-25 16:58:04 +08:00
cloudhan
2748f38362
Drop hip_add_library (#13406)
Switching to use CMake's builtin hip language support.
2022-10-25 12:57:48 +08:00
Adam Louly
bed169192d
Windows build fix for on device training training. (#13354)
### Description
This is a fix for on device training wheel build.

### Motivation and Context
When building the Linux wheel, PathString is treated the same as
std::string, but building the wheel on Windows fails because the
std::string needs to be cast to a PathString.

This error was found manually because there is no pipeline that uses
--enable_training_on_device for Windows.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-10-20 09:58:02 -07:00
cloudhan
fc12abf6b1
Enable/Disbale tunable GEMM by using tunable switch in provider options and env var (#13116)
Related PRs #12853

This allows the user to enable/disable tunable GEMM on demand.
2022-10-19 22:35:08 -07:00
PeixuanZuo
4b2b588895
[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365)
### Description
Use a SAS token to fix the error `failed to perform copy command due to error:
no SAS token or OAuth token is present and the resource is not public`.

Generate a SAS token for the target data, add it to Key Vault, and use it as
a pipeline variable.



Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-10-20 12:08:57 +08:00
Vincent Wang
67150baa8d
[ORTModule] ATen Support for aten::upsample_nearest (#13364)
ATen support for aten::upsample_nearest, which is required for
Huggingface's diffusers model training using ORTModule.
2022-10-20 08:30:04 +08:00
Vincent Wang
b6b3f41636
Fixes of Hierarchical ORTModule and ORTModule PythonOp (#13347)
The PR applies some fixes to Hierarchical ORTModule and ORTModule
PythonOp.

For Hierarchical ORTModule:
- Don't wrap a module if the caller calls a function other than
forward()
- Support a single module instance being called multiple times with
different types of inputs
- Check whether a module can be wrapped from top to bottom instead of from
bottom to top

For ORTModule PythonOp:
- Add env variable control to allow using
torch.utils.checkpoint.CheckpointFunction
- Add env variable control to skip registering some autograd functions so
that there is no conflict for some models.
2022-10-20 08:16:03 +08:00
Adam Louly
61ee5585b2
update the nightly build to use the latest ptca image. (#13309)
### Description
updating the ptca image used in the nightly pipeline

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-10-17 14:12:03 -07:00
Adam Louly
68eff69ab1
Add Utils for federated learning scenarios (#13014)
**Description**: utils for federated learning.

**Motivation and Context**
- This PR includes utils that will be used on federated learning
scenarios.
- Exposing python bindings to some utils, and added a util to calculate
the difference between two buffers.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-10-17 12:39:43 -07:00
Jeff Daily
65c67764ae
remove line "ADD model ${WORKSPACE_DIR}/model" in the amdgpu Dockerfile (#12914)
Follow-up to #12707. docker build is broken otherwise; model dir is
gone.
2022-10-14 13:17:28 -07:00
Wei-Sheng Chin
dc324b1d90
[LazyTensor] Make LORT Build Again with Latest PyTorch (#13303)
`python setup.py develop` doesn't install PyTorch as a normal package in
site-packages anymore, and the user must stay in PyTorch's root
directory to call `import torch`. This breaks LORT tests because
LORT tests contain `import torch` and are called from outside the PyTorch root
directory. To make PyTorch a normal package again, this PR builds PyTorch
with `python setup.py install`.
2022-10-13 13:56:17 -07:00
Vincent Wang
807b2f4dd5
[ORTModule] Use Env Variable to Set Provider Option cudnn_conv_algo_search (#13296)
This PR adds support for using an env variable to set the provider option
cudnn_conv_algo_search so that the user can choose a better conv algo search
method to run the model. This is a quick fix to unblock the test of the MoE
model. Another PR will design and implement the ORTModule config
so that we can configure ORTModule using a Python script or config file
instead of env variables.
2022-10-13 15:36:21 +08:00
Vincent Wang
6fb70a82df
[ORTModule] Update Supported DeepSpeed Version for FP16_Optimizer (#13305)
Update the highest supported DeepSpeed version from 0.7.1 to 0.7.3 for
FP16_Optimizer. Also add version info to the warning log.
2022-10-13 13:03:01 +08:00
Vincent Wang
afb5f76770
[ORTModule] ATen Support for torch.nn.GroupNorm (#13293)
The model from [huggingface's diffusers
library](https://github.com/huggingface/diffusers) has
torch.nn.GroupNorm, which is exported to a sub-graph containing ONNX's
InstanceNormalization, which lacks a gradient. The implementation of
ORT's InstanceNormalization calls cuDNN's BatchNorm for part of the
computation, which is not efficient compared to PyTorch's
implementation. This PR uses ATen fallback to support this torch
module, including its forward and backward.
2022-10-13 11:59:03 +08:00
PeixuanZuo
6895918b1c
[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297)
### Description
Unit tests with ROCm5.3 are slower than with ROCm5.2.3. Revert to ROCm5.2.3.
We will update to ROCm5.3 when the issue is resolved by AMD.

2022-10-12 10:47:33 -07:00