Commit graph

335 commits

Author SHA1 Message Date
Kyushick Lee
cd24f0794a
Extend ort_backend.py for another ep (#14349)
### Description
<!-- Describe your changes. -->

This PR extends OrtBackend to allow for configuring an EP based on the
name, and fallbacks to existing mechanism that infers the EP based on
tensor affinity if nothing is provided.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Currently OrtBackend needs `get_ort_device()` with the device tag
inferred from torch.Tensor, but ort device is not yet supported for
dort. The change allows run dort with a supported EP, by configuring
dort with a desired EP and letting the dort (ort InferenceSession) take
CPU-affined pytorch Tensors as inputs then inject data transfer nodes
internally.
2023-01-20 07:30:00 -08:00
Ashwini Khade
cc7799835e
Enable a single build with optimized inference and on device training (#14241)
### Description
Right now prepacking code is not compiled when training is enabled. Our
partners want a single build of ort which can do both optimized
inference + training on device. This PR enables prepacking code in a
training build and controls whether it is enabled or not using already
existing session option - kOrtSessionOptionsConfigDisablePrepacking

For Inference scenarios - prepacking will be turned on by default and
this behavior remains the same after this PR too.
For training scenarios - prepacking will be disabled by default and if
user explicitly enables it then an error will be thrown.



### Motivation and Context
Enable both optimized inference as well as on device training in a
single build. For on device training use flag --enable_training_apis.
2023-01-12 21:36:43 -08:00
Xavier Dupré
79dc39600f
Replace distutils by setuptools to import build_ext (#14108)
### Description
Uses setuptools instead of distutils.



### Motivation and Context
Fixes #14107.
2023-01-09 11:48:01 +01:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references of on device training to training apis. This
is to keep the naming general. Nothing really prevents us from using the
same apis on servers\non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: 
   -  Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.

Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (Front end tools for training artifacts prep when using
trianing apis)

### Motivation and Context
Intention is to simply the options for building training enabled builds.
This is part of the larger work item to create dedicated build for
learning on the edge scenarios with just training apis enabled.
2023-01-03 13:28:16 -08:00
Vincent Wang
0c3480e565
[ORTModule] ATen upsample_nearest Gradient Bugfix (#14069)
PyTorch removed upsample_nearest related backward functions with "vec"
overload name since 1.13. The functions without overload name are
available for all versions, though they are not that convienent to use.
This PR changes the gradient builder code to use functions without
overload name for ATen upsample_nearest nodes.

This PR also fixed a bug for ORTModule's corner case introduced by the
multi-stream PR. There is some code to execute the barrier step for
triggered downsteam is the barrier is out of range. But this should be
applied to triggered downstream only. If it's a normal run with start
step as a barrier step but out of range, we should not apply the logic.
For example, for ORTModule, if the barrier is the 1st step of whole CPU
plan, and the forward part is empty, then the forward normal run will
run step from start-0 to end-0 (actually nothing), and step-0 is the
barrier, then we should not execute the barrier in such case.
2022-12-27 10:18:30 +08:00
Adam Louly
e49f358686
expose lr scheduler python bindings for on device training. (#13882)
### Description
Exposing LR Scheduler python bindings for on device training.

Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-12-22 18:44:04 -08:00
pengwa
2f5bf75e51
Optimize computation orders (#13672)
### Optimize computation orders

In `Roberta/Electra`, when `ClassificationHead` is used, there is
slicing operation on features on sequence_length dimensions, then loss
calculations only depend on this sliced data. This is a slicing at axis
1. Before slicing the shape is [batch, sequence_length, hidden], after
slicing, it becomes [batch , hidden_stage]

We had opportunities to bring this slicing earlier as much as possible,
by passing through simple elementwise ops (like Add/Div), or
Layernorm/Softmax(if their reduce axis is after the slicing axis), or
even MatMul's the left operand (if only it did not affect the last
dims).

For operators like Reshape/Transpose, it is special since they have
either data specified (after slicing we need update), or they have perm
specified, which requires the input rank remain unchanged. So for those
kinds of operators, we can remain the original rank, but just leave the
sliced dim to be 1, after the compute completed, we do a Squeeze.

```
class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```

src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py

#### Benchmark

A simple benchmark shows Robeta training latency dropped from 208ms ~
199ms. 4.5+% reduction.
More comprehensive tests are on the way.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-12-22 15:12:52 +08:00
Baiju Meswani
1fd63487fd
ORTModule support for kwargs input that is a dict (#13910) 2022-12-14 16:23:48 -08:00
Baiju Meswani
5a55fac402
Miscellaneous updates to training apis (#13929) 2022-12-14 13:33:07 -08:00
Adam Louly
fb4707f76d
add cuda support to python bindings (#13700)
### Description
Add cuda support to the on device training python bindings.



### Motivation and Context
Now users can set the execution provider (cpu or cuda) when using python
bindings for on device training apis.
2022-12-08 16:03:53 -08:00
Adam Louly
f453d2845e
adding get and set lr for optimizer (#13661)
### Description
Exposing get and set Learning rate for optimizer


### Motivation and Context
you can now set learning rate for optimizer.
2022-12-07 11:59:11 -08:00
Wei-Sheng Chin
7df8f84228
Improve DORT document (#13790)
1. Refine words based on PyTorch changes.
2. Make the need of inference mode clearer. A test is added.
2022-11-30 16:55:25 -08:00
Wei-Sheng Chin
639d285670
[DORT] Catch up with yesterday's PyTorch change (#13779)
Fix recent CI failures.
2022-11-30 09:23:44 -08:00
Xavier Dupré
441b30b2d2
Move a function call outside a loop in ORTModule (#13771)
### Description
The proposed change is useful for ORTModule when the output graph has
multiple outputs.



### Motivation and Context
performance

Signed-off-by: xadupre <xadupre@microsoft.com>
2022-11-30 12:49:41 +01:00
guyang3532
ba9a585fcc
Fix the tensor save for backward release problem (#13679)
Motivation:
PythonOp is saving input for backward, it's risky since ONNX Runtime
backend is not aware of this, the tensor buffer may be "released" by
ORT, then potentially modified by other operators before backward
function executes.

Fix:
This pr just clone all input of PythonOp before forward is invoked. This
may be high overhead, it's just a workaround before a better fix.
2022-11-22 17:32:19 +08:00
pengwa
947aab0ae0
Make HF converge with lighting native amp (#13616)
### Fix training convergence issues 

#### Problem:

Huggingface Transformers: 4.22.0
PyTorch Lightning: 1.6.3 
PyTorch: v1.12.1, cuda 11.6
ORT: main branch, cuda 11.6

Model: RobertaForSequenceClassification @
models/roberta/modeling_roberta.py
Mixed Precision training with `torch.autocast`:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)

Under this amp autocast context, forward + loss computation run. Here is
a snippet of loss computation.

```
        if labels is not None:
                ...
            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                   ...
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                **loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))**
            elif self.config.problem_type == "multi_label_classification":
                ...

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
```

It is found after forward run, loss is 1.0850 in float16, looks good..
Then it did a scaling up here:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62),
the scaler is 65536. then we get a scaled loss 71104 in float type
(because float16 loss multiple fp32 scaler, type got promoted to fp32).
Then backward started with initial grads to be 1, then 1 (float32) *
65536 (float32) as the backward step, generating a float16 gradient,
then we got a `inf`. The problem occurs. With `inf`, the backward feed
the `inf` into crossentropygradient op, generating `nan`s. Then all
gradients got `nan` in back propagation.

So we see training with ORTModule (it almost always `overflow`, the loss
did not drop too much, as compared with PyTorch).

#### Analysis for the UT (when autocast enabled)

PyTorch trace graph looks like this :

```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
      %target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
      %2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
  %9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %12 : NoneType = prim::Constant()
  %13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %17 : NoneType = prim::Constant()
  %18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %19 : NoneType = prim::Constant()
  %input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %21 : NoneType = prim::Constant()
  %22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
  %23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
  %24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
  %data : Float(requires_grad=0, device=cuda:0) = **aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0**
  %27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
  return (%27)
```

The most important lines 

%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%input : **_Half_**(16, 3, strides=[3, 1], requires_grad=0,
device=cuda:0) = aten::linear(%18, %13, %19) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
**_Float_**(requires_grad=0, device=cuda:0) =
aten::cross_entropy_loss(**%_input_**, %target, %21, %22, %23, %24) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0


`aten::cross_entropy_loss` takes Half input, and return Float output. As
said in doc:
https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32,
`cross_entropy` in autocast mode will run in fp32 mode, e.g. convert its
input to fp32 (if it is not), do the compute and return fp32 result. The
other hand, ORT's `SoftmaxCrossEntropyLossInternal` take same types of
input and output, and our code
31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)
when exporting `aten::cross_entropy_loss` assumed this, and set the
output to be fp16 either. So this is the reason we have the problem.

#### Possible Fixes
1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types
of input and output.
2. Check the input and output when exporting, add the input case
explicitly if there is type promotion from input to output.

This PR used the 2nd approach. We can start 1st approach when needed
later.

TODO: revisit all other exporter functions, add the checks, etc. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-11-22 15:08:30 +08:00
Wei-Sheng Chin
6160ba0692
Fix aten::_to_copy in DORT (#13682)
`aten::_to_copy` is not exportable to ONNX. In DORT, so it's replaced in 
`_replace_to_copy_with_to`. This replacement logic becomes incorrect in latest PyTorch
commit, and this PR is a fix.

Basically, we examine more key-word attributes passed to
`aten::_to_copy` and if they lead to a type casting operator (i.e.,
mapped to ONNX's Cast), we replace that `aten::_to_copy` with
`aten::to`. Unsupported attributes are removed (with a low risk of
breaking FX graph's assumptions).
2022-11-18 09:31:18 -08:00
zhijiang
1977b7ed6a
Fix pythonop training_mode in evaluation mode (#13514)
Customer reported this issue: they see many warnings when doing hte
evaluation using ORTModule.


![image](https://user-images.githubusercontent.com/10530022/199371757-5fed7d05-a951-4f1b-8f88-049c5ab89886.png)

After investigation, we found the `training_mode` is exported to a wrong
value in evaluation mode, it's value should be 0, but we found it is 1.

Fix: 
fix pythonop training mode

if training_mode's type is torch._C._onnx.TrainingMode, then not matter
it is EVAL or TRAINING, "if training_mode" will always be true
2022-11-04 08:47:01 +08:00
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability trading additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
used to iterate the Graph to find some subgraphs for recompute, to
reduce some stashed activations whose lifetime across forward and
backward pass.

When training with ORTModule, by default, the graph transformer will
scan the execution graph to find all eligible subgraph to recompute,
along with sizes that can save. An example looks like below.
If we want to enable some of them to recompute, we can define env
variable this way:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```

[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1,024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1,024 x 1,024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1,024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1,024 x 4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x 1,024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.

**Baseline**

On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With latest main branch, we can run batch size 16, and the maximum
batch size < 32. So 16 is usually chosen by data scientists. 65% of 40GB
memory is used during training. The SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed for saving memory peak, batch size 32 can be run. The
97% of 40GB A100 is used, the SamplesPerSec=562.041593991271 (**1.17X**
of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
2022-11-03 13:49:41 +08:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
Baiju Meswani
c557a55816
Fix on-device training ExportModelForInferencing api (#13510) 2022-10-31 21:29:06 -07:00
Baiju Meswani
a46c599a40
Training API to export the eval model to an inference model (#13345) 2022-10-27 09:34:01 -07:00
Vincent Wang
805ec459a0
Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462) 2022-10-26 15:45:18 -07:00
Vincent Wang
b6a3562ffb
[ORTModule] Add Env Variable to Control Disabling Custom AutoGrad Function Support (#13430)
Add env variable to control disabling custom autogard function support.
When using ORTModule, if the torch model has torch.nn.Function, if user
confirms that it can be exported to ONNX (for example, by inline
PythonOp) and the backward implementation is matched to the forward
impl, user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to
disable the custom autograd support so that it won't use ORT's PythonOp
to fallback to PyTorch. Exporting to ONNX sometimes can leverage some
graph optimizations in ORT so that perf is better.
2022-10-25 16:58:04 +08:00
Vincent Wang
67150baa8d
[ORTModule] ATen Support for aten::upsample_nearest (#13364)
ATen support for aten::upsample_nearest, which is required for
Huggingface's diffusers model training using ORTModule.
2022-10-20 08:30:04 +08:00
Vincent Wang
b6b3f41636
Fixes of Hierarchical ORTModule and ORTModule PythonOp (#13347)
The PR applies some fixes to Hierarchical ORTModule and ORTModule
PythonOp.

For Hierarchical ORTModule:
- Don't wrap module if the caller is to call other function instead of
forward() function
- Support single module instance is call multiple times with different
types of inputs
- Check if module can be warped from top to bottom instead of from
bottom to top

For ORTModule PythonOp:
- Add env variable control to allow using
torch.utils.checkpoint.CheckpointFunction
- Add env variable control to skip register some autograd functions so
that there is no conflict for some models.
2022-10-20 08:16:03 +08:00
Adam Louly
68eff69ab1
Add Utils for federated learning scenarios (#13014)
**Description**: utils for federated learning.

**Motivation and Context**
- This PR includes utils that will be used on federated learning
scenarios.
- Exposing python bindings to some utils, and added a util to calculate
the difference between two buffers.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-10-17 12:39:43 -07:00
Vincent Wang
807b2f4dd5
[ORTModule] Use Env Variable to Set Provider Option cudnn_conv_algo_search (#13296)
This PR is to add support of using env variable to set provider option
cudnn_conv_algo_search so that user can choose better conv algo search
method to run model. This is a quick fix to unblock the test of MoE
model. Will have another PR to design and implement the ORTModule config
so that we can config ORTModule using Python script or config file
instead of env variable.
2022-10-13 15:36:21 +08:00
Vincent Wang
6fb70a82df
[ORTModule] Update Supported DeepSpeed Version for FP16_Optimizer (#13305)
Update supported deepspeed highest version from 0.7.1 to 0.7.3 for
FP16_Optimizer. Also add version info to warning log.
2022-10-13 13:03:01 +08:00
Vincent Wang
afb5f76770
[ORTModule] ATen Support for torch.nn.GroupNorm (#13293)
Model [huggingface's diffusers
library](https://github.com/huggingface/diffusers) has
torch.nn.GroupNorm which will be exported to sub-graph containing ONNX's
InstanceNormalization, which is lack of gradient. The implementation of
ORT's InstanceNormalization will call cuDNN's BatchNorm for part of
computation, which is not efficient compared to PyTorch's
implementation. This PR is to use ATen fallback to support this torch
module, including its forward and backward.
2022-10-13 11:59:03 +08:00
Vincent Wang
a2658f0784
[ORTModule] Fix Graph Builder for Eval Mode (#13255)
Current graph builder for ORTModule will apply the training's graph
optimizations for both training and eval mode. Take BatchNorm as
example, one of training's graph optimizations will replace
BatchNormalization Op to BatchNormInternal which is for training only.
This PR is to fix this, for eval mode, we will not apply the training's
graph optimizations. The inference's graph optimizations will be applied
when InferenceSession initialization.
2022-10-12 14:39:54 +08:00
Vincent Wang
b9e23bd086
[ORTModule] Fix Custom Op Registry for Torch 1.13+ (#13250)
This PR has two fixes:
- https://github.com/pytorch/pytorch/pull/85636 change the behavior of
register_custom_op_symbolic to only register the symbolic function at a
single version. For ORTModule we need to pass the op_set version when
calling it.
- Since torch_1.13 the signature of einsum is changed to have a new
argument, need to change our custom op symbolic registry code
accordingly.

Without the fixes, ORTModule will not work with the nightly torch, and
the new torch version will be released.
2022-10-11 15:20:51 +08:00
Baiju Meswani
bcc93ab17c
Deprecate ORTTrainer (#13022) 2022-09-23 18:10:09 -07:00
Adam Louly
268bfe2a5d
python training api bindings (#12610)
**Description**: **Python API Bindings for on device training. **
**Motivation and Context**
- This PR contains api bindings so python users can perform a whole
training loop.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-09-16 09:38:24 -07:00
Thiago Crepaldi
55c745eefd
Add support for ORTModule Torch cpp CUDA extension build within docker (#12868)
Currently, CUDA hardware is not available to be leveraged by build
during `docker build`. because of that, CUDA capable hardware would not
have CUDA support

This PR adds an env varf ONNXRUNTIME_FORCE_CUDA in which it allows CUDA
extensions to be compiled even when CUDA support is not detected.
2022-09-08 15:30:44 -04:00
guyang3532
4765e5c382
Using ORTModule to wrap a evaluation model should not change the mode (#12747)
Using ORTModule to wrap a evaluation model should not change the mode of model
2022-09-08 10:54:59 +08:00
Baiju Meswani
56bae3b196
Use InplaceClipGradNorm for offline processing for on-device training (#12603) 2022-09-02 07:47:17 -07:00
Cheng
5dd9afe75a
python lint (#12825) 2022-09-01 22:38:25 +08:00
Justin Chu
a48b115540
Remove reference to the deprecated variable in torch.onnx.symbolic_helper (#12452)
**Description**: Remove reference to the deprecated variable in `torch.onnx.symbolic_helper` pytorch/pytorch#81953

- Removed unused imports
- Changed BANNED_AUTOGRAD_FUNCTION_NAMES to a frozenset

**Motivation and Context**

The cast_pytorch_to_onnx variable is deprecated and removed in `torch.onnx.symbolic_helper`. Since there is still a need for converting scalar types to onnx type, I copied the mapping to `_CAST_PYTORCH_TO_ONNX` in the module.
2022-08-31 11:55:56 -07:00
Yulong Wang
1a402a3f25
replace 'master' branch ref to 'main' for onnx repo (#12678) 2022-08-30 13:41:42 -07:00
pengwa
a0c25e5c2f
Fix segment fault for alltoall (#12701)
* fix segment fault

* formatting
2022-08-30 11:27:14 +08:00
Vincent Wang
53ecb9e635
Update Supporting DS Version to 0.7.1 for ORTModule (#12696)
update ds version support for fp16_optimizer
2022-08-24 14:56:12 +08:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Wei-Sheng Chin
dc486d146b
Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460)
* Make ORT as Pytorch JIT backend

LORT likely doesn't work with aten fallback so we only test LORT in its own CI.

* Revert changes to enable external CUDA allocator. Will add it later.

Revert "Revert changes to enable external CUDA allocator. Will add it later."

This reverts commit d5487f2e193014c805505afae8fb577c53667658.

Fix external allocator

* Relax tolerance and remove commented code

* Print more information in CI

* Fix pointer

* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.

* Use Pytorch master branch as all PRs are merged

Fix

* Refine based on cpplint feedbacks

* Revert changes to allow custom CUDA allocator in public APIs

* Use torch.testing.assert_close

* Use unittest framework

* Switch docker repo

* Rename *.cpp to *.cc

* Address comments

* Add comment

* Use same pipeline file for eager and lort pipelines

* Address comments

* Add yaml comment

* Fix cmake files

* Address comments

* Rename flags, remove printing code, remove dead comment
2022-08-22 09:40:40 -07:00
Vincent Wang
a078c8d99b
Update Supporting Deepspeed Version of ORTModule's FP16_Optimizer (#12668) 2022-08-22 22:22:53 +08:00
pengwa
24eab921be
Enable PythonOp for --enable_training_torch_interop build (#12539)
* enable PythonOp by default when --enable_training_torch_interop is enabled during build

* clean up

* fix

* fix comment

* fix

* fix tests

* fix fallback test

* pylint format

* refine based on comments
2022-08-12 00:49:30 +08:00
Adam Louly
2681648f5b
Load checkpoint in cpp (#12352)
* Load checkpoint in cpp

* removed unused imports

* throw error on invalid name and change function name

* inplace model assignment, change name and other comments resolved

* name change  on import

* Addded unit test, resolved comments

* remove unused  imports

* resolved comments

* refactoring too reduce memoory allocation

* resolved extra comments

* changed files hierarchy an force added onnx moodel

* solved order of function argument

* used gtest macros on test cases

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-08-09 12:30:50 -07:00
pengwa
a2dc3e9eac
Improve the compilation speed when compiling for multiple architectures. (#12490)
* improve the compilation speed when compiling for multiple architectures.

* formatting

* fix

* use 0 by default

* fix comments
2022-08-09 11:52:26 +08:00
Vincent Wang
e85e31ee80
Update ORTModule Default Opset Version to 15 (#12419)
* update ortmodule opset to 15

* update torch version

* fix ut

* fix ut

* rollback

* rollback for orttrainer
2022-08-05 16:55:04 +08:00
Baiju Meswani
7f58bd7236
Perform graph transformations during offline tooling (#12422) 2022-08-03 11:27:12 -07:00