Commit graph

17 commits

Author SHA1 Message Date
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently memory optimizer runs graph transformations and print
recompute opportunities in INFO level, while ORT backend has many many
INFO level logs making users hard to find those information. So we are
looking for a Python binding API to retrieve the memory optimization
opportunities instead of depending on the MemoryOptimizer's default
logging.
Then we can print ORTModule feature statistics using this information. 
Also, with such an API, we can create an ORT session created, where
allocation plan is done, the analysis will consider buffer reuse as
well. This can void giving some recomputation subgraphs that are reusing
other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull requests made following refactoring:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract optimization strategies for their own implementation. This
will make introducing new strategies (for example compression and
decompression ) easier.

New logging matrix (INFO Level), in WARNING level, the details will NOT
show.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-23 11:39:00 +08:00
pengwa
c8e1038eab
Optimize 4bit Qlora training (#18131)
### Optimize 4bit Qlora training

Extent existing `MatmulBnb4bit` to its usage in training scenarios. 

The PR includes following changes:
1. Add special `torch.autograd.Function` export logic for
`bitsandbytes.autograd._functions.MatMul4Bit` that is preferred before
common PythonOp exporter.
2. Add `training_mode` optional attribute for op `MatmulBnb4bit`, which
help skip some inference specific logic in implementation.
3. Add `transB` optional attribute, which is by default be 1; setting it
to be 0 is needed by backward usage.

Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly ~2.9%
throughput gains. The reason is:
`bitsandbytes.autograd._functions.MatMul4Bit` has logic
`ctx.save_for_backward`, which would need an additional copy in
PythonOp, otherwise, the tensor might be released by ORT, while backward
op still references it.

Removing the clones also reduce the peak memory consumptions because
`bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not
needed in backward compute.
2023-11-02 09:46:11 -07:00
pengwa
0e2782438a
Support inplace update for PythonOp/Grad (#17687)
### Support inplace update for PythonOp/Grad

This PR is based on another PR
https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it
easier to review.

With PR: PR https://github.com/microsoft/onnxruntime/pull/17685, By
default all PythonOp inputs/outputs are assumed to not be inplaced, if
during run, we found some inplace update happens (by checking output
data address with all inputs data address), we add clone before set it
as PythonOp/Grad's outputs. In this case, results are correct, but
implicit copies overheads are introduced.

This PR allow users to define output input reuse map, to let ORT know
how to do the reuse map, avoid such unnecessary copies.
2023-10-10 21:36:45 -07:00
Abhishek Jindal
54b7503c30
create patch for allgather fn for deepspeed stage 3 (#17855)
### Description
<!-- Describe your changes. -->
Patch for All gather fn for Deepspeed Stage 3 changes


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-10-11 11:15:06 +08:00
pengwa
6b7bce5ec9
Model post process for zero stage3 training (#17187)
### Model post process for zero stage3 training

This is the last change to make single GPU/Multiple GPUs run pass. 

Design details:
https://microsoft.sharepoint.com/:p:/t/ONNX2/EfNfJ43necpIoPI6x5M2zvYBVbfjoPQmG4Boc_F7-tHm1w?e=ekQwA6&nav=eyJzSWQiOjMxNiwiY0lkIjoxMDE1Nzg3NDZ9

`PyTorch` runs with ZeROOffloadSubscriber:

```
  model = prepare_model(...)
  from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
  configure_ort_compatible_zero_stage3()
```

`ORTModule` runs with ZeROOffloadSubscriber:

```
  os.environ['ORTMODULE_ENABLE_ZERO_STAGE3'] = '1'
  from onnxruntime.training.ortmodule import ORTModule
  model = ORTModule(self.model)
```

It will be fairly easy to debug convergence issue if both ORT and
PyTorch can run the same offload path.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-22 08:54:25 +08:00
pengwa
d90afc697b
Introduce ZeROOffloadSubscriber for ORTModule (#17006)
### Introduce ZeROOffloadSubscriber for ORTModule

As part of the work: integrate ORTModule with DeepSpeed stage3, this PR
mainly focus on moving original PyTorch-based (leveraging hooks) param
partition/offload implementation to ORTModule compatible implementation.

Changes include:
1. Refactor `SubscriberBase`/`SubcriberManager` to support
pre-forward/post_forward hooks.
2. Implement new `ZeROOffloadSubscriber` by re-using DeepSpeed hook
function as much as possible. Since all hook functions are defined in
`DeepSpeedZeRoOffload._register_hooks_recursively` and
`DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the good thing is,
the closure is not complex, all hooks are referencing the owning
`DeepSpeedZeRoOffload` instance, so we can create new hook function with
`FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance,
then call the new created function in subscriber's
`pre_forward_module_apply_impl` and `post_forward_module_apply_impl`
interfaces.
3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to
register the `ZeROOffloadSubscriber` for the model, then we don't need
change any code on the DeepSpeed repo (at least so far).
4. Fix the ATen embedding custom symbolic exporter function by
tolerating weights size be (0) (changed by DeepSpeed zero stage 3).

UT will be added once stage3 is fully supported. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-25 00:15:22 +08:00
pengwa
cd7b3f54da
Allow defining customized PythonOp shape inferer (#17093)
### Allow defining customized PythonOp shape inferer

For `torch.autograd.Function`, we converted it to PythonOp in MSDomain,
there are two places to do shape inferencing for it:

1. in SymbolicShapeInfer, there is one. 
2. in PythonOp op definition. 

For common PythonOp, since we don't know the relation ship between
inputs and outputs, so we only infer the rank from output ranks, and
generate symbolic dimensions for each dim. While this will introduce
many meaningless symbolic dimensions, sometimes blocking our graph
transformers to do op fusion.

This PR provide a way to define custom shape inferencing for
`torch.autograd.Function` we defined, to propagate the original
dimensions across the PythonOp at the best efforts.

But the 2rd one is not covered yet, we could refine that later. Fixing
1st one is enough for ORTModule training/evaluation.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-14 09:13:32 +08:00
pengwa
6e6f582e08
Use full qualified name for PythonOp export (#17021)
### Use full qualified name for PythonOp export

Originally, when there are duplicate named torch.autograd.Function in
different module, for example:

`a.b.c.Gelu` v.s. `d.e.func.<locals>.Gelu`

We by default will throw exception to let user be aware we cannot
distinguish the two Gelu because during model export, we did not module
path. The workaround is we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore those duplicated named
Gelu that is not used by model run. This has limitations obviously for
example if two Gelus are both used in training.



This PR finds a way to construct a full qualified name.

`def _export_pt_1_10(g, n, *args, **kwargs):`

1. in exporter function, kwargs contains `name` and `module`, in the
above example:
   `a.b.c.Gelu`  --> name: `Gelu`, module: `a.b.c`
   `d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
   
 
Using name and module is not enough to get a full qualified name, for
the second case, where `d.e` is the module path, then there is a
function called `func`, in this function, there is a local
auto.grad.Function named `Gelu`. (Many of our UT looks like this). We
can only get `d.e.Gelu`, but this is not the correct full qual name.

The reason for this: `kwargs[name]` or `n.name` only return the class's
name, not the class's full qual name. (be noted kwargs[module]` is
correct).

2. `n` is torch.Node, we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. Then we can get the `module` and
`class`'s ful qual name, added together, we get the full qual name.

With the above change, we don't need use `kwargs[name]` and
`kwargs[module]` , and don't need check naming conflicting or
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var any more.
2023-08-09 10:58:33 +08:00
pengwa
a6887f171f
Refactor schema extraction and output unflattening (#16894)
### Motivation and Context

When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common for us to flatten a structured data into a 1-D
tensor list (required by lib for example torch.onnx.export,
torch.autograd.Function.forward or ORT inference session), then do
subsequent work, then unflatten back to original hierarchy as returned
values.

DeepStage3 hooks support work also need such a lib to do similar things,
so I was proposing to extract this pair of APIs in training/utils/,
which can be more used more generally. Also a comprehensive set of test
data are used for testing unflatten/flatten in unit tests.

Let me know if you have any other suggestions. 


### Refactor schema extraction and output unflattening

Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` . to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (deepspeed stage
3 hook rewrite).

While there are still a few duplicated logic handling flatten with
different task by recursively loop the data struct, will change them
step by step in case of heavy review efforts.
2023-08-04 13:58:21 +08:00
pengwa
735a32fee1
Introduce memory observer for ORTModule (#16213)
### Introduce memory observer for ORTModule

To analyze memory usage for ORTModule training, we need collect
per-iteration memory footprint in different stages (pre-forward,
post-forward, pre-backward, and post-backward).

Currently we only collect the data using torch.cuda APIs. The next step
is, we could collect the detailed stashed activation list and its
percentage within ORT backend, which is beyond this PR.

Sample as below: 
```
0/8] step 0 memory (MiB) | phase: pre_forward | allocated: 1866 | max allocated: 1866 | cached: 1874 | max cached: 1874 | inactive: 8 | max inactive: 8
[0/8] step 0 memory (MiB) | phase: post_forward | allocated: 23277 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 193 | max inactive: 405
[0/8] step 0 memory (MiB) | phase: pre_backward | allocated: 23277 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 193 | max inactive: 405
[0/8] step 0 memory (MiB) | phase: post_backward | allocated: 2932 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 6158 | max inactive: 6158
  0%|█                                                                                                                                                                                                            | 1/200 [00:26<1:26:18, 26.02s/it]
[0/8] step 1 memory (MiB) | phase: pre_forward | allocated: 2356 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 2454 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: post_forward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: pre_backward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: post_backward | allocated: 3422 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 5284 | max inactive: 6165
  1%|██                                                                                                                                                                                                             | 2/200 [00:26<36:47, 11.15s/it]
[0/8] step 2 memory (MiB) | phase: pre_forward | allocated: 2356 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2454 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: post_forward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: pre_backward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: post_backward | allocated: 3422 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 5284 | max inactive: 6165
```
2023-06-15 15:45:36 +08:00
pengwa
40bcc0441b
Enhance StatisticsSubscriber (#16098)
### Enhance StatisticsSubscriber

There are few improvements for `StatisticsSubscriber`:

- Reduce peak memory impact for tensors (having many many many elements,
consuming too much GPU memory, causing original recipe run failed with
OOM), by split the statistics into two phases (split into buckets, and
merge result across buckets).
- Allow dump intermediate tensors. Originally only nn.Module forward()'s
return value are dumped, there are requirements we want to inspect some
specific intermediate tensor in the forward() function, now we support
it.
- Add documents for collecting dumps on multiple ranks

Docs link on this branch for better view:
https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool_v2/docs/ORTModule_Convergence_Notes.md

---------

Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-06-12 18:32:08 +08:00
Justin Chu
a36caba073
Bump ruff in CI (#15533)
### Description

Bump ruff version in CI and fixed new lint errors. 

- This change enables the flake8-implicit-str-concat rules which helps
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
Justin Chu
938e2136c6
Enable pylint and numpy rules (#15218)
### Description

Enable pylint and numpy rules

### Motivation and Context

Modernize numpy usage and enable more quality checks
2023-03-27 20:37:53 -07:00
Justin Chu
d834ec895a
Adopt linrtunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and MacOs.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are applied `# noqa` for now and
should be fixed in follow up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns**:

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports
	
	The follow up PR will remove:
	
	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 ci flow and updated docs.

4. The added workflow supports SARIF code scanning reports on github,
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Unified linting experience in CI and local.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00
pengwa
1d32285536
Statistics tool for ORTModule convergence parity (#15020)
### Statistics tool for ORTModule convergence parity

As ORTModule get more and more validated, it is pretty fast to
intergrade PyTorch based model with ORT.

The same time, we need make sure once there is convergence issue, we
don't spend months of time to investigate. As part of this efforts, this
PR is introducing a tool to dump activation statistics without much
involvement from users. The dumping results contains only some statistic
numbers plus sampled data, which is not big, compared with dumping all
the tensors, it is much faster and space efficient.

For us to use it, two single lines are needed before wrapping ORTModule.
For baseline run, need also apply the same trick.

```
+	from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+	SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```

Once you run the steps, following command can be used to merge result
into per-step-summary respectively for ORT and baseline runs.
 
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```

Docs is added here as part of this PR [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)

Based on the generated merged files, we can compare them with tools. 


![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png)

### Design and Implementation

This PR introduced a common mechanism registering custom logic for
nn.Module's post forward hooks. And statistics for activation
(StatisticsSubscriber) is one of the implementations. If there is other
needs, we can define another XXSubscriber to do the customized things.
2023-03-23 20:34:24 +08:00
Justin Chu
fdce4fa6af
Format all python files under onnxruntime with black and isort (#11324)
Description: Format all python files under onnxruntime with black and isort.

After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame.

#11315, #11316
2022-04-26 09:35:16 -07:00
Baiju Meswani
7691e7ed12
Introduce load balancing dataset samplers (#10163) 2022-02-14 13:46:14 -08:00