Commit graph

1401 commits

Author SHA1 Message Date
pengwa
7cb8b20db2
Remove mem consuming test case to unblock running ci on lower-end gpu (#19059)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-09 20:05:34 +08:00
Wei-Sheng Chin
658e30eb33
Remove DORT since it's in PyTorch main now (#18996)
Main code are removed and tests are modified to use DORT directly from
PyTorch.
2024-01-04 12:59:47 -08:00
PeixuanZuo
7a454acd61
[ROCm] Update CI/Packaging pipeline to ROCm6.0 (#18985)
Update CI/Packaing pipeline to ROCm6.0
2024-01-03 17:25:15 +08:00
pengwa
998517b209
Minor fixes (#18949)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-28 20:01:06 +08:00
pengwa
5eda79bdd3
Improve perf for stage3 training (#18099)
### Improve perf for stage3 training - first wave

Port existing PythonOp/PythonOpGrad python runner to C++, also introduce
an unsafe run mode (to skip inplace, save for backward, materrialized
grad detection on the fly).

This reduce the overhead from XX~XXX us to X ~ lower end of XX us . In
LLAMA2 7B training with 8x32GV100, we have observed 6.7% gains over
PyTorch. (1.59 v.s. 1.49it/s)

Peak memory also dropped from 31GB to 28GB.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-15 13:32:19 +08:00
Ashwini Khade
487abcd25e
Update gradient ops tests (#18783)
### Description
<!-- Describe your changes. -->
TrainingSession has been deprecated for a while now, but the gradient
ops tests are still using training session. This PR updates these tests
to use inference session instead of training session.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This will enable us to remove all the training session related
deprecated code from the repo.
2023-12-13 11:26:52 -08:00
pengwa
dbe886abb3
Disable test_bert_result_with_layerwise_recompute (#18800)
### Disable test_bert_result_with_layerwise_recompute
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-13 12:16:39 +08:00
pengwa
ccf3b2054b
Allow layer-wise recompute (#18566)
### Allow layer-wise recompute 

Early, we need users/developers to specify the subgraphs to recompute,
now we introduced a more user-friendly way to enable recompute for all
detected stashed activation recomputation subgraphs. This scarifies
getting the best configs while makes it easier to support user
requirements when they switches from PyTorch per-layer gradient
checkpoint to ORTModule.

`ORTMODULE_MEMORY_OPT_LEVEL` is introduced to control the usage, by
default, it is 0, e.g. `USER_SPECIFIED`, all subgraphs definedin
`ORTMODULE_MEMORY_OPT_CONFIG` will be recomputed. So this is compatible
to existing recompute usage in ORTModule integrated models.

Using `ORTMODULE_MEMORY_OPT_LEVEL=1`, we will enable all recompute plans
detected, so those configs in `ORTMODULE_MEMORY_OPT_CONFIG` will not be
respected any more.


Add Unit Tests using 3 layer blooms. 



https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md
2023-12-12 08:44:05 +08:00
Ashwini Khade
16df8377d3
Update transformers package to fix the security issue (#18730)
### Description
Updating transformers package in test pipeline to fix a security
vulnerability.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-11 09:15:23 -08:00
pengwa
4bfa84487c
Skip module clone for preparing large model export (#18663)
### Skip module clone for preparing large model export

For LLAMA2 13B, when running with Lora, DeepSpeed stage2 on 8 GPUs . It
failed during preparing outputs which will be used for
torch.onnx.export. The reason, we deep copy all the params including
both big sizes of frozen weights, + a little bit of Lora trainable
weight.

This PR will firstly check whether the GPU memmory is enough for a
cloned module, if not, skip the copy.

Copying the module is to guarantee the fw path run may change the
weight, while this case should be rare. But for now, Not-Able-To-Run is
worse than Runnable-with-A-little-bit-different-initial-weight,
especially for large models.
2023-12-05 12:41:17 -08:00
Bowen Bao
fcea2cb7f1
[Dort] Run type promotion pass to resolve dtype discrepancy (#18516)
Fixes CI failures mentioned in #18507

But we should not keep two separate dort impls in both pytorch and
onnxruntime. They are out of sync.
2023-12-01 09:36:18 -08:00
guyang3532
182c525416
Support MatMulBnb4 in PaddingElimination (#18646)
Also support Cast pattern between input and embedding node for sparsity
inspecting
2023-12-01 19:27:50 +08:00
Changming Sun
23a91c8ba8
Fix warning C4003 in ORT python binding code (#18612)
### Description
Fix warning C4003 in ORT python binding code.

### Motivation and Context
It's better to fix the warning instead of suppressing it.
2023-11-30 08:07:47 -08:00
Vincent Wang
148495ebc5
[ORTModule] Use Default Topo-order for GraphViewer (#18410)
ORT's default topo-order is a reversed DFS algorithm, while the
priority-based topo-order is a forward BFS algorithm. It's likely that
the default order is better than priority-based order on memory because
tensor memory is more likely to be released right after it's consumed.

Currently ORTModule uses priority-based order, for some models, it sorts
lots of small Ops to the beginning, this introduces big CPU overhead at
the beginning (see below screenshot), this PR is to use default order
for training. The priority-based order is heavily used for some
recompute optimization, so if there is recompute enabled, we will still
use priority-based order.

This PR also adds an optimization to the default order, which is to move
all Shape/Size Ops to right after their parent nodes. This is to make
sure the shape and size nodes are executed right after their parents so
it's possible the input tensor memory can be released as soon as
possible. This is especially important for non-CPU devices or for
training case where some gradient graphs use only shape/size of tensors
from forward.

Profiling result:
Before
<img width="910" alt="截屏2023-11-13 12 09 02"
src="https://github.com/microsoft/onnxruntime/assets/11661208/e54d5ead-274f-4725-923e-521bbcfce752">

After
<img width="910" alt="截屏2023-11-13 12 10 44"
src="https://github.com/microsoft/onnxruntime/assets/11661208/f50d196d-11ac-43a2-9493-517e4552ffab">
2023-11-30 20:17:22 +08:00
Vincent Wang
e1d1033131
[ORTModule] Remove Unused Arguments from Generated Triton Code (#18636)
This PR:
- Remove unused arguments from generated triton code,
- Remove unnecessary mask for symbolic shape case from generated triton
code.
- Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE.
2023-11-30 18:32:36 +08:00
Ran Gal
3f42fbad2e
deleted the unused random_device variables because they caused a warning that was treated like an error. (#18543)
deleted the unused random_device variables because they caused a warning
that was treated like an error.

**_Please check if the declaration is required for the random number
generation. if so, there need to be a dummy reference to the variable or
turning off the warning as error behavior._**

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-28 08:54:38 +01:00
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently memory optimizer runs graph transformations and print
recompute opportunities in INFO level, while ORT backend has many many
INFO level logs making users hard to find those information. So we are
looking for a Python binding API to retrieve the memory optimization
opportunities instead of depending on the MemoryOptimizer's default
logging.
Then we can print ORTModule feature statistics using this information. 
Also, with such an API, we can create an ORT session created, where
allocation plan is done, the analysis will consider buffer reuse as
well. This can void giving some recomputation subgraphs that are reusing
other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull requests made following refactoring:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract optimization strategies for their own implementation. This
will make introducing new strategies (for example compression and
decompression ) easier.

New logging matrix (INFO Level), in WARNING level, the details will NOT
show.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-23 11:39:00 +08:00
Xavier Dupré
32fabb5555
Fix opset version of the optimizer in function generate_artifacts (#18300)
### Description
`generate_artifacts` generates 4 graphs for training. All graphs should
share the same opset version, the one coming from the model to train,
but the optimizer is left undefined. onnxruntime is using the latest
version defined by onnx but onnxruntime does not necessarily support it.

### Motivation and Context
The code does not let the user change it.
2023-11-22 09:15:11 -08:00
Vincent Wang
3bc9efc7b2
[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471)
Adjust attention patterns to match latest Whisper+exporter. Also add
some condition check and add docs.
2023-11-22 15:24:05 +08:00
Jambay Kinley
1af0681554
Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484)
### Description
<!-- Describe your changes. -->
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.

The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable QLoRA fine-tuning with bfloat16.
2023-11-20 09:52:58 -08:00
Wei-Sheng Chin
3bcc137eb4
Tiny change to trigger the update of DORT's CI image (#18507)
Recent PyTorch breaks DORT CI and [a
patch](https://github.com/pytorch/pytorch/pull/113697) has been merged
into PyTorch main. In order to update DORT's CI, we made dummy change in
this PR.
2023-11-19 22:09:11 -08:00
Ashwini Khade
02333293de
Removed all the deprecated python training code and related tests and utils (#18333)
### Description
Motivation for this PR is code cleanup.

1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
2023-11-17 18:19:21 -08:00
zhijiang
16d7f55193
lora conv1d replacement (#16643)
in LoRA code, it will use conv1d to do projection for qkv, while the
conv1d calculation is mathematically equivalent to matmul, and matmul is
much faster than conv1d.
The subsitution of the graph optimizer is: 1 conv1d >> 2 split + 1
squeeze + group_num matmul + 1 concat

with this optimizer, we see 10%+ in one 1P model
2023-11-16 17:08:06 +08:00
guyang3532
751aa8d31a
fix axis of layernorm for UpstreamReshape (#18425)
Similar to https://github.com/microsoft/onnxruntime/pull/17255
update axis for Layernormalization when Reshape upstream it.
2023-11-16 16:29:00 +08:00
Vincent Wang
ed89ca573a
[ORTModule] Support User Config for Triton Codegen, Bugfix for Reduce-to-scalar (#18448)
User can provide Triton codegen config JSON through env variable. Also
fix some bugs related to reduction to scalar case.
2023-11-15 17:16:38 +08:00
Vincent Wang
4a82030339
[ORTModule] Symbolic Shape Support for Triton Codegen (#18317)
Add symbolic shape support for Triton codegen for ORTModule.
2023-11-13 12:16:27 +08:00
guyang3532
4dc63692f8
Add FlattenAndUnpad Op (#17845)
### Description
Add an op named `FlattenAndUnpad`.
This op implements functions:
1. Flatten the first two dims of input tensor.
2. Gather valid value from input tensor with index tensor,.


### Motivation and Context
The grad op of `PadAndUnflatten` was `GatherGrad` which is inefficient
in performance.
I implement this `FlattenAndUnpad` just to replace the `GatherGrad` as
grad of `PadAndUnflatten`.
With this op, we also can simplify the "Reshape + ShrunkenGather"
pattern to `PadAndUnflatten` in padding elimination optimizer, which
will also improve performance.
2023-11-09 09:52:48 +08:00
Justin Chu
c250540722
Bump linter versions (#18341)
Bump linter versions and run format.
2023-11-08 13:04:40 -08:00
Prathik Rao
34f77eaa24
bfloat16 support for quickgelugrad (#18336)
### Description
<!-- Describe your changes. -->

Registers BFloat16 datatype as valid input type for CUDA QuickGeluGrad
Kernel.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-08 08:40:02 -08:00
pengwa
2151c79bf1
Tune ORTModule logging experience a bit (#18298)
### Tune logging experience a bit

After last time we update the ORTModule log experience, we found few
issues:
1. `INFO` level output too many things, including PyTorch exporter
verbose logs (tracing graphs) on every ranks. On this level, we only
want to
- Output a little bit more information to Users than `WARNING` level,
for example the memory recomputation recommendations or other
not-fully-ready features.
- Output a little bit more information for a quick diagnostic, collected
on rank-0 only.
2. ONNX Runtime logging filter during graph build, session init
sometimes will hide the issues (for example segement fault), there is no
useful information in `WARNING`/`INFO` for users to report to us. This
is not good!
3. Some of our devs like using `pdb` to debug Python code, but if we add
`import pdb; pdb.set_trace()` in models' code might hang when they use
`INFO` or `WARNING`, where exporter happens and all output got
redirected due to log filtering. The only workaround is to switch to
VERBOSE, which output toooooooooooo many logs.

The corresponding changes proposed here are:
1. For `INFO` logging, 
    - We only logs rank-0. 
- We restricted the ORT backend logging level to be WARNING in this
case, because ORT backend code output way too many logs that should be
under verbose, while we cannot guarantee we can get them cleaned up
immediately once they are added.
- We output the PyTorch exporter verbose log (including tracing graph),
which is useful for a quick diagnostic when an issue happens.
2. Remove all logging filtering on ORT backend, then the segment fault
issue details will not be hidden once it happens again.
 3. Introduced a `DEVINFO` logging,
     - Log logs on all ranks
     - Log ORT backend logging level INFO
- PyTorch exporter logging filtering are all turned OFF (to unblock the
pdb debugging).
4. Currently, to use Memory Optimizer, need use DEVINFO (which will
output ORT backend INFO log). So update memory optimizer document to
reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will
update the requirement back to INFO for show memory optimization infos.

You can check
https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations
for a better view of different log levels.

This PR also extract some changes from a bigger one
https://github.com/microsoft/onnxruntime/pull/17481, to reduce its
complexity for review.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-11-08 17:42:50 +08:00
Prathik Rao
83c0275354
add bfloat16 support for ConcatTraining and SplitTraining ops (#18280)
### Description
<!-- Describe your changes. -->

Updates input/output type constraints on training operators
ConcatTraining and SplitTraining to include bfloat16 which was
introduced in IR version 4.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-07 10:10:01 -08:00
pengwa
4f15b42728
Customize _get_tensor_rank for model export in stage3 (#18294)
### Customize _get_tensor_rank for model export in stage3

Weight/Params sizes are all (0), so exporter logic depending on input
shape will fail.

This PR override `_get_tensor_rank` function by retrieving the shape for
weight differently.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-07 16:37:11 +08:00
zhijiang
630c877b43
Zhijxu/improve ortmodule python perf a little bit (#13716)
improve 2 python functions a little bit.

according to a profiling result from a real user case, we find that 2
python function can be improved. the first is the result before
improvement, the second is after improvement, we can see 8ms saved from
the improvement.

![image](https://user-images.githubusercontent.com/43435212/202961725-b88d679e-993b-4910-a339-253f3ed5dcde.png)

![image](https://user-images.githubusercontent.com/43435212/202961732-6c6deebf-962f-4392-90d7-03705433e3ee.png)
2023-11-07 15:24:57 +08:00
pengwa
c8e1038eab
Optimize 4bit Qlora training (#18131)
### Optimize 4bit Qlora training

Extent existing `MatmulBnb4bit` to its usage in training scenarios. 

The PR includes following changes:
1. Add special `torch.autograd.Function` export logic for
`bitsandbytes.autograd._functions.MatMul4Bit` that is preferred before
common PythonOp exporter.
2. Add `training_mode` optional attribute for op `MatmulBnb4bit`, which
help skip some inference specific logic in implementation.
3. Add `transB` optional attribute, which is by default be 1; setting it
to be 0 is needed by backward usage.

Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly ~2.9%
throughput gains. The reason is:
`bitsandbytes.autograd._functions.MatMul4Bit` has logic
`ctx.save_for_backward`, which would need an additional copy in
PythonOp, otherwise, the tensor might be released by ORT, while backward
op still references it.

Removing the clones also reduce the peak memory consumptions because
`bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not
needed in backward compute.
2023-11-02 09:46:11 -07:00
Vincent Wang
1c25fe5580
Fix PoliCheck (#18180)
Fix PoliCheck by changing some words, which was from Triton flash
attention's original code.
2023-10-31 13:53:11 +08:00
guyang3532
58f1d15d19
Replace Transpose with Replace if they are equivalent (#18096)
### Description
Transpose is equivalent to a Reshape if:
 empty dimensions can change place, not empty dimensions must be in
 the same order in the permuted tenosr.
 Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1).
This pr adds a graph transformer which replaces Transpose with Reshape
if they are equivalent.
Because Transpose need memory copy while Reshape needn't, this
replacement can save overhead for memory copy.
2023-10-27 23:50:18 +08:00
Vincent Wang
b7408f7389
[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959)
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton Flash attention, which requires
triton==2.0.0.dev20221202. Need A100 or H100.
ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A python transformer tool to match sub-graph by config and write
transformer quickly.

Current transformers supports attention mask for both efficient attn and
flash attn, and dropout for efficient attn only. To support more
training scenarios (such as causal mask in GPT2), more transformers need
to be added.

The feature is guarded by system environment variables, it won't effect
any current behavior if not enabled. Since it requires specific
PyTorch/Triton versions, related tests is not added for now.
2023-10-27 10:29:27 +08:00
pengwa
2c6b31c5aa
FP16 optimizer automatically detect DeepSpeed compatibility (#18084)
### FP16 optimizer automatically detect DeepSpeed compatibility

Optimum/Transformers are using accelerate lib to prepare models, so our
FP16 optimizer wrapper does not work for long time. Because the
namespace is `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`,
which underlying is still calling into DeepSpeed stage1and2 optimizer.

This PR includes following changes:
1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` in the
modifier registry, plus a check on its contained `optimizer` property
MUST be DeepSpeed stage 1 and 2 optimizer. (let's cover Stage 3
optimizer later)
2. For DeepSpeed version > 0.9.1, we will store the source code in a
version list. As long as the related function in DeepSpeed remains
unchanged during its new release, we won't need manually upgrade the
version check any more. If some day, the source code did not match, a
warning will be raised to users, to add a new version of source code in
the list.

With the above change, we will have our FP16 Optimizer working again in
Optimum.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)
2023-10-25 15:11:02 +08:00
pengwa
444a0eda30
Avoid one time clone to save memory peak (#17934)
### Avoid one more time clone to save memory peak
2023-10-21 19:45:45 +08:00
Baiju Meswani
a43c57f59d
ResizeGrad CUDA/ROCM kernel implementation (#17772) 2023-10-20 11:39:57 -07:00
Vincent Wang
fa0a79a921
Fix Triton Compile Error for Codegened Dropout Code (#17899) 2023-10-12 20:57:14 +08:00
pengwa
0e2782438a
Support inplace update for PythonOp/Grad (#17687)
### Support inplace update for PythonOp/Grad

This PR is based on another PR
https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it
easier to review.

With PR: PR https://github.com/microsoft/onnxruntime/pull/17685, By
default all PythonOp inputs/outputs are assumed to not be inplaced, if
during run, we found some inplace update happens (by checking output
data address with all inputs data address), we add clone before set it
as PythonOp/Grad's outputs. In this case, results are correct, but
implicit copies overheads are introduced.

This PR allow users to define output input reuse map, to let ORT know
how to do the reuse map, avoid such unnecessary copies.
2023-10-10 21:36:45 -07:00
Abhishek Jindal
54b7503c30
create patch for allgather fn for deepspeed stage 3 (#17855)
### Description
<!-- Describe your changes. -->
Patch for All gather fn for Deepspeed Stage 3 changes


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-10-11 11:15:06 +08:00
PeixuanZuo
2ef6ee674c
[ROCm] Update ROCm and MIGraphX CI to ROCm5.7 (#17834)
- Update ROCm and MIGraphX CI to ROCm5.7
- Simplify test exculde file. Some tests will output `registered
execution providers ROCMExecutionProvider were unable to run the model.`
if they cannot run.
- Add `enable_training` build argument for MIGraphX pipeline.
2023-10-09 10:29:11 +08:00
pengwa
7201def4ec
Fix convergence for dolly+stage3 training (#17685)
### Fix convergence for dolly+stage3 training

In
[ZeROOffloadSubscriber](216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L359C7-L359C28)),
we defined some PythonOp, taking input and returning it inplace, for
example:

216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L223C20-L223C20).
While it is possible, when ORT runs such a PythonOp, once it completes,
it will release the input OrtValue, triggered the data erasing or
overridden. But the PythonOp's returned value OrtValue are still
pointing to that address, reading or writting on that may introduce a
wrong result or even undefined behaviors.


```
/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy.
  warnings.warn(f".rank-{get_rank()}: {message}")
  0%|▏                                                                                                                                                                                                                                               | 1/1000 [00:04<1:15:08,  4.51s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0}
  0%|▏                                                                                                                                                                                                                                               | 1/1000 [00:04<1:15:08,  4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7
  0%|▍                                                                                                                                                                                                                                                 | 2/1000 [00:04<31:53,  1.92s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|▋                                                                                                                                                                                                                                                 | 3/1000 [00:04<18:05,  1.09s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,227 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|▋                                                                                                                                                                                                                                                 | 3/1000 [00:04<18:05,  1.09s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,326 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|█▏                                                                                                                                                                                                                                                | 5/1000 [00:04<08:44,  1.90it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,419 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|█▏                                                                                                                                                                                                                                                | 5/1000 [00:04<08:44,  1.90it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,505 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|█▋                                                                                                                                                                                                                                                | 7/1000 [00:05<05:28,  3.02it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,597 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|█▋                                                                                                                                                                                                                                                | 7/1000 [00:05<05:28,  3.02it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,690 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▏                                                                                                                                                                                                                                               | 9/1000 [00:05<03:57,  4.17it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,791 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▏                                                                                                                                                                                                                                               | 9/1000 [00:05<03:57,  4.17it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,889 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▋                                                                                                                                                                                                                                              | 11/1000 [00:05<03:06,  5.32it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,981 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▋                                                                                                                                                                                                                                              | 11/1000 [00:05<03:06,  5.32it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,073 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  1%|███▏                                                                                                                                                                                                                                             | 13/1000 [00:05<02:33,  6.42it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,166 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  1%|███▏                                                                                                                                                                                                                                             | 13/1000 [00:05<02:33,  6.42it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,256 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|███▌                                                                                                                                                                                                                                             | 15/1000 [00:05<02:12,  7.43it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,348 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|███▌                                                                                                                                                                                                                                             | 15/1000 [00:05<02:12,  7.43it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,439 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|████                                                                                                                                                                                                                                             | 17/1000 [00:06<01:59,  8.22it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,535 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|████                                                                                                                                                                                                                                             | 17/1000 [00:06<01:59,  8.22it/s]Traceback (most recent call last):
  File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module>
    main()
  File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train
    return inner_training_loop(
  File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop
    self.deepspeed.step()
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step
    self._take_model_step(lr_kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step
    self.optimizer.step()
  File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step
    if self._overflow_check_and_loss_scale_update():
  File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update
    self._update_scale(self.overflow)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
  2%|████                                                                                                                                                                                                                                             | 17/1000 [00:06<06:07,  2.67it/s]
[2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/onnxruntime/training/language-modeling/run_clm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-25_08:30:51
  host      : orttrainingdev10.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1065120)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(/bert_ort/pengwa/py38) pengwa@microsoft.com@orttrainingdev10:/bert_ort/pengwa/optim
```

## The Fix

For those output that are reusing input, but ORT is not aware of, we
detected on the fly (the first iteration, by checking the output tensor
addresses with input tensor addresses) , then do implicit copy before
set it as PythonOp's output tensors.


With this fix: (left: PyTorch, right: ORT)


![image](https://github.com/microsoft/onnxruntime/assets/10530022/0d72f431-2abd-4e52-af99-19974b85edde)
2023-10-07 08:40:19 +08:00
Justin Chu
be7541ef4a
[Linter] Bump ruff and remove pylint (#17797)
Bump ruff version and remove pylint from the linter list. Fix any new
error detected by ruff.

### Motivation and Context

Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slow, we remove it from the linters
2023-10-05 21:07:33 -07:00
shaahji
5a623dca01
Python API to check whether collective ops are available or not (#17730)
Python API to check whether collective ops are available or not

### Description
<!-- Describe your changes. -->

Adding an API to check whether collective ops are available or not.
Since there is no independent MPI enabled build, this flag can be used
on Python front for branching. Specifically, to conditionally enable
tests.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Flag to be used in Python to check whether onnxruntime supports
collective ops or not. Handy for conditionally enabling/disabling tests
and for other branching decisions.
2023-09-29 14:11:05 -07:00
Vincent Wang
e6aa0fa174
Add Gelu Related Ops to Triton Codegen (#17713)
Add Gelu/QuickGelu/GeluGrad/QuickGeluGrad support to Triton Codegen so
that it can be fused with some other connected supported Ops. For
example, in llama2, it can be fused with Mul so we will have extra 1-2%
perf gain.
2023-09-27 19:57:39 +08:00
Scott McKay
33295ed883
Handle string initializers in constant folding (#17422)
### Description
<!-- Describe your changes. -->
* Allow either an allocator or a MemBuffer to be used when creating an
OrtValue from an TensorProto
* `Tensor<std::string>` requires an allocator to allocate/free the
string values
* Forcing the buffer to be allocated outside of the Tensor doesn't seem
to provide any benefit in this usage as the Tensor class disables copy
and assignment (so we wouldn't create 2 copies of the buffer via the
Tensor class that externally managing the would buffer avoid)
* New approach means we don't need to manage the buffers in the
optimizer Info class as the Tensor dtor will do that
* Update naming - MLValue was replaced by OrtValue a long time ago

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17392
2023-09-27 21:15:58 +10:00
liqun Fu
2be4dc6d04
ONNX 1.15 integration (#17125)
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released


### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-09-26 14:44:48 -07:00