Commit graph

1259 commits

Author SHA1 Message Date
zhijiang
4dc4470cc7
Fix fusion for two LayerNorm sharing same input but with different weights (#15919)
In gpt_j_residual (https://arxiv.org/pdf/2204.06745.pdf), two LN
nodes share the same input. ORT runs the CSE graph optimization
before LN fusion, which modifies the LN graph pattern and makes
LN fusion fail.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/40990fd6-796f-4edf-be0b-3203e8503678)
2023-05-22 08:26:36 +08:00
Vrajang Parikh
5abaca9d69
add maybe unused attribute to vars only used for logging (#15970)
### Description
Add maybe_unused attribute to variables that are only used for logging



### Motivation and Context
Building ORT with training using Xcode 14.3 causes a
`-Wunused-but-set-variable` error because some variables are created and
used exclusively for debug logging. Adding maybe_unused suppresses
warnings on unused variables when logging is disabled and fixes the
local build.
2023-05-17 10:24:13 -07:00
PeixuanZuo
e96f10d27b
[ROCm] reduce batch size to fix CI error (#15714)
The ROCm CI batch size test occasionally fails. Try reducing the batch
size to fix it.

error log:
Non-zero status code returned while running FusedMatMul node.
Name:'MatMul_2914_Grad/FusedMatMul_0' Status Message: HIP error
hipErrorNotFound:named symbol not found
Non-zero status code returned while running Gemm node.
Name:'MatMul_2891_Grad/Gemm_5' Status Message: HIP error
hipErrorNotFound:named symbol not found
2023-05-16 13:10:02 +08:00
PeixuanZuo
af6cb2af87
[ROCm] update ROCm/MIGraphX CI to ROCm5.5 (#15905)
Update ROCm/MIGraphX CI to ROCm 5.5.

TODO:
Two PRs to fix failures in
orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py:
- test_gradient_correctness_minmax/test_gradient_correctness_argmax_unfold/test_gradient_correctness_argmax_diagonal
(https://github.com/microsoft/onnxruntime/pull/15903)
- test_ortmodule_attribute_name_collision_warning
(https://github.com/microsoft/onnxruntime/pull/15884)
2023-05-15 10:28:15 +08:00
pengwa
fed52053a7
Refine a bit (on device training) (#15803)
### A few minor refinements:
- Simplify ParameterOptimizerState a bit
- Use inlined containers
- Remove GetStateDict APIs
- Re-enable cuda test for lr scheduler
2023-05-10 20:36:13 -07:00
pengwa
003c7d3e4d
Add CPU allocation test for multiple GPU distributed run (#15829)
### Add CPU allocation test for non-CPU devices distributed run

When CUDA EP is enabled in distributed training, CPU memory is still
used for some node outputs. We already had distributed run test coverage,
but it did not cover the case where some nodes use CPU devices for
storing tensor output. As a result, as far as I recall, we hit
regressions twice in the past months:
- https://github.com/microsoft/onnxruntime/pull/14050
- https://github.com/microsoft/onnxruntime/pull/15823

So this PR adds a test to avoid future regressions.

The test graph looks like this:


![image](https://user-images.githubusercontent.com/10530022/236594940-70c68a55-18bf-4e09-bbf5-8a64895d3045.png)



2023-05-09 10:27:19 +08:00
Changming Sun
34fcdd83c8
Update softmax_grad_impl.cu: add constexpr (#15794)
### Description
Add a "constexpr" keyword to fix a static analysis warning
2023-05-04 08:10:17 -07:00
Baiju Meswani
2d519d21af
Python documentation for onnxruntime-training (#15765) 2023-05-02 16:58:16 -07:00
Ashwini Khade
0ffae8073b
Creating Nuget and Android packages for Training (#15712)
### Description
This PR creates Nuget and Android packages for Training.


### Motivation and Context
These packages are intended to be released in ORT 1.15 to enable
On-Device Training Scenarios.

## Packaging Story for Learning On The Edge Release
### Nuget Packages:
1. New Native package -> **Microsoft.ML.OnnxRuntime.Training** (Native
package will contain binaries for: win-x86, win-x64, win-arm, win-arm64,
linux-x64, linux-arm64, android)
2. C# bindings will be added to existing package ->
**Microsoft.ML.OnnxRuntime.Managed**

### Android Package published to Maven:
1. New package for training (full build) ->
**onnxruntime-training-android-full-aar**

### Python Package published to PyPi:
1. Python bindings and offline tooling will be added to the existing ort
training package -> **onnxruntime-training**
2023-05-01 12:59:56 -07:00
Yuhong Guo
41dcf0d32e
Expose build information in dynamic lib (#15643)
### Description
1. Add Build Info API to onnx.
2. Fix compile error while building onnxruntime_benchmark on macOS.


### Motivation and Context
1. When the Onnxruntime lib is serving online, we need a way to detect how
this lib was built. This PR lets the developer get the build
information (git branch, git commit id, build type and cmake cxx flags)
using `strings`, as shown below.


![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png)


![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png)

If the build env has no git, there will be no git-related info:


![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png)


2. Fix the following compile error while building benchmark on macOS.

![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png)
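
The `strings`-based lookup described above can be approximated in a few lines of Python. This is a minimal sketch and not part of the PR: the marker text `ORT Build Info`, the sample bytes, and the helper name `find_build_info` are assumptions made for illustration, so substitute whatever marker the library actually embeds.

```python
import re

# Hypothetical marker; the real text embedded in the library may differ.
MARKER = b"ORT Build Info"


def find_build_info(blob: bytes, min_len: int = 4) -> list:
    """Return printable ASCII runs (similar to `strings`) that contain MARKER."""
    # A "string" here is a run of at least min_len printable ASCII bytes.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [
        match.group().decode("ascii")
        for match in pattern.finditer(blob)
        if MARKER in match.group()
    ]


# Made-up bytes standing in for the contents of a shared library on disk.
fake_lib = b"\x00\x01ELF\x00ORT Build Info: git-branch=main, build type=Release\x00\xff\x10junk"
build_info = find_build_info(fake_lib)
print(build_info)
```

For a real library, the same helper could be fed `open(path, "rb").read()` instead of the made-up bytes.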

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-04-28 21:57:31 -07:00
pengwa
29d13cea42
Cumulative update on optimizers and tests (on-device training) (#15499) 2023-04-28 09:55:39 -07:00
pengwa
2efb75bfe9
Fold shape related operation (#14936)
### Fold shape-related operations on a best-effort basis

This is a follow-up to PR
https://github.com/microsoft/onnxruntime/pull/12561.
It creates a specialized shape_optimizer to constant fold shape-related
operations.
ShapeOptimizer makes a best effort to constant fold the dim values that
are known from shape inferencing. This helps simplify the graph,
which in turn helps other graph transformers do more.

The transformer traverses the graph top-down and performs shape
optimizations.
It makes a best effort to constant fold the shape-related Shape node
outputs:
1. Shape generates 1D tensor [12, 128, 512] (all dimensions have
concrete dim values), which can be constant folded
to an initializer including 1D tensor values [12, 128, 512]. (Some logic
of ConstantFolding also does the same thing.)
2. Shape generates 1D tensor [batch_size, 128, 512] ->
Slice(start=1,end=3): we can constant fold the Shape->Slice to
  an initializer including 1D tensor values [128, 512].
3. Shape generates 1D tensor [batch_size, 128, 512] -> Gather(axes=[0],
index=[2]): we can constant fold the
  Shape->Gather to an initializer including 1D tensor values [512].
4. Shape 15 takes an input of shape [batch_size, 128, 512] and slices
from 1 to 2 (exclusive): we can constant fold the
Shape15(start=1,end=2) to an initializer including 1D tensor values
[128].
This helps clean up the graph; combined with ConstantFolding, the
graph becomes much simpler.
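
The folding rule in the cases above can be sketched in plain Python. This is an illustration only, not the ORT transformer: the helper names `fold_shape_slice` and `fold_shape_gather` are hypothetical, and symbolic dims are modeled as strings.

```python
from typing import List, Optional, Union

Dim = Union[int, str]  # int = concrete dim value, str = symbolic dim such as "batch_size"


def fold_shape_slice(input_shape: List[Dim], start: int, end: int) -> Optional[List[int]]:
    """Return the constant for Shape -> Slice(start, end), or None if not foldable."""
    dims = input_shape[start:end]
    # Folding is only valid when every selected dim is a concrete integer.
    if all(isinstance(d, int) for d in dims):
        return list(dims)
    return None


def fold_shape_gather(input_shape: List[Dim], index: int) -> Optional[List[int]]:
    """Return the constant for Shape -> Gather(index), or None if not foldable."""
    dim = input_shape[index]
    return [dim] if isinstance(dim, int) else None


shape = ["batch_size", 128, 512]
print(fold_shape_slice(shape, 1, 3))  # concrete dims only, so foldable
print(fold_shape_slice(shape, 0, 3))  # includes batch_size, so not foldable
print(fold_shape_gather(shape, 2))
```

The key design point is that a single symbolic dim inside the selected range blocks folding, while symbolic dims outside the range do not matter.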


### Motivation and Context



One direct motivation to have this is, we have a model subgraph like
this:

![image](https://user-images.githubusercontent.com/10530022/223390243-47b13922-4340-4999-9637-f52a33f69a2d.png)

The subgraph in the green rectangle is trying to get the value `30522`.
With the changes in this PR, the subgraph will be constant folded. In
addition, the ConstantFolding optimizer will further optimize out the
subsequent `Squeeze`/`Unsqueeze`/`ConcatTraining`, leaving a very
clean Reshape node whose shape input is a constant `[-1, 30522]`.

With this simplified graph, our other compute optimizers can
further optimize the graph by re-ordering gather/reshape nodes.
2023-04-27 18:59:28 +08:00
Rui Ren
db6a9bc033
support latest deepspeed version for optim (#15682)
### Description

support the latest deepspeed 0.9.1 for the next release


### Motivation and Context
This will avoid the warning message `Skip modifying optimizer because of
unsupported DeepSpeed version`

---------

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-04-25 20:12:23 -07:00
Rui Ren
4c3e350a6a
fix ORTModuleONNXModelException fallback OOM (#15523)
### Description
### Error 
```
RuntimeError: There was an error while exporting the PyTorch model to ONNX:-

Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 385, in _get_exported_model
    torch.onnx.export(self._flattened_module,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/__init__.py", line 305, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 118, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 743, in _export
    proto, export_map, val_use_external_data_format = graph._export_onnx(
RuntimeError: ONNX export failed: Couldn't export Python operator XDropout
```
The error leads to an out-of-memory issue because the log.txt file grows
to **26 GB**.


### Motivation and Context
The root cause is that in each `_forward`
```
      if log_level <= _logger.LogLevel.WARNING and not self._raised_ORTModuleONNXModelException:
          warnings.warn(
              (
                  f"Fallback to PyTorch due to exception {type(self._exception)} was triggered. "
                  "Report this issue with a minimal repro at https://www.github.com/microsoft/onnxruntime. "
                  f"See details below:\n\n{_utils.get_exception_as_string(self._exception)}"
              ),
              UserWarning,
          )
```


the code above is called and logs the `exception` through
`get_exception_as_string`.

In my training case, this leads to about 40k `Traceback` printouts
and 110 million lines of `onnx graph` output on stdout, and the run hits OOM.

### Validation

After the above fixes, the log.txt file is only **2.4 MB**.
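
The general remedy for this kind of log flooding is a warn-once guard. The sketch below illustrates that pattern and is not the code changed in this PR: the class `FallbackNotifier` and its `_already_warned` flag are hypothetical names.

```python
import warnings


class FallbackNotifier:
    """Hypothetical helper: emit the detailed fallback warning only once."""

    def __init__(self) -> None:
        self._already_warned = False

    def notify(self, exception: Exception) -> None:
        if self._already_warned:
            return  # later forward calls stay silent
        self._already_warned = True
        warnings.warn(
            f"Fallback to PyTorch due to exception {type(exception)} was triggered. "
            f"See details below:\n\n{exception}",
            UserWarning,
        )


notifier = FallbackNotifier()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # do not let the default filter hide repeats
    for _ in range(40_000):  # stand-in for 40k forward calls
        notifier.notify(RuntimeError("ONNX export failed"))

print(len(caught))
```

With the guard in place the expensive traceback string is built once, no matter how many forward calls follow.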

---------

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-04-25 15:10:31 -07:00
Baiju Meswani
5885abfb35
Training Documentation (#15612) 2023-04-25 11:44:12 -07:00
Wei-Sheng Chin
d0c3f92ec6
[DORT] Fix fake tensor problem caused by PyTorch change (#15664)
This should make `Orttraining Linux Lazy Tensor CI Pipeline` green
again.
2023-04-25 19:56:42 +08:00
Ashwini Khade
124ea0a801
remove compute optimizer from lte (learning on the edge) builds (#15637)
### Description
Removing compute optimizer from on device training builds.

### Motivation and Context
1. mitigate android build failures
2. reduce binary size

Since only CPU EP is enabled for LTE builds, we can optimize the models
offline.
2023-04-24 15:57:15 -07:00
Baiju Meswani
fd6ecc3909
Add env to the TrainingSession constructor (#15635) 2023-04-21 21:05:46 -07:00
Baiju Meswani
b5a1941835
C, C++, Python, C# API update for on device training (#15518) 2023-04-21 11:36:01 -07:00
Baiju Meswani
46210556f0
BatchnormInternal avoid setting num_channels if input shape is not known (#15544) 2023-04-20 12:57:16 -07:00
Baiju Meswani
11b0a18de6
Add support for cuda 11.8 and python 3.11 for training (#15548) 2023-04-20 12:56:45 -07:00
Justin Chu
831734a46e
Fix lint errors missed due to new commits (#15558)
Follow up of #15524
2023-04-18 12:55:02 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or are data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
liqun Fu
919d8f2660
update with onnx main (#14929) 2023-04-18 08:42:51 -07:00
Justin Chu
a36caba073
Bump ruff in CI (#15533)
### Description

Bump ruff version in CI and fixed new lint errors. 

- This change enables the flake8-implicit-str-concat rules, which help
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.
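
As an illustration of the bug class the implicit-str-concat rules target, here is a minimal example. The list contents are made up for this sketch and do not come from the repository.

```python
# A missing comma silently merges two adjacent string literals into one.
merged = [
    "alpha"  # <- comma forgotten here
    "beta",
]

# The intended list, with the comma restored.
intended = [
    "alpha",
    "beta",
]

print(len(merged), len(intended))
```

Python accepts the first list without complaint, which is why a lint rule is useful here.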


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
Wei-Sheng Chin
ac6ceffb2c
Force using fixed random seeds for flaky tests (#15515)
Some gradient-related tests fail frequently due to their math
properties. This PR fixes their random seed so that it's possible to
debug in the future.
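
To show why a fixed seed makes such failures reproducible, here is a small gradient check in plain Python. It is a sketch under stated assumptions, not the ORT gradient checker: the function names, the test function `sum(x^2)`, and the seed value are all chosen for illustration.

```python
import random


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def check_gradient(seed):
    rng = random.Random(seed)  # fixed seed gives identical inputs on every run
    x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    analytic = [2.0 * v for v in x]  # d/dx sum(x^2) = 2x
    numeric = numeric_grad(lambda v: sum(t * t for t in v), x)
    max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
    return x, max_error


inputs_a, err_a = check_gradient(seed=1234)
inputs_b, err_b = check_gradient(seed=1234)
print(inputs_a == inputs_b, err_a)
```

Because both runs draw the same inputs, a failing tolerance check can be replayed exactly instead of depending on luck.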

Fixed
[AB#14605](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14605),
[AB#14604](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14604)
2023-04-14 18:44:51 -07:00
zhijiang
05ec22330f
softmax perf improvement pr2 - import softmax bw (#15199)
When the softmax dimension is 2048, the original ORT code falls back to
cuDNN. With some optimizations to ORT's softmax_warp_backward, we
can be faster than the cuDNN implementation.

The ideas for optimizing softmax_warp_backward are:
1. Instead of saving intermediate results in registers, we recompute
them to save resources.
2. Save the input data in fp16 instead of fp32 to further save resources.
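
The first idea (recompute instead of store) can be illustrated on the softmax backward formula `dx = y * (dy - sum(dy * y))`. The Python sketch below only shows the store-versus-recompute trade-off; it is not the warp-level CUDA kernel, it does not model the fp16 storage idea, and the function names are made up for this example.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward_cached(y, dy):
    # Keeps every dy*y product alive, analogous to holding them in registers.
    prods = [g * o for g, o in zip(dy, y)]
    s = sum(prods)
    return [p - o * s for p, o in zip(prods, y)]


def softmax_backward_recompute(y, dy):
    # Pass 1: reduce without storing the per-element products.
    s = 0.0
    for g, o in zip(dy, y):
        s += g * o
    # Pass 2: recompute dy*y instead of reading a saved copy.
    return [g * o - o * s for g, o in zip(dy, y)]


y = softmax([1.0, 2.0, 3.0, 4.0])
dy = [0.1, -0.2, 0.3, 0.4]
dx_cached = softmax_backward_cached(y, dy)
dx_recomputed = softmax_backward_recompute(y, dy)
print(dx_cached)
print(dx_recomputed)
```

The recompute variant does one extra multiply per element but needs no per-element temporary storage, which is the resource saving the list refers to.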

the perf numbers:

![image](https://user-images.githubusercontent.com/43435212/227476335-ae0b61c4-cd15-40b7-b743-a956fadaedda.png)

Note that when the softmax dimension is less than 2048, nothing
changes, so perf numbers are given only for the 2048 case.


Additional perf numbers for smaller batch sizes:

![image](https://user-images.githubusercontent.com/43435212/231676120-c8944b09-a664-43f3-a1e8-dfe729c6e816.png)
2023-04-13 14:57:01 +08:00
mindest
67ac36101c
disable BatchNormalizationGrad test (#15485)
### Description
Temporarily disable BatchNormalizationGrad test due to random failure.

Example:

```
2023-04-12T06:33:24.1593811Z 1: [ RUN ] GradientCheckerTest.BatchNormalizationGrad
2023-04-12T06:33:27.5603881Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1468): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance)
2023-04-12T06:33:27.5604509Z 1: Actual: false
2023-04-12T06:33:27.5604719Z 1: Expected: true
2023-04-12T06:33:27.5604997Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240;
2023-04-12T06:33:27.5605266Z 1: Google Test trace:
2023-04-12T06:33:27.5605531Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910
2023-04-12T06:33:27.5605843Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678
2023-04-12T06:33:27.5606478Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234
2023-04-12T06:33:27.8285560Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1493): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance)
2023-04-12T06:33:27.8286181Z 1: Actual: false
2023-04-12T06:33:27.8286404Z 1: Expected: true
2023-04-12T06:33:27.8286669Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240;
2023-04-12T06:33:27.8286942Z 1: Google Test trace:
2023-04-12T06:33:27.8287208Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910
2023-04-12T06:33:27.8287532Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678
2023-04-12T06:33:27.8287849Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234
2023-04-12T06:33:51.6368960Z 1: [ FAILED ] GradientCheckerTest.BatchNormalizationGrad (27475 ms)
```
2023-04-13 14:53:47 +08:00
pengwa
516c8e95fa
Optimize SCE loss compute (#15401)
### Optimize SCE loss compute

Compute optimization based on label data sparsity:
- Insert ShrunkenGather before SCELoss node, to filter out invalid
labels for compute.
- Support ShrunkenGather upstream.
- Added test for the above.
- Added a flag to enable label sparsity optimization through an env var; it
is disabled by default for now and will be enabled after comprehensive
benchmarking.
- Extract common logic into test_optimizer_utils.h/cc from
core/optimizer/compute_optimizer_test.cc, so the common functions can
be shared by both core/optimizer/compute_optimizer_test.cc and
orttraining/core/optimizer/compute_optimizer_test.cc
- Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and
`Create1DInitializerFromVector`
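
The label sparsity idea in the list above can be sketched as follows: gather only the rows with valid labels before computing the loss, instead of computing over every row and masking afterwards. This is a plain-Python illustration, not the ORT graph transformation; the function names are hypothetical and the ignore value `-100` is an assumption.

```python
import math

IGNORE_INDEX = -100  # assumption: label value marking positions to skip


def cross_entropy(logits_row, label):
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits_row))
    return log_sum - logits_row[label]


def sce_loss_dense(logits, labels):
    # Touches every row, then masks out the ignored ones.
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        valid = label != IGNORE_INDEX
        loss = cross_entropy(row, label if valid else 0)
        total += loss if valid else 0.0
        count += valid
    return total / count


def sce_loss_shrunken(logits, labels):
    # Gather the valid rows first, so the loss only sees the kept rows.
    keep = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
    kept_logits = [logits[i] for i in keep]
    kept_labels = [labels[i] for i in keep]
    losses = [cross_entropy(r, l) for r, l in zip(kept_logits, kept_labels)]
    return sum(losses) / len(losses)


logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.0, 1.0]]
labels = [0, IGNORE_INDEX, 2]
loss_dense = sce_loss_dense(logits, labels)
loss_shrunken = sce_loss_shrunken(logits, labels)
print(loss_dense, loss_shrunken)
```

Both paths give the same loss; the second one simply does less work when many labels are ignored.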


2023-04-13 13:02:12 +08:00
zhijiang
29c74d3c43
softmax perf improvement pr1 - add more softmax related test (#15176)
1. Add fp16 tests.
2. Add tests for shapes that are not a power of two.
2023-04-11 17:02:40 +08:00
Changming Sun
d175e87a1f
Delete eager mode code and increase minimal required python version to 3.8 (#15450)
### Description
1. Delete eager mode code.
2. Increase the minimal required python version to 3.8.
2023-04-10 16:00:04 -07:00
Pranav Prakash
3c5d02a9ce
Implement BatchNormGradient kernel for CPU EP (#7622)
**Description**: Register an implementation for BatchNormInternal and
add a CPU kernel for BatchNormGradient. This is the third in a series of
PRs to implement BN training on CPU (first was #6946, second was #7539).

**Motivation and Context**
Support training networks with BatchNorm (e.g. convnets). Also note that
there exists a CUDA kernel for BN (forward training & backwards) but
it's currently disabled due to flaky failures; someone more familiar
with those parts can register the implementation for BNInternal on CUDA
(gradient kernel doesn't have to change).

---------

Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-04-08 09:20:26 +08:00
Rui Ren
5e2f46df2b
update deepspeed version 0.8.3 (#15415)
### Description
Update the supported DeepSpeed version to 0.8.3, as it is the latest version


### Motivation and Context
This will fix the error `Skip modifying optimizer because of
unsupported DeepSpeed version`

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-04-07 17:59:50 -07:00
pengwa
16f5909f2d
Introduce shrunken gather operator (#15396)
### Introduce shrunken gather operator

The existing Gather operator schema does not guarantee that the output
element count is smaller than the input element count.
In fact, the output element count can be >, =, or < the input element
count.

For cases where we know for sure that the output element count MUST be <=
the input element count, we will upstream those Gather operators to
reduce compute flops.

So this PR introduces a ShrunkenGather operator, which explicitly guarantees
that the output count will be smaller than the input count. The operator adds
an additional restriction on inputs, but still re-uses the existing Gather
implementation plus an input check at runtime.

This is a requirement for the subsequent optimization (Draft PR:
https://github.com/microsoft/onnxruntime/pull/15401) we will do for
label sparsity and embedding sparsity.
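
The difference between a plain gather and a gather with a runtime "shrinks" check can be sketched like this. It is a simplified illustration on Python lists along axis 0, not the operator's implementation; the `<=` comparison follows the "MUST be <=" statement above and is an assumption about the exact check.

```python
def gather(data, indices):
    """Plain gather along axis 0: the output may be larger than the input."""
    return [data[i] for i in indices]


def shrunken_gather(data, indices):
    """Gather along axis 0, rejecting index lists that would grow the output."""
    if len(indices) > len(data):
        raise ValueError(
            f"ShrunkenGather expects at most {len(data)} indices, got {len(indices)}"
        )
    return gather(data, indices)


rows = [[1, 2], [3, 4], [5, 6]]
grown = gather(rows, [0, 0, 1, 1, 2])  # 5 rows out of 3: allowed by plain gather
kept = shrunken_gather(rows, [0, 2])   # 2 rows out of 3: passes the check
print(len(grown), kept)
```

Knowing the output never grows is what makes it safe for an optimizer to move such a node earlier in the graph to cut compute.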
2023-04-07 15:12:58 +08:00
Thuy Dao
6e1e808ec8
fix error unqualified call to 'std::move' (#15347) 2023-04-05 20:40:30 -07:00
pengwa
fe0db63dee
Upstream reshape of merging batch/sequence (#15023)
### Upstream reshape of merging batch/sequence

For a Reshape node that fulfills the following requirements:
- input data rank = 3
- input shape is a constant initializer, and the untouched dim value MUST be a
constant value.
- Reshape is merging the first two dimensions, so output data rank = 2.

We upstream it to make it run as early as possible. Doing this
allows us to upstream other operators (Gather) that are blocked by this
kind of Reshape node.

Currently, we have not enabled it in graph_transformer_utils, since the
combined upstream gather changes are not ready yet.

Before:


![image](https://user-images.githubusercontent.com/10530022/224698252-f9705082-9710-4385-95ec-f1ccf50dc0e3.png)


After:


![image](https://user-images.githubusercontent.com/10530022/224698381-7e124d0d-ba47-4f35-8e37-6015014cd1c4.png)
2023-04-05 18:51:07 +08:00
Baiju Meswani
6b755debbc
Miscellaneous updates to training artifact generation (#15315) 2023-04-04 20:09:51 -07:00
Nhat Nguyen
198994d01d
Register PytorchAtenDomain in RegisterOrtOpSchemas (#14567) 2023-04-04 17:34:13 -07:00
pengwa
5baf5f506b
log level control + fix typos (#15302)
### log level control + fix typos
2023-04-04 20:19:13 +08:00
Baiju Meswani
e870089ca8
Refining the offline tooling for training artifact generation (#15212) 2023-03-30 18:05:51 -07:00
Justin Chu
710d095124
Refactor the constant _ONE in orttraining_test_ortmodule_api.py (#15128)
Follow up of
https://github.com/microsoft/onnxruntime/pull/15097#discussion_r1142399537
2023-03-28 08:59:51 -07:00
Justin Chu
938e2136c6
Enable pylint and numpy rules (#15218)
### Description

Enable pylint and numpy rules

### Motivation and Context

Modernize numpy usage and enable more quality checks
2023-03-27 20:37:53 -07:00
Justin Chu
d834ec895a
Adopt lintrunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and macOS.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` to onnxruntime and fixes ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are marked with `# noqa` for now and
should be fixed in follow up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns**:

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports
	
	The follow up PR will remove:
	
	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 ci flow and updated docs.

4. The added workflow supports SARIF code scanning reports on github,
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
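
One of the patterns named in the list above, mutable values as defaults in function definitions, is worth a short demonstration of why linters flag it. The function names below are made up for this example.

```python
def append_bad(item, bucket=[]):
    # The default list is created once and shared across all calls.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # A fresh list is created on each call that omits `bucket`.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_bad(1)
second = append_bad(2)  # same list object as `first`
print(first is second, second)

print(append_good(1), append_good(2))
```

The `None` sentinel is the usual replacement because it keeps the call signature unchanged.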

### Motivation and Context

Unified linting experience in CI and local.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00
PeixuanZuo
56bccac35d
[ROCm] update bert-L convergence reference file to fix CI (#15200)
The layernorm change altered the bert-L convergence result.
2023-03-24 21:43:44 +08:00
pengwa
1d32285536
Statistics tool for ORTModule convergence parity (#15020)
### Statistics tool for ORTModule convergence parity

As ORTModule gets more and more validated, it is pretty fast to
integrate a PyTorch-based model with ORT.

At the same time, we need to make sure that once there is a convergence
issue, we don't spend months investigating it. As part of this effort, this
PR introduces a tool to dump activation statistics without much
involvement from users. The dumped results contain only some statistics
plus sampled data, which is small; compared with dumping all
the tensors, it is much faster and more space efficient.

To use it, two lines are needed before wrapping with ORTModule.
The same change must also be applied to the baseline run.

```
+	from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+	SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```

Once you run the steps, the following command can be used to merge the
results into per-step summaries for the ORT and baseline runs respectively.
 
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```

Docs are added as part of this PR: [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)

Based on the generated merged files, we can compare them with tools. 


![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png)

### Design and Implementation

This PR introduces a common mechanism for registering custom logic as
nn.Module post-forward hooks. Statistics for activations
(StatisticsSubscriber) is one of the implementations. If there are other
needs, we can define another XXSubscriber to do the customized work.
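
The subscriber mechanism described above can be sketched without PyTorch. Everything in this block is a stand-in made up for illustration: `TinyModule` imitates only the post-forward hook behaviour of nn.Module, and `MinMaxSubscriber` and `subscribe` are hypothetical names, not the classes added by this PR.

```python
class TinyModule:
    """Stand-in for nn.Module with just enough hook support for the sketch."""

    def __init__(self, scale):
        self.scale = scale
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def __call__(self, values):
        output = [v * self.scale for v in values]
        for hook in self._forward_hooks:
            hook(self, values, output)  # runs after forward, like a post-forward hook
        return output


class MinMaxSubscriber:
    """Hypothetical subscriber that records cheap statistics, not full tensors."""

    def __init__(self):
        self.records = []

    def on_forward(self, module, inputs, output):
        self.records.append({"min": min(output), "max": max(output), "count": len(output)})


def subscribe(module, subscribers):
    # Attach every subscriber's callback as a post-forward hook.
    for subscriber in subscribers:
        module.register_forward_hook(subscriber.on_forward)


model = TinyModule(scale=2.0)
stats = MinMaxSubscriber()
subscribe(model, [stats])
result = model([1.0, -3.0, 2.5])
print(stats.records)
```

Adding a new kind of check then only requires another subscriber class; the module and the registration code stay the same.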
2023-03-23 20:34:24 +08:00
pengwa
7bec80d92a
Fix reference count for autograd.Function (#15121)
### Fix reference count for autograd

When the PythonOp kernel is initialized, `AddPointerScalarArgs` creates
`const_args_`, which holds all non-tensor references (including
ProcessGroup, string, or other user types).

In the kernel's destructor, the ref count is decreased for every entry in `const_args_`.


```
void PythonOpBase::Clear() {
  for (auto ptr : const_args_) {
    auto obj = reinterpret_cast<PyObject*>(ptr);
    Py_DECREF(obj);
  }
}
```

This means we never increased the count but still decrease it. Running the
unit test then throws a segmentation fault. The simple fix is to remove the
Py_DECREF for those pointer-type constant inputs triggered by the kernel
destructor.

NONTENSOR_OBJECT_POINTER_STORE is the place where we increase the reference
during export; the reference then remains until the python program
terminates.
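
The idea of a store that owns one reference per exported object, so that a raw address stays valid, can be sketched in Python. This is an illustration of the ownership pattern only: `POINTER_STORE`, `register_pointer`, `resolve_pointer` and `ProcessGroupLike` are hypothetical names, and the `id()`-as-address trick is specific to CPython.

```python
import ctypes
import gc
import weakref

# Hypothetical stand-in for a process-wide store that owns one reference per object.
POINTER_STORE = {}


def register_pointer(obj) -> int:
    """Keep `obj` alive and return its address (in CPython, id() is the address)."""
    address = id(obj)
    POINTER_STORE[address] = obj  # the store now holds a strong reference
    return address


def resolve_pointer(address: int):
    # Safe only while the store still owns a reference to the object.
    return ctypes.cast(address, ctypes.py_object).value


class ProcessGroupLike:
    """Placeholder for a non-tensor argument such as a process group."""


group = ProcessGroupLike()
alive = weakref.ref(group)
addr = register_pointer(group)
del group  # drop the caller's reference
gc.collect()

still_alive = alive() is not None
same_object = resolve_pointer(addr) is alive()
print(still_alive, same_object)
```

A consumer that only reads the address must not release a reference it never acquired, which is the imbalance the fix above removes.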


Additional tunings:
1. Move some logs to verbose instead of warning to avoid flooding
training logs.
2. Move pointer type ref holding from the python side
(NONTENSOR_OBJECT_POINTER_STORE) to
orttraining/orttraining/core/framework/torch/custom_function_register.h.
This gives us a consistent approach to managing ref count increases and
decreases for all PythonOp related python objects/methods.
2023-03-23 12:51:50 +08:00
Baiju Meswani
0086f7590d
LSTM and LSTM gradient implementation for training (#15034) 2023-03-21 21:44:08 -07:00
Justin Chu
bdd7bd084c
Remove the use of eval in test code (#15097)
### Description

Remove the use of `eval` in test code so we don't (1) use eval and (2)
create "unused" local vars that ruff will remove. Predecessor to #15085
2023-03-20 09:43:56 -07:00
pengwa
1ccb79476c
Fix training gpu ci related to pl upgrade (#15092)
### Fix training gpu ci related to pl upgrade

As a new version of pl was released, the old `gpus` parameter of
pl.Trainer appears to be no longer supported. So we switch to the new
params to make it work.

```
['/home/onnxruntimedev/miniconda3/bin/python3', 'orttraining_test_ortmodule_torch_lightning_basic.py', '--train-steps=470', '--epochs=2', '--batch-size=256', '--data-dir', '/mnist'] 

/home/onnxruntimedev/miniconda3/lib/python3.8/site-packages/torch/onnx/utils.py:1794: FutureWarning: The first argument to symbolic functions is deprecated in 1.13 and will be removed in the future. Please annotate treat the first argument (g) as GraphContext and use context information from the object instead. warnings.warn( Traceback (most recent call last): File "orttraining_test_ortmodule_torch_lightning_basic.py", line 101, in <module> main() File "orttraining_test_ortmodule_torch_lightning_basic.py", line 96, in main trainer = pl.Trainer(**kwargs) File "/home/onnxruntimedev/miniconda3/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 69, in insert_env_defaults return fn(self, **kwargs) TypeError: __init__() got an unexpected keyword argument 'gpus'
```
2023-03-17 13:26:58 +08:00
Christian Veenhuis
59dfcfdce7
Fix typos in sources: operater, tranform, neccessary, trainig (#14907)
### Description
While browsing the sources I found several typos here and there.
I collected them to a single PR and fixed them.
Namely these typos are: operater, tranform, neccessary, trainig.
After fixing none of them was found anymore:

$ git grep "operater"
$ git grep "tranform"
$ git grep "neccessary"
$ git grep "trainig"
$ 

### Motivation and Context
Since some of the typos are in example notebooks and markdown files,
users can see them.
2023-03-13 22:45:04 -07:00