onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-18 21:21:17 +00:00

Author	SHA1	Message	Date
zhijiang	4dc4470cc7	Fix fusion for two LayerNorm sharing same input but with different weights (#15919 ) In gpt_j_residual (https://arxiv.org/pdf/2204.06745.pdf), two LN nodes share the same input. ORT runs the CSE graph optimization before LN fusion, which modifies the LN graph pattern and causes LN fusion to fail. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/40990fd6-796f-4edf-be0b-3203e8503678)	2023-05-22 08:26:36 +08:00
Vrajang Parikh	5abaca9d69	add maybe unused attribute to vars only used for logging (#15970 ) ### Description Add the maybe_unused attribute to variables that are only used for logging. ### Motivation and Context Building ORT with training using Xcode 14.3 causes a `-Wunused-but-set-variable` error because some variables are created and used exclusively for debug logging. Adding maybe_unused suppresses warnings on unused variables when logging is disabled and fixes the local build.	2023-05-17 10:24:13 -07:00
PeixuanZuo	e96f10d27b	[ROCm] reduce batch size to fix CI error (#15714 ) The ROCm CI batch size test occasionally fails. Try reducing the batch size to fix it. Error log: Non-zero status code returned while running FusedMatMul node. Name:'MatMul_2914_Grad/FusedMatMul_0' Status Message: HIP error hipErrorNotFound:named symbol not found Non-zero status code returned while running Gemm node. Name:'MatMul_2891_Grad/Gemm_5' Status Message: HIP error hipErrorNotFound:named symbol not found	2023-05-16 13:10:02 +08:00
PeixuanZuo	af6cb2af87	[ROCm] update ROCm/MIGraphX CI to ROCm5.5 (#15905 ) Update ROCm/MIGraphX CI to ROCm 5.5. TODO: two PRs to fix failures in orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py - test_gradient_correctness_minmax/test_gradient_correctness_argmax_unfold/test_gradient_correctness_argmax_diagonal (https://github.com/microsoft/onnxruntime/pull/15903) - test_ortmodule_attribute_name_collision_warning (https://github.com/microsoft/onnxruntime/pull/15884)	2023-05-15 10:28:15 +08:00
pengwa	fed52053a7	Refine a bit (on device training) (#15803 ) ### Few minor refinements: - Simplify ParameterOptimizerState a bit - Use inlined containers - Remove GetStateDict APIs - Re-enable cuda test for lr scheduler	2023-05-10 20:36:13 -07:00
pengwa	003c7d3e4d	Add CPU allocation test for multiple GPU distributed run (#15829 ) ### Add CPU allocation test for non-CPU devices distributed run When the CUDA EP is enabled in distributed training, CPU memory is still used for some node outputs. We already had distributed run test coverage, but it did not cover the case where some nodes use CPU devices to store their tensor outputs. As a result, we hit regressions twice in the past months: - https://github.com/microsoft/onnxruntime/pull/14050 - https://github.com/microsoft/onnxruntime/pull/15823 This test is added to avoid future regressions. The test graph looks like this: ![image](https://user-images.githubusercontent.com/10530022/236594940-70c68a55-18bf-4e09-bbf5-8a64895d3045.png)	2023-05-09 10:27:19 +08:00
Changming Sun	34fcdd83c8	Update softmax_grad_impl.cu: add constexpr (#15794 ) ### Description Add a "constexpr" keyword to fix a static analysis warning	2023-05-04 08:10:17 -07:00
Baiju Meswani	2d519d21af	Python documentation for onnxruntime-training (#15765 )	2023-05-02 16:58:16 -07:00
Ashwini Khade	0ffae8073b	Creating Nuget and Android packages for Training (#15712 ) ### Description This PR creates Nuget and Android for Training. ### Motivation and Context These packages are intended to be released in ORT 1.15 to enable On-Device Training Scenarios. ## Packaging Story for Learning On The Edge Release ### Nuget Packages: 1. New Native package -> Microsoft.ML.OnnxRuntime.Training (Native package will contain binaries for: win-x86, win-x64, win-arm, win-arm64, linux-x64, linux-arm64, android) 2. C# bindings will be added to existing package -> Microsoft.ML.OnnxRuntime.Managed ### Android Package published to Maven: 1. New package for training (full build) -> onnxruntime-training-android-full-aar ### Python Package published to PyPi: 1. Python bindings and offline tooling will be added to the existing ort training package -> onnxruntime-training	2023-05-01 12:59:56 -07:00
Yuhong Guo	41dcf0d32e	Expose build information in dynamic lib (#15643 ) ### Description 1. Add Build Info API to onnx. 2. Fix compile error while building onnxruntime_benchmark on macOS. ### Motivation and Context 1. When the Onnxruntime lib is serving online, we need a way to detect how the lib was built. This PR lets developers retrieve build information, such as the git branch, git commit id, build type and cmake cxx flags, by running `strings` on the library, as shown below. ![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png) ![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png) If the build env has no git, there will be no git-related info: ![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png) 3. Fix the following compile error while building the benchmark on macOS. ![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png) --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-04-28 21:57:31 -07:00
pengwa	29d13cea42	Cumulative update on optimizers and tests (on-device training) (#15499 )	2023-04-28 09:55:39 -07:00
pengwa	2efb75bfe9	Fold shape related operation (#14936 ) ### Fold shape related operation at best efforts. This is a follow-up to PR https://github.com/microsoft/onnxruntime/pull/12561. It creates a specialized ShapeOptimizer that constant folds shape-related operations. On a best-effort basis, ShapeOptimizer constant folds the dim values that are known from shape inferencing. This simplifies the graph, which in turn helps other graph transformers do more. The transformer traverses the graph top-down and performs shape optimizations. It makes a best effort to constant fold the shape-related nodes that consume Shape node outputs: 1. Shape generates 1D tensor [12, 128, 512] (all dimensions have concrete dim values), which can be constant folded to an initializer holding the 1D tensor values [12, 128, 512]. (Some logic in ConstantFolding does the same thing.) 2. Shape generates 1D tensor [batch_size, 128, 512] -> Slice(start=1,end=3); we can constant fold the Shape->Slice to an initializer holding the 1D tensor values [128, 512]. 3. Shape generates 1D tensor [batch_size, 128, 512] -> Gather(axes=[0], index=[2]); we can constant fold the Shape->Gather to an initializer holding the 1D tensor values [512]. 4. Shape 15 takes an input of shape [batch_size, 128, 512], slicing from 1 to 2 (exclusive); we can constant fold the Shape15(start=1,end=2) to an initializer holding the 1D tensor values [128]. This helps clean up the graph; combined with ConstantFolding, the graph becomes much simpler. ### Motivation and Context One direct motivation is that we have a model subgraph like this: ![image](https://user-images.githubusercontent.com/10530022/223390243-47b13922-4340-4999-9637-f52a33f69a2d.png) The subgraph in the green rectangle is trying to get the value `30522`; with the changes in this PR, the subgraph will be constant folded. 
In addition, the ConstantFolding optimizer will further optimize out the subsequent `Squeeze`/`Unsqueeze`/`ConcatTraining`, leaving a very clean Reshape node whose shape input is a constant `[-1, 20522]`. With this simplified graph, our other compute optimizers can further optimize the graph by re-ordering gather/reshape nodes.	2023-04-27 18:59:28 +08:00
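The folding cases listed in the ShapeOptimizer entry above can be illustrated with a small sketch. This is not ORT's implementation: `fold_shape_slice` and `fold_shape_gather` are hypothetical helpers, and a shape is modeled here as a Python list in which an `int` is a concrete dimension and a `str` stands for a symbolic dimension such as `batch_size`.

```python
from typing import List, Optional, Union

# Assumption for this sketch: an int is a concrete dim, a str is a symbolic dim.
Dim = Union[int, str]


def fold_shape_slice(shape: List[Dim], start: int, end: int) -> Optional[List[int]]:
    """Return the constant for Shape -> Slice(start, end), or None if it cannot be folded."""
    selected = shape[start:end]
    if all(isinstance(d, int) for d in selected):
        return list(selected)
    return None  # a symbolic dim survives the slice, so there is nothing to fold


def fold_shape_gather(shape: List[Dim], index: int) -> Optional[List[int]]:
    """Return the constant for Shape -> Gather(index), or None if the dim is symbolic."""
    dim = shape[index]
    return [dim] if isinstance(dim, int) else None


shape = ["batch_size", 128, 512]
print(fold_shape_slice(shape, 1, 3))  # [128, 512]  (case 2)
print(fold_shape_gather(shape, 2))    # [512]       (case 3)
print(fold_shape_slice(shape, 0, 2))  # None: batch_size is not known statically
```

The point of the sketch is that folding only needs the selected dimensions to be concrete, not the whole shape, which is why a dynamic batch dimension does not block cases 2 to 4.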
Rui Ren	db6a9bc033	support latest deepspeed version for optim (#15682 ) ### Description Support the latest deepspeed 0.9.1 for the next release. ### Motivation and Context This avoids the warning message `Skip modifying optimizer because of unsupported DeepSpeed version` --------- Co-authored-by: ruiren <ruiren@microsoft.com>	2023-04-25 20:12:23 -07:00
Rui Ren	4c3e350a6a	fix ORTModuleONNXModelException fallback OOM (#15523 ) ### Description ### Error ``` RuntimeError: There was an error while exporting the PyTorch model to ONNX:- Traceback (most recent call last): File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string raise exception File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 385, in _get_exported_model torch.onnx.export(self._flattened_module, File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/__init__.py", line 305, in export return utils.export(model, args, f, export_params, verbose, training, File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 118, in export _export(model, args, f, export_params, verbose, training, input_names, output_names, File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 743, in _export proto, export_map, val_use_external_data_format = graph._export_onnx( RuntimeError: ONNX export failed: Couldn't export Python operator XDropout ``` The error leads to an Out of Memory issue, because the log.txt file grows to 26 GB. ### Motivation and Context The root cause is that in each `_forward` ``` if log_level <= _logger.LogLevel.WARNING and not self._raised_ORTModuleONNXModelException: warnings.warn( ( f"Fallback to PyTorch due to exception {type(self._exception)} was triggered. " "Report this issue with a minimal repro at https://www.github.com/microsoft/onnxruntime. 
" f"See details below:\n\n{_utils.get_exception_as_string(self._exception)}" ), UserWarning, ) ``` the code above is called and logs the `exception` through `get_exception_as_string`. In my training case, this writes the `Traceback` to stdout 40 k times along with 110 million lines of `onnx graph` output, and runs into OOM. ### Validation After the fix above, the log.txt file is only 2.4 MB. --------- Co-authored-by: ruiren <ruiren@microsoft.com>	2023-04-25 15:10:31 -07:00
Baiju Meswani	5885abfb35	Training Documentation (#15612 )	2023-04-25 11:44:12 -07:00
Wei-Sheng Chin	d0c3f92ec6	[DORT] Fix fake tensor problem cuased by PyTorch change (#15664 ) This should make `Orttraining Linux Lazy Tensor CI Pipeline` green again.	2023-04-25 19:56:42 +08:00
Ashwini Khade	124ea0a801	remove compute optimizer from lte (learning on the edge) builds (#15637 ) ### Description Removing compute optimizer from on device training builds. ### Motivation and Context 1. mitigate android build failures 2. reduce binary size Since only CPU EP is enabled for LTE builds, we can optimize the models offline.	2023-04-24 15:57:15 -07:00
Baiju Meswani	fd6ecc3909	Add env to the TrainingSession constructor (#15635 )	2023-04-21 21:05:46 -07:00
Baiju Meswani	b5a1941835	C, C++, Python, C# API update for on device training (#15518 )	2023-04-21 11:36:01 -07:00
Baiju Meswani	46210556f0	BatchnormInternal avoid setting num_channels if input shape is not known (#15544 )	2023-04-20 12:57:16 -07:00
Baiju Meswani	11b0a18de6	Add support for cuda 11.8 and python 3.11 for training (#15548 )	2023-04-20 12:56:45 -07:00
Justin Chu	831734a46e	Fix lint errors missed due to new commits (#15558 ) Follow up of #15524	2023-04-18 12:55:02 -07:00
Justin Chu	cf19c3697d	Run clang-format in CI (#15524 ) ### Description Run clang-format in CI. Formatted all c/c++, objective-c/c++ files. Excluded ``` 'onnxruntime/core/mlas/', 'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/', ``` because they contain assembly or are data heavy ### Motivation and Context Coding style consistency	2023-04-18 09:26:58 -07:00
liqun Fu	919d8f2660	update with onnx main (#14929 )	2023-04-18 08:42:51 -07:00
Justin Chu	a36caba073	Bump ruff in CI (#15533 ) ### Description Bump ruff version in CI and fixed new lint errors. - This change enables the flake8-implicit-str-concat rules which helps detect unintended string concatenations: https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc - Update gitignore to include common python files that we want to exclude. ### Motivation and Context Code quality	2023-04-17 10:11:44 -07:00
Wei-Sheng Chin	ac6ceffb2c	Force using fixed random seeds for flaky tests (#15515 ) Some gradient-related tests fail frequently due to their math properties. This PR fixes their random seed so that it's possible to debug in the future. Fixed [AB#14605](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14605), [AB#14604](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14604)	2023-04-14 18:44:51 -07:00
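The idea behind fixing seeds for flaky tests can be shown with a minimal sketch. It is only an illustration in plain Python, not the ORT test code: `make_inputs` is a hypothetical helper, and the seed values are arbitrary.

```python
import random
from typing import List


def make_inputs(seed: int, count: int = 4) -> List[float]:
    """Generate test inputs from a private generator seeded with a fixed value."""
    rng = random.Random(seed)  # a local generator, so other tests cannot disturb it
    return [rng.uniform(-1.0, 1.0) for _ in range(count)]


# The same seed always reproduces the same inputs, so a failure can be replayed.
first_run = make_inputs(seed=1234)
second_run = make_inputs(seed=1234)
print(first_run == second_run)
```

A fixed seed does not make a numerically fragile test correct; it only makes a failing input reproducible, which is the stated goal of the change.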
zhijiang	05ec22330f	softmax perf improvement pr2 - import softmax bw (#15199 ) When the softmax dimension is 2048, the original ORT code falls back to cuDNN, while with some optimizations to ORT's softmax_warp_backward we can be faster than the cuDNN implementation. The optimizations to softmax_warp_backward are: 1. instead of saving intermediate results in registers, recompute them to save resources 2. store the input data in fp16 instead of fp32 to save further resources. The perf numbers: ![image](https://user-images.githubusercontent.com/43435212/227476335-ae0b61c4-cd15-40b7-b743-a956fadaedda.png) Note that when the softmax dimension is less than 2048, nothing changes, so perf numbers are given only for the 2048 case. Additional perf numbers for smaller batch sizes: ![image](https://user-images.githubusercontent.com/43435212/231676120-c8944b09-a664-43f3-a1e8-dfe729c6e816.png)	2023-04-13 14:57:01 +08:00
mindest	67ac36101c	disable BatchNormalizationGrad test (#15485 ) ### Description Temporarily disable BatchNormalizationGrad test due to random failure. Example: ``` 2023-04-12T06:33:24.1593811Z 1: [ RUN ] GradientCheckerTest.BatchNormalizationGrad 2023-04-12T06:33:27.5603881Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1468): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance) 2023-04-12T06:33:27.5604509Z 1: Actual: false 2023-04-12T06:33:27.5604719Z 1: Expected: true 2023-04-12T06:33:27.5604997Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240; 2023-04-12T06:33:27.5605266Z 1: Google Test trace: 2023-04-12T06:33:27.5605531Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910 2023-04-12T06:33:27.5605843Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678 2023-04-12T06:33:27.5606478Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234 2023-04-12T06:33:27.8285560Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1493): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance) 2023-04-12T06:33:27.8286181Z 1: Actual: false 2023-04-12T06:33:27.8286404Z 1: Expected: true 2023-04-12T06:33:27.8286669Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240; 2023-04-12T06:33:27.8286942Z 1: Google Test trace: 2023-04-12T06:33:27.8287208Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910 2023-04-12T06:33:27.8287532Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678 2023-04-12T06:33:27.8287849Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234 2023-04-12T06:33:51.6368960Z 1: [ FAILED ] 
GradientCheckerTest.BatchNormalizationGrad (27475 ms) ```	2023-04-13 14:53:47 +08:00
pengwa	516c8e95fa	Optimize SCE loss compute (#15401 ) ### Optimize SCE loss compute Compute optimization based on label data sparsity: - Insert ShrunkenGather before SCELoss node, to filter out invalid labels for compute. - Support ShrunkenGather upstream. - Added test for the above. - Added flag to enable label sparsity optimization with env var, by default disabled now. Will enable after comprehensive benchmarking later. - Extract common logic into test_optimizer_utils.h/cc from core/optimizer/compute_optimzier_test.cc, then the common functions can be shared by both core/optimizer/compute_optimzier_test.cc and orttraining/core/optimizer/compute_optimzier_test.cc - Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and `Create1DInitializerFromVector`	2023-04-13 13:02:12 +08:00
zhijiang	29c74d3c43	softmax perf improvement pr1 - add more softmax related test (#15176 ) 1. add fp16 test 2. add test for shape is not power of two.	2023-04-11 17:02:40 +08:00
Changming Sun	d175e87a1f	Delete eager mode code and increase minimal required python version to 3.8 (#15450 ) ### Description 1. Delete eager mode code. 2. Increase the minimal required python version to 3.8.	2023-04-10 16:00:04 -07:00
Pranav Prakash	3c5d02a9ce	Implement BatchNormGradient kernel for CPU EP (#7622 ) Description: Register an implementation for BatchNormInternal and add a CPU kernel for BatchNormGradient. This is the third in a series of PRs to implement BN training on CPU (first was #6946, second was #7539). Motivation and Context Support training networks with BatchNorm (e.g. convnets). Also note that there exists a CUDA kernel for BN (forward training & backwards) but it's currently disabled due to flaky failures; someone more familiar with those parts can register the implementation for BNInternal on CUDA (gradient kernel doesn't have to change). --------- Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu> Co-authored-by: mindest <linminuser@gmail.com> Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>	2023-04-08 09:20:26 +08:00
Rui Ren	5e2f46df2b	update deepspeed version 0.8.3 (#15415 ) ### Description Update the supported deepspeed version to 0.8.3, as it is the latest version. ### Motivation and Context This fixes the error `Skip modifying optimizer because of unsupported DeepSpeed version` Co-authored-by: ruiren <ruiren@microsoft.com>	2023-04-07 17:59:50 -07:00
pengwa	16f5909f2d	Introduce shrunken gather operator (#15396 ) ### Introduce shrunken gather operator The existing Gather operator schema does not guarantee that the output element count is smaller than the input element count; the output element count can be >, =, or < the input element count. For cases where we know for sure that the output element count MUST be <= the input element count, we will upstream those Gather operators to reduce compute FLOPs. So this PR introduces a ShrunkenGather, which explicitly guarantees that the output count will be smaller than the input count. The operator adds an additional restriction on inputs, but still re-uses the existing Gather implementation plus an input check at runtime. This is a requirement for the subsequent optimization (Draft PR: https://github.com/microsoft/onnxruntime/pull/15401) we will do for label sparsity and embedding sparsity.	2023-04-07 15:12:58 +08:00
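The contract described for ShrunkenGather, an ordinary gather plus a runtime input check, can be sketched as follows. This is a hedged illustration rather than the ORT kernel: `shrunken_gather` is a hypothetical function, it only handles gathering along axis 0 of a 1D sequence, and it assumes the check is "output count <= input count", which is one reading of the entry above.

```python
from typing import List, Sequence


def shrunken_gather(data: Sequence[int], indices: List[int]) -> List[int]:
    """Gather along axis 0, rejecting index lists that would grow the output."""
    # Assumed runtime check: a plain Gather would accept any number of indices.
    if len(indices) > len(data):
        raise ValueError(
            f"ShrunkenGather expects at most {len(data)} indices, got {len(indices)}"
        )
    return [data[i] for i in indices]


labels = [7, -100, 3, -100]  # -100 is used here as a placeholder for an ignored label
valid_positions = [i for i, label in enumerate(labels) if label != -100]
print(shrunken_gather(labels, valid_positions))  # [7, 3]
```

Because the output can never be larger than the input, moving such a node earlier in the graph can only reduce the amount of data that downstream nodes process, which is what makes the upstreaming safe.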
Thuy Dao	6e1e808ec8	fix error unqualified call to 'std::move' (#15347 )	2023-04-05 20:40:30 -07:00
pengwa	fe0db63dee	Upstream reshape of merging batch/sequence (#15023 ) ### Upstream reshape of merging batch/sequence For a Reshape node that fulfills the following requirements: - input data rank = 3 - input shape is a constant initializer, and the untouched dim value MUST be a constant value. - Reshape merges the first two dimensions, so output data rank = 2. We upstream it so that it runs as early as possible. Doing this allows us to upstream other operators (Gather) that are blocked by this kind of Reshape node. Currently, it is not enabled in graph_transformer_utils, since the combined upstream gather changes are not ready yet. Before: ![image](https://user-images.githubusercontent.com/10530022/224698252-f9705082-9710-4385-95ec-f1ccf50dc0e3.png) After: ![image](https://user-images.githubusercontent.com/10530022/224698381-7e124d0d-ba47-4f35-8e37-6015014cd1c4.png)	2023-04-05 18:51:07 +08:00
Baiju Meswani	6b755debbc	Miscellaneous updates to training artifact generation (#15315 )	2023-04-04 20:09:51 -07:00
Nhat Nguyen	198994d01d	Register PytorchAtenDomain in RegisterOrtOpSchemas (#14567 )	2023-04-04 17:34:13 -07:00
pengwa	5baf5f506b	log level control + fix typos (#15302 ) ### log level control + fix typos	2023-04-04 20:19:13 +08:00
Baiju Meswani	e870089ca8	Refining the offline tooling for training artifact generation (#15212 )	2023-03-30 18:05:51 -07:00
Justin Chu	710d095124	Refactor the constant `_ONE` in `orttraining_test_ortmodule_api.py` (#15128 ) Follow up of https://github.com/microsoft/onnxruntime/pull/15097#discussion_r1142399537	2023-03-28 08:59:51 -07:00
Justin Chu	938e2136c6	Enable pylint and numpy rules (#15218 ) ### Description Enable pylint and numpy rules ### Motivation and Context Modernize numpy usage and enable more quality checks	2023-03-27 20:37:53 -07:00
Justin Chu	d834ec895a	Adopt linrtunner as the linting tool - take 2 (#15085 ) ### Description `lintrunner` is a linter runner successfully used by pytorch, onnx and onnx-script. It provides a uniform experience running linters locally and in CI. It supports all major dev systems: Windows, Linux and macOS. The checks are enforced by the `Python format` workflow. This PR adopts `lintrunner` in onnxruntime and fixes ~2000 flake8 errors in Python code. `lintrunner` now runs all required python lints including `ruff` (replacing `flake8`), `black` and `isort`. Future lints like `clang-format` can be added. Most errors are auto-fixed by `ruff` and the fixes should be considered robust. Lints that are more complicated to fix are marked `# noqa` for now and should be fixed in follow-up PRs. ### Notable changes 1. This PR removed some suboptimal patterns: - `not xxx in` -> `xxx not in` membership checks - bare excepts (`except:` -> `except Exception`) - unused imports The follow-up PR will remove: - `import *` - mutable values as default in function definitions (`def func(a=[])`) - more unused imports - unused local variables 2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than flake8 and is more robust. We are using it successfully in onnx and onnx-script. It also supports auto-fixing many flake8 errors. 3. Removed the legacy flake8 ci flow and updated docs. 4. The added workflow supports SARIF code scanning reports on github, example snapshot: ![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png) 5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant ### Motivation and Context Unified linting experience in CI and local. Replacing https://github.com/microsoft/onnxruntime/pull/14306 --------- Signed-off-by: Justin Chu <justinchu@microsoft.com>	2023-03-24 15:29:03 -07:00
PeixuanZuo	56bccac35d	[ROCm] update bert-L convergence reference file to fix CI (#15200 ) The layernorm change altered the bert-L convergence results.	2023-03-24 21:43:44 +08:00
pengwa	1d32285536	Statistics tool for ORTModule convergence parity (#15020 ) ### Statistics tool for ORTModule convergence parity As ORTModule gets more and more validated, it is pretty fast to integrate a PyTorch-based model with ORT. At the same time, we need to make sure that when a convergence issue occurs, we don't spend months investigating it. As part of this effort, this PR introduces a tool to dump activation statistics without much involvement from users. The dumped results contain only some statistics plus sampled data, so they are small; compared with dumping all the tensors, this is much faster and more space efficient. To use it, two lines are needed before wrapping the model with ORTModule. The baseline run needs the same change. ``` + from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber + SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)]) ``` After running the training steps, the following command merges the results into per-step summaries for the ORT and baseline runs respectively. ```bash python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output ``` Docs are added as part of this PR: [convergence investigation notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md) Based on the generated merged files, we can compare them with tools. ![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png) ### Design and Implementation This PR introduces a common mechanism for registering custom logic as nn.Module post-forward hooks. Activation statistics (StatisticsSubscriber) is one of the implementations. If there are other needs, we can define another XXSubscriber to do the customized things.	2023-03-23 20:34:24 +08:00
pengwa	7bec80d92a	Fix reference count for autograd.Function (#15121 ) ### Fix reference count for autograd When the PythonOp kernel is initialized, `AddPointerScalarArgs` creates `const_args_`, which holds all non-tensor references (including ProcessGroup, string, or other user types). In the kernel's destructor, the ref count is decreased for every entry in `const_args_`. ``` void PythonOpBase::Clear() { for (auto ptr : const_args_) { auto obj = reinterpret_cast<PyObject*>(ptr); Py_DECREF(obj); } } ``` This means we never increased the count, but we do decrease it. Running the unit test then throws a segmentation fault. The simple fix is to remove the Py_DECREF triggered by the kernel destructor for those pointer-type constant inputs. NONTENSOR_OBJECT_POINTER_STORE is the place where we increase the reference during export; the reference then remains until the python program terminates. Additional tunings: 1. Move some logs to verbose instead of warning to avoid flooding the training logs. 2. Move pointer-type ref holding from the python side (NONTENSOR_OBJECT_POINTER_STORE) to orttraining/orttraining/core/framework/torch/custom_function_register.h. This gives a consistent approach to managing ref count increases and decreases for all PythonOp-related python objects/methods.	2023-03-23 12:51:50 +08:00
Baiju Meswani	0086f7590d	LSTM and LSTM gradient implementation for training (#15034 )	2023-03-21 21:44:08 -07:00
Justin Chu	bdd7bd084c	Remove the use of eval in test code (#15097 ) ### Description Remove the use of `eval` in test code so we don't (1) use eval and (2) create "unused" local vars that ruff will remove. Predecessor to #15085	2023-03-20 09:43:56 -07:00
pengwa	1ccb79476c	Fix training gpu ci related to pl upgrade (#15092 ) ### Fix training gpu ci related to pl upgrade A new version of pln was released, and the old `gpus` parameter of pln.Trainer appears to be no longer supported. So we switch to the new params to make it work. ``` ['/home/onnxruntimedev/miniconda3/bin/python3', 'orttraining_test_ortmodule_torch_lightning_basic.py', '--train-steps=470', '--epochs=2', '--batch-size=256', '--data-dir', '/mnist'] /home/onnxruntimedev/miniconda3/lib/python3.8/site-packages/torch/onnx/utils.py:1794: FutureWarning: The first argument to symbolic functions is deprecated in 1.13 and will be removed in the future. Please annotate treat the first argument (g) as GraphContext and use context information from the object instead. warnings.warn( Traceback (most recent call last): File "orttraining_test_ortmodule_torch_lightning_basic.py", line 101, in <module> main() File "orttraining_test_ortmodule_torch_lightning_basic.py", line 96, in main trainer = pl.Trainer(kwargs) File "/home/onnxruntimedev/miniconda3/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 69, in insert_env_defaults return fn(self, kwargs) TypeError: __init__() got an unexpected keyword argument 'gpus' ```	2023-03-17 13:26:58 +08:00
Christian Veenhuis	59dfcfdce7	Fix typos in sources: operater, tranform, neccessary, trainig (#14907 ) ### Description While browsing the sources I found several typos here and there. I collected them to a single PR and fixed them. Namely these typos are: operater, tranform, neccessary, trainig. After fixing none of them was found anymore: $ git grep "operater" $ git grep "tranform" $ git grep "neccessary" $ git grep "trainig" $ ### Motivation and Context Since some of the typos are in example notebooks and markdown files, users can see them.	2023-03-13 22:45:04 -07:00


1259 commits