Increase test comparison tolerance. Add output of random seed value for easier debugging later. Unify RandomValueGenerator::Uniform() to consistently use [min, max) interval.
* initial change to transformer.py
* prepare e2e transformer tests
* refactor transformer tests
* put test python files in a flat folder
* fix typo pip install transform(s)
* python 3.6
* python version to 3.6 in install_ubuntu.sh
* remove argparser
* to use opset ver 12
* workaround loss_scale naming patch in case of loss_fn_
* assign self.loss_fn_ so it can be checked
* skip a few un-needed post-process steps
* fix loss_scale_input_name, clean up post process steps
* skip non-frontend tests
* move cpu/cuda related files to coresponding cpu/cuda folder (#3668)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* type cast for ratio is not necessary for dropout (#3682)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* thrustallocator is not needed since cub is used directly for gather now. (#3683)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* GatherND-12 Implementation (#3645)
* Renamed, UT passing
* Move GatherND CUDA Kerenl into onnxruntime
* Merge GatherNDOpTest
* Refactor Test code
* Merge CPU Kernel Impl
* Handle Negative Indice, Fix UT
* Improve CUDA kernel to handle negative index
* Minor Fixes
* Preserve GatherND-1 Cuda kernel
* Fix Mac build
* fix UT
* Fix Build
* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
* update with reviewers' comments
* testBertTrainingGradientAccumulation was not using rtol and may fail occasionally with small (e-06) difference
* fix merge mistakes
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
* Refactor GatherND CPU Kernel (Renaming & Simplify)
* Add batch_dim=1 or 2, negative slices tests
* Rename gather_nd_gard_impl.cu
* Use dispatcher to refactor CUDA GatherND/GatherNDGrad
* Change GatherNDBase::CommonComputeKernel --> GatherNDBase::PrepareCompute
* Use HasCudaEnvironment instead of __CUDA_ARCH__ for some double type tests
* Change naming of moments to Moment_x_<weight_name>
* Add checkpointing code and zero checkpoint aggregation
* Correct aggregation for LAMB, cleanup
* Add simple checkpointing test
* Add test for zero checkpoint aggregation
* Fix tests
* fix test
* Review changes
* Fix test after review comment fix
* Fix API, test
* Fix test after API change
* Decouple save load from ORTTrainer
* Add flag to not break checkpointing with ORTModel'
Co-authored-by: aishwarya bhandare <aibhanda@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* allow switching between eval and training modes dynamically
Co-authored-by: Tixxx <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
* move cpu/cuda related files to coresponding cpu/cuda folder (#3668)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* type cast for ratio is not necessary for dropout (#3682)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* thrustallocator is not needed since cub is used directly for gather now. (#3683)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* GatherND-12 Implementation (#3645)
* Renamed, UT passing
* Move GatherND CUDA Kerenl into onnxruntime
* Merge GatherNDOpTest
* Refactor Test code
* Merge CPU Kernel Impl
* Handle Negative Indice, Fix UT
* Improve CUDA kernel to handle negative index
* Minor Fixes
* Preserve GatherND-1 Cuda kernel
* Fix Mac build
* fix UT
* Fix Build
* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
* Set gradient as output only for easy mode (#3694)
* Support GPU Event Operators (#3653)
* Add GPU event operators to support in-place updates in
gradient accumulator and optimizer for modifying the tensors
passing through those event operators.
* Address comment and polish code
* Merge shared code between CPU and GPU kernels
* Move event test to a new file
* Address comments
* Update onnxruntime/core/providers/cuda/gpu_data_transfer.cc
* fix path of cpu_featurizers_kernels.cc and cpu_featurizers_kernels.h
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
Co-authored-by: ashbhandare <ash.bhandare@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Add GPU event operators to support in-place updates in
gradient accumulator and optimizer for modifying the tensors
passing through those event operators.
* Address comment and polish code
* Merge shared code between CPU and GPU kernels
* Move event test to a new file
* Address comments
* Update onnxruntime/core/providers/cuda/gpu_data_transfer.cc
* Renamed, UT passing
* Move GatherND CUDA Kerenl into onnxruntime
* Merge GatherNDOpTest
* Refactor Test code
* Merge CPU Kernel Impl
* Handle Negative Indice, Fix UT
* Improve CUDA kernel to handle negative index
* Minor Fixes
* Preserve GatherND-1 Cuda kernel
* Fix Mac build
* fix UT
* Fix Build
* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
1. It is not necessary to include cudnn_common.h for kernels which are not implemented with CUDNN.
2. Minor change in layer norm kernel to simplify the code and resolve building warning.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* pipeline transformer
* clean up
* address feedback
* add record/wait for first stage and updated split script
* address feedback
* make recv/send signal as initializer
* merge
* address feedback
* unify input and initializer
* address feedback and bug fix
* minor fix
* windows build
* fix
* Expand elmination and Expand gradient.
* Resolve comments.
* Fix test break.
* Check if graph can remove the node.
* Resolve comment.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* fixes for ort_trainer.py to resume from checkpoint
* define self.state_dict_ during init
* add comment of explanation
* add unit test for restore from checkpoint
* fix file not found
Co-authored-by: suffian khan <sukha@microsoft.com>
1. Centralize its definition in common.cuh.
2. Rename it to GPU_WARP_SIZE which can be extended to AMD GPU later.
3. Centralize warp shuffle functions.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Remove Useless Cast during Transformer.
* Resolve comments.
* Check if graph can remove the node.
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* add frontend minst test
* to use torch nightly with torchvision
* remove incorrect comment per reviewer's comment
* experiment torchvision import failure
* experiment install_deps.sh
* more experiment install_deps.sh
* experiment install_deps.sh with --upgrade
* Experiment with install_deps.sh.
* Experiment with install_ubuntu.sh.
* Use Ubuntu 18.04 and Python 3.6 for CI.
* Update cmake version for CI.
* Install MPI on Ubuntu 18.04 for CI.
* Increase tolerance for MNIST test.
* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.
* Clean-up.
* Update ort_trainer.py from ort_training.
* Get default Ubuntu Python ver back to 3.5.
* Add underscore to opset_version parameter name in ORTTrainer constructor.
* Move loss/model wrap before the call for sample output.
* Update expected values for MNIST test.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction a bit ambiguous - i.e., even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.