Commit graph

91 commits

Author SHA1 Message Date
Xueyun Zhu
e8e95110d3
add pipeline to distributed context config (#3789)
* add pipeline to distributed context

* white space
2020-05-01 13:49:51 -07:00
edgchen1
047975e404
Address flaky test ReduceApiTest.Sum. (#3716)
Increase test comparison tolerance. Add output of random seed value for easier debugging later. Unify RandomValueGenerator::Uniform() to consistently use [min, max) interval.
2020-05-01 09:18:26 -07:00
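Note on #3716 above: the commit message names three ideas (a half-open [min, max) sampling interval, printing the random seed, and a looser comparison tolerance). The following is a minimal Python sketch of those ideas only, not the C++ code in the repository. The names `uniform_half_open` and `sums_match`, the seed value, and the tolerance values are all assumptions made for illustration.

```python
import math
import random


def uniform_half_open(rng, lo, hi):
    """Draw a float from [lo, hi): lo can be returned, hi cannot."""
    if not lo < hi:
        raise ValueError("need lo < hi")
    while True:
        # rng.random() is in [0.0, 1.0); floating-point rounding of the scaled
        # value can still land exactly on hi, so redraw in that rare case.
        x = lo + (hi - lo) * rng.random()
        if x < hi:
            return x


def sums_match(expected, actual, rel_tol=1e-5, abs_tol=1e-8):
    """Compare two reduction results with a tolerance (values are illustrative)."""
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)


seed = 1234  # hypothetical fixed seed; printing it lets a failing run be replayed
print(f"test random seed: {seed}")
rng = random.Random(seed)
values = [uniform_half_open(rng, -1.0, 1.0) for _ in range(1000)]
```

Comparing a naive `sum(values)` against `math.fsum(values)` with `sums_match` shows why a reduction test needs a tolerance: the two differ only by accumulated rounding error.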
pengwa
98b97be635
collect the last few iteration latency for throughput calculation (#3766)
2020-05-01 13:24:17 +08:00
liqunfu
af3988198c
Liqun/e2e transformer test (#3540)
* initial change to transformer.py

* prepare e2e transformer tests

* refactor transformer tests

* put test python files in a flat folder

* fix typo pip install transform(s)

* python 3.6

* python version to 3.6 in install_ubuntu.sh

* remove argparser

* to use opset ver 12

* workaround loss_scale naming patch in case of loss_fn_

* assign self.loss_fn_ so it can be checked

* skip a few un-needed post-process steps

* fix loss_scale_input_name, clean up post process steps

* skip non-frontend tests

* move cpu/cuda related files to corresponding cpu/cuda folder (#3668)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* type cast for ratio is not necessary for dropout (#3682)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* thrustallocator is not needed since cub is used directly for gather now. (#3683)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* GatherND-12 Implementation (#3645)

* Renamed, UT passing

* Move GatherND CUDA Kernel into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indices, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>

* update with reviewers' comments

* testBertTrainingGradientAccumulation was not using rtol and may fail occasionally with small (e-06) difference

* fix merge mistakes

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
2020-04-30 12:26:38 -07:00
M. Zeeshan Siddiqui
b9a5ed1fe2
Add SoftmaxCrossEntropyLoss to mixed-precision-transformer. (#3760)
2020-04-30 02:48:21 -07:00
pengwa
0531acccc5
Refine GatherND CPU/CUDA Kernels & Add UTs (#3688)
* Refactor GatherND CPU Kernel (Renaming & Simplify)

* Add batch_dim=1 or 2, negative slices tests

* Rename gather_nd_gard_impl.cu

* Use dispatcher to refactor CUDA GatherND/GatherNDGrad

* Change GatherNDBase::CommonComputeKernel --> GatherNDBase::PrepareCompute

* Use HasCudaEnvironment instead of __CUDA_ARCH__ for some double type tests
2020-04-30 10:17:54 +08:00
ashbhandare
58f53966d3
Add Distributed Checkpointing support (#3639)
* Change naming of moments to Moment_x_<weight_name>

* Add checkpointing code and zero checkpoint aggregation

* Correct aggregation for LAMB, cleanup

* Add simple checkpointing test

* Add test for zero checkpoint aggregation

* Fix tests

* fix test

* Review changes

* Fix test after review comment fix

* Fix API, test

* Fix test after API change

* Decouple save load from ORTTrainer

* Add flag to not break checkpointing with ORTModel

Co-authored-by: aishwarya bhandare <aibhanda@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-29 14:52:21 -07:00
suffiank
ea0e2d1dde
fix warning treated as error due to ignoring return status (#3739)
Co-authored-by: suffian khan <sukha@microsoft.com>
2020-04-29 02:38:53 -07:00
Tixxx
0638565fe0
Fix evaluation issues (#3538)
* allow switching between eval and training modes dynamically

Co-authored-by: Tixxx <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
2020-04-28 21:03:37 -07:00
M. Zeeshan Siddiqui
939589c265
Fix flaky test and avoid divide by zero in SoftmaxCrossEntropyLoss-CPU. (#3734)
* Fix flaky test and avoid divide by zero in SoftmaxCrossEntropyLoss-CPU.

* fix gather test?

* PR feedback.
2020-04-28 19:35:14 -07:00
edgchen1
1bcfd49918
Merge pull request #3731 from microsoft/ettao/ort-2-master
Merge from ort_training to master
2020-04-28 07:56:05 -07:00
ytaous
75c24a5fac
Revert "Merge from ort_training to master (#3719)" (#3726)
This reverts commit b990ba0059.
2020-04-27 20:42:43 -07:00
ytaous
b990ba0059
Merge from ort_training to master (#3719)
* move cpu/cuda related files to corresponding cpu/cuda folder (#3668)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* type cast for ratio is not necessary for dropout (#3682)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* thrustallocator is not needed since cub is used directly for gather now. (#3683)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* GatherND-12 Implementation (#3645)

* Renamed, UT passing

* Move GatherND CUDA Kernel into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indices, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>

* Set gradient as output only for easy mode (#3694)

* Support GPU Event Operators (#3653)

* Add GPU event operators to support in-place updates in
gradient accumulator and optimizer for modifying the tensors
passing through those event operators.

* Address comment and polish code

* Merge shared code between CPU and GPU kernels

* Move event test to a new file

* Address comments

* Update onnxruntime/core/providers/cuda/gpu_data_transfer.cc

* fix path of cpu_featurizers_kernels.cc and cpu_featurizers_kernels.h

Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
Co-authored-by: ashbhandare <ash.bhandare@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-04-27 16:45:21 -07:00
Sherlock
635bc9cd04
Fix graph transformers to support opset 12 ops (#3715)
2020-04-27 11:53:45 -07:00
Ethan Tao
0516e7d22e
Merge branch 'ort_public_ort_training' into ettao/ort-2-master
2020-04-27 18:17:17 +00:00
Wei-Sheng Chin
72b38f0a8b
Support GPU Event Operators (#3653)
* Add GPU event operators to support in-place updates in
gradient accumulator and optimizer for modifying the tensors
passing through those event operators.

* Address comment and polish code

* Merge shared code between CPU and GPU kernels

* Move event test to a new file

* Address comments

* Update onnxruntime/core/providers/cuda/gpu_data_transfer.cc
2020-04-24 17:43:04 -07:00
edgchen1
8b5d6fbaf5
Remove internal work item links. (#3698)
2020-04-24 15:38:30 -07:00
ashbhandare
d06763ac1c
Set gradient as output only for easy mode (#3694)
2020-04-24 15:28:28 -07:00
Sherlock
b4d4ea2e5f
GatherND-12 Implementation (#3645)
* Renamed, UT passing

* Move GatherND CUDA Kernel into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indices, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
2020-04-24 20:55:30 +08:00
Weixing Zhang
2f8a17dcde
thrustallocator is not needed since cub is used directly for gather now. (#3683)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 01:51:54 -07:00
Weixing Zhang
c929963d74
type cast for ratio is not necessary for dropout (#3682)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 00:49:37 -07:00
Weixing Zhang
f4a04c04e1
move cpu/cuda related files to corresponding cpu/cuda folder (#3668)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 00:12:02 -07:00
Weixing Zhang
336624806e
Simplify and clean code (#3655)
1. It is not necessary to include cudnn_common.h for kernels which are not implemented with CUDNN.
2. Minor change in layer norm kernel to simplify the code and resolve building warning.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-23 10:12:55 -07:00
XiaocenDong
125f68f305
fixed mnist bug (#3569)
* fixed mnist bug

* fixed train_step param
2020-04-23 23:22:38 +08:00
Xueyun Zhu
f1ba9aaf34
Add pipeline transformer for wait/record node (#3513)
* pipeline transformer

* clean up

* address feedback

* add record/wait for first stage and updated split script

* address feedback

* make recv/send signal as initializer

* merge

* address feedback

* unify input and initializer

* address feedback and bug fix

* minor fix

* windows build

* fix
2020-04-22 23:28:01 -07:00
pengwa
6136fd0789
GatherElementsGrad Kernels (#3627)
* GatherElementsGrad cuda kernel & tests

* Fix comments

* Fix include path
2020-04-23 14:02:34 +08:00
Vincent Wang
ffe19ae49b
Expand elimination and Expand gradient. (#3610)
* Expand elimination and Expand gradient.

* Resolve comments.

* Fix test break.

* Check if graph can remove the node.

* Resolve comment.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-04-23 13:17:15 +08:00
Tang, Cheng
37f4f74308
expose training session so the training app could register custom kernel and transformers (#3642)
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2020-04-22 21:35:41 -07:00
suffiank
0e12d05cd2
fixes for ort_trainer.py to resume from checkpoint (#3510)
* fixes for ort_trainer.py to resume from checkpoint

* define self.state_dict_ during init

* add comment of explanation

* add unit test for restore from checkpoint

* fix file not found

Co-authored-by: suffian khan <sukha@microsoft.com>
2020-04-22 16:33:58 -07:00
Weixing Zhang
e4fc83252d
Refactoring code related to WARP_SIZE. (#3623)
1. Centralize its definition in common.cuh.
2. Rename it to GPU_WARP_SIZE which can be extended to AMD GPU later.
3. Centralize warp shuffle functions.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-22 15:19:06 -07:00
edgchen1
bb9b0ba5b3
Merge pull request #3607 from microsoft/edgchen1/merge_from_master
Merge from master to ort_training
2020-04-22 13:22:32 -07:00
Wei-Sheng Chin
ab70625b29
Add Lamb shape inference (#3634)
2020-04-22 11:32:28 -07:00
Edward Chen
8d09cefafc
Merge remote-tracking branch 'origin/ort_training' into edgchen1/merge_from_master
2020-04-22 16:56:15 +00:00
edgchen1
b518cb2a7a
Clean up OPTIONAL name conflict workarounds in ort_training. (#3622)
* Clean up OPTIONAL name conflict workarounds.

* Clean up unnecessary header files onnx_protobuf.h

Co-authored-by: Sherlock Huang
2020-04-22 09:07:55 -07:00
Vincent Wang
d3a2ac5c5c
Eliminate Useless Cast during Transformer. (#3606)
* Remove Useless Cast during Transformer.

* Resolve comments.

* Check if graph can remove the node.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-22 16:36:46 +08:00
Sherlock
d66d5bb86a
Update Optimizer Domain and Opset (#3602)
* Update Domain and Opset for SGD

* Update Adam Domain and Opset

* Update Lamb Domain and Opset
2020-04-21 15:06:02 -07:00
Edward Chen
2e4b9b1d0e
Disable CudaKernelTest.SoftmaxCrossEntropyLoss_LargeSizeTensor because it's flaky.
2020-04-21 20:30:45 +00:00
Edward Chen
d50c3e7a71
Fix GraphTransformationTests tests.
2020-04-21 18:43:49 +00:00
Edward Chen
daa14b64e3
Merge remote-tracking branch 'origin/master' into edgchen1/merge_from_master
2020-04-21 03:31:32 +00:00
liqunfu
781e1c36be
Add front-end MNIST test (#3231)
* add frontend MNIST test

* to use torch nightly with torchvision

* remove incorrect comment per reviewer's comment

* experiment torchvision import failure

* experiment install_deps.sh

* more experiment install_deps.sh

* experiment install_deps.sh with --upgrade

* Experiment with install_deps.sh.

* Experiment with install_ubuntu.sh.

* Use Ubuntu 18.04 and Python 3.6 for CI.

* Update cmake version for CI.

* Install MPI on Ubuntu 18.04 for CI.

* Increase tolerance for MNIST test.

* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.

* Clean-up.

* Update ort_trainer.py from ort_training.

* Get default Ubuntu Python ver back to 3.5.

* Add underscore to opset_version parameter name in ORTTrainer constructor.

* Move loss/model wrap before the call for sample output.

* Update expected values for MNIST test.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
2020-04-20 11:19:31 -07:00
edgchen1
811bd67872
Clean up docs. (#3579)
* Fix orttraining/README.md formatting.

* Delete ORT_TRAINING_BUILDS.md.

* Fix typo.
2020-04-17 22:13:11 -07:00
edgchen1
2cb8cb816f
Disable or update flaky tests, improve test random seed accessibility. (#3495)
- Add output of test random seed
- Allow setting of test random seed with environment variable
- Disable / relax tolerance for flaky tests
2020-04-17 15:57:32 -07:00
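Note on #3495 above: the commit describes printing the test random seed and allowing it to be set through an environment variable. The following is a minimal Python sketch of that pattern, not the repository's implementation. The variable name `EXAMPLE_TEST_RANDOM_SEED` and the function `resolve_test_seed` are hypothetical; the commit message does not name the actual variable.

```python
import os
import random

# Hypothetical variable name, chosen for this sketch only.
SEED_ENV_VAR = "EXAMPLE_TEST_RANDOM_SEED"


def resolve_test_seed(environ=None):
    """Return the seed from the environment if set, else a fresh random one."""
    environ = os.environ if environ is None else environ
    raw = environ.get(SEED_ENV_VAR)
    if raw is not None:
        seed = int(raw)  # explicit override, used to replay a failing run
    else:
        seed = random.SystemRandom().randrange(2**32)
    # Always print the seed so a flaky failure can be reproduced later.
    print(f"test random seed: {seed}")
    return seed


rng = random.Random(resolve_test_seed())
```

Passing a plain dictionary as `environ` keeps the function easy to exercise without touching the real process environment.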
manashgoswami
9fc2b6482b
Ort training README (#3404)
Added README for ORT Training
2020-04-16 14:51:33 -07:00
M. Zeeshan Siddiqui
6c1ccb659f
SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. (#3465)
* Update ONNX submodule commit to the latest.

* build break.

* SoftmaxCrossEntropyLoss: Forward and backward kernel implementation.

* Revert "build break."

This reverts commit 847cb50d294efbe6c09fa760e7cacf25bfb6146d.

* Add more tests and misc clean up.

* revert unintended changes.

* PR feedback.

* cleanup.

* PR feedback.
2020-04-16 12:27:07 -07:00
Jesse Benson
644bc05830
Add Python API to set random seed: onnxruntime.seed(<seed>)
2020-04-15 09:44:48 -07:00
pengwa
2c7c45076b
MaxBatchSize E2E Test (#3454)
* max batch size e2e test

* update test data snapshot
2020-04-15 09:50:44 +08:00
edgchen1
4fa88a0a23
Remove cast to OpKernelContextInternal to get threadpool and directly use OpKernelContext. (#3523)
2020-04-14 14:30:26 -07:00
Tixxx
06b63975c0
Fix fp16 type mismatch when graph output is an fp32-only node (#3411)
* verify output node before changing its type in mixed precision mode
2020-04-14 09:35:19 -07:00
edgchen1
ba7225f986
Update Graph SetInputs and SetOutputs for training (#3446)
Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction a bit ambiguous - i.e., even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.
2020-04-13 19:10:44 -07:00
M. Zeeshan Siddiqui
5d99f179b9
Merge pull request #3486 from microsoft/sedymche/merge_master_ort_training
Merge from master into ort_training
2020-04-13 10:55:36 -07:00