Commit graph

99 commits

Author SHA1 Message Date
Wei-Sheng Chin
0aeb383273
Support Pipeline in Training Runner (#3770) 2020-05-06 21:03:36 -07:00
Xueyun Zhu
0e59668c1b
add support for symbolic broadcast for Add/Sub/Mul (#3743)
* add support for symbolic broadcast

* fix comment

* address feedback
2020-05-06 10:40:57 -07:00
Bowen Bao
f7ff5a7aa1
Fix state_dict and save_as_onnx for training (#3774) 2020-05-05 11:47:46 -07:00
Changming Sun
bd78364411
Parallel all the activations ops (#3722)
1. Parallel all the activations ops.
2. Parallel the performance critical path of the LRN op, which makes the ONNX model zoo googlenet model runs 60% faster(latency reduced from 21ms to 13ms).
3. Make the Gemm-Activation fusion support with all the activations ops. Before this change, it only supports LeakyRelu/Relu/Sigmoid/Tanh.
4. Delete onnxruntime/test/framework/op_kernel_test.cc because the file is almost empty.
5. Remove the loggings in KernelRegistry::TryFindKernel, return Status with error message instead.
2020-05-05 01:18:17 -07:00
M. Zeeshan Siddiqui
a24c71af40
Update Dropout(12) forward kernel with training_mode input. (#3805)
* Update Dropout(12) forward and backward kernel with training_mode input.

* Revert deleted assert.

* clean up.

* PR feedback.
2020-05-04 20:05:42 -07:00
M. Zeeshan Siddiqui
6f95cdfa68
Use new cost based threadpool abstractions in CPU gradient operators. (#3807)
* Use ThreadPool abstractions instead of OpenMP.

* PR feedback.
2020-05-04 15:23:10 -07:00
Sherlock
2f8a2364c3
Fix loss function builder (#3801) 2020-05-04 10:41:15 -07:00
M. Zeeshan Siddiqui
4f9f6aedea
CUDA/CPU test for NegativeLogLikelihoodLoss(12) function based loss operator. (#3793) 2020-05-01 21:36:29 -07:00
Xueyun Zhu
e8e95110d3
add pipeline to distributed context config (#3789)
* add pipeline to distributed context

* white space
2020-05-01 13:49:51 -07:00
edgchen1
047975e404
Address flaky test ReduceApiTest.Sum. (#3716)
Increase test comparison tolerance. Add output of random seed value for easier debugging later. Unify RandomValueGenerator::Uniform() to consistently use [min, max) interval.
2020-05-01 09:18:26 -07:00
pengwa
98b97be635
collect the last few iteration latency for throuput calculation (#3766) 2020-05-01 13:24:17 +08:00
liqunfu
af3988198c
Liqun/e2e transformer test (#3540)
* initial change to transformer.py

* prepare e2e transformer tests

* refactor transformer tests

* put test python files in a flat folder

* fix typo pip install transform(s)

* python 3.6

* python version to 3.6 in install_ubuntu.sh

* remove argparser

* to use opset ver 12

* workaround loss_scale naming patch in case of loss_fn_

* assign self.loss_fn_ so it can be checked

* skip a few un-needed post-process steps

* fix loss_scale_input_name, clean up post process steps

* skip non-frontend tests

* move cpu/cuda related files to coresponding cpu/cuda folder (#3668)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* type cast for ratio is not necessary for dropout (#3682)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* thrustallocator is not needed since cub is used directly for gather now. (#3683)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* GatherND-12 Implementation (#3645)

* Renamed, UT passing

* Move GatherND CUDA Kerenl into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indice, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>

* update with reviewers' comments

* testBertTrainingGradientAccumulation was not using rtol and may fail occasionally with small (e-06) difference

* fix merge mistakes

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
2020-04-30 12:26:38 -07:00
M. Zeeshan Siddiqui
b9a5ed1fe2
Add SoftmaxCrossEntropyLoss to mixed-precision-transformer. (#3760) 2020-04-30 02:48:21 -07:00
pengwa
0531acccc5
Refine GatherND CPU/CUDA Kernels & Add UTs (#3688)
* Refactor GatherND CPU Kernel (Renaming & Simplify)

* Add batch_dim=1 or 2, negative slices tests

* Rename gather_nd_gard_impl.cu

* Use dispatcher to refactor CUDA GatherND/GatherNDGrad

* Change GatherNDBase::CommonComputeKernel --> GatherNDBase::PrepareCompute

* Use HasCudaEnvironment instead of __CUDA_ARCH__ for some double type tests
2020-04-30 10:17:54 +08:00
ashbhandare
58f53966d3
Add Distributed Checkpointing support (#3639)
* Change naming of moments to Moment_x_<weight_name>

* Add checkpointing code and zero checkpoint aggregation

* Correct aggregation for LAMB, cleanup

* Add simple checkpointing test

* Add test for zero checkpoint aggregation

* Fix tests

* fix test

* Review changes

* Fix test after review comment fix

* Fix API, test

* Fix test after API change

* Decouple save load from ORTTrainer

* Add flag to not break checkpointing with ORTModel'

Co-authored-by: aishwarya bhandare <aibhanda@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-29 14:52:21 -07:00
suffiank
ea0e2d1dde
fix warning treated as error due to ignoring return status (#3739)
Co-authored-by: suffian khan <sukha@microsoft.com>
2020-04-29 02:38:53 -07:00
Tixxx
0638565fe0
Fix evaluation issues (#3538)
* allow switching between eval and training modes dynamically

Co-authored-by: Tixxx <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
2020-04-28 21:03:37 -07:00
M. Zeeshan Siddiqui
939589c265
Fix flaky test and avoid divide by zero in SoftmaxCrossEntropyLoss-CPU. (#3734)
* Fix flaky test and avoid divide by zero in SoftmaxCrossEntropyLoss-CPU.

* fix gather test?

* PR feedback.
2020-04-28 19:35:14 -07:00
edgchen1
1bcfd49918
Merge pull request #3731 from microsoft/ettao/ort-2-master
Merge from ort_training to master
2020-04-28 07:56:05 -07:00
ytaous
75c24a5fac
Revert "Merge from ort_training to master (#3719)" (#3726)
This reverts commit b990ba0059.
2020-04-27 20:42:43 -07:00
ytaous
b990ba0059
Merge from ort_training to master (#3719)
* move cpu/cuda related files to coresponding cpu/cuda folder (#3668)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* type cast for ratio is not necessary for dropout (#3682)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* thrustallocator is not needed since cub is used directly for gather now. (#3683)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* GatherND-12 Implementation (#3645)

* Renamed, UT passing

* Move GatherND CUDA Kerenl into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indice, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>

* Set gradient as output only for easy mode (#3694)

* Support GPU Event Operators (#3653)

* Add GPU event operators to support in-place updates in
gradient accumulator and optimizer for modifying the tensors
passing through those event operators.

* Address comment and polish code

* Merge shared code between CPU and GPU kernels

* Move event test to a new file

* Address comments

* Update onnxruntime/core/providers/cuda/gpu_data_transfer.cc

* fix path of cpu_featurizers_kernels.cc and cpu_featurizers_kernels.h

Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
Co-authored-by: ashbhandare <ash.bhandare@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-04-27 16:45:21 -07:00
Sherlock
635bc9cd04
Fix graph transformers to support opset 12 ops (#3715) 2020-04-27 11:53:45 -07:00
Ethan Tao
0516e7d22e Merge branch 'ort_public_ort_training' into ettao/ort-2-master 2020-04-27 18:17:17 +00:00
Wei-Sheng Chin
72b38f0a8b
Support GPU Event Operators (#3653)
* Add GPU event operators to support in-place updates in
gradient accumulator and optimizer for modifying the tensors
passing through those event operators.

* Address comment and polish code

* Merge shared code between CPU and GPU kernels

* Move event test to a new file

* Address comments

* Update onnxruntime/core/providers/cuda/gpu_data_transfer.cc
2020-04-24 17:43:04 -07:00
edgchen1
8b5d6fbaf5
Remove internal work item links. (#3698) 2020-04-24 15:38:30 -07:00
ashbhandare
d06763ac1c
Set gradient as output only for easy mode (#3694) 2020-04-24 15:28:28 -07:00
Sherlock
b4d4ea2e5f
GatherND-12 Implementation (#3645)
* Renamed, UT passing

* Move GatherND CUDA Kerenl into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indice, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
2020-04-24 20:55:30 +08:00
Weixing Zhang
2f8a17dcde
thrustallocator is not needed since cub is used directly for gather now. (#3683)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 01:51:54 -07:00
Weixing Zhang
c929963d74
type cast for ratio is not necessary for dropout (#3682)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 00:49:37 -07:00
Weixing Zhang
f4a04c04e1
move cpu/cuda related files to coresponding cpu/cuda folder (#3668)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 00:12:02 -07:00
Weixing Zhang
336624806e
Simplify and clean code (#3655)
1. It is not necessary to include cudnn_common.h for kernels which are not implemented with CUDNN.
2. Minor change in layer norm kernel to simplify the code and resolve building warning.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-23 10:12:55 -07:00
XiaocenDong
125f68f305
fixed mnist bug (#3569)
* fixed mnist bug

* fixed train_step param
2020-04-23 23:22:38 +08:00
Xueyun Zhu
f1ba9aaf34
Add pipeline transformer for wait/record node (#3513)
* pipeline transformer

* clean up

* address feedback

* add record/wait for first stage and updated split script

* address feedback

* make recv/send signal as initializer

* merge

* address feedback

* unify input and initializer

* address feedback and bug fix

* minor fix

* windows build

* fix
2020-04-22 23:28:01 -07:00
pengwa
6136fd0789
GatherElementsGrad Kernels (#3627)
* GatherElementsGrad cuda kernel & tests

* Fix comments

* Fix include path
2020-04-23 14:02:34 +08:00
Vincent Wang
ffe19ae49b
Expand elimination and Expand gradient. (#3610)
* Expand elmination and Expand gradient.

* Resolve comments.

* Fix test break.

* Check if graph can remove the node.

* Resolve comment.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-04-23 13:17:15 +08:00
Tang, Cheng
37f4f74308
expose training session so the training app could register custom kernel and transformers (#3642)
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2020-04-22 21:35:41 -07:00
suffiank
0e12d05cd2
fixes for ort_trainer.py to resume from checkpoint (#3510)
* fixes for ort_trainer.py to resume from checkpoint

* define self.state_dict_ during init

* add comment of explanation

* add unit test for restore from checkpoint

* fix file not found

Co-authored-by: suffian khan <sukha@microsoft.com>
2020-04-22 16:33:58 -07:00
Weixing Zhang
e4fc83252d
Refactoring code related to WARP_SIZE. (#3623)
1. Centralize its definition in common.cuh.
2. Rename it to GPU_WARP_SIZE which can be extended to AMD GPU later.
3. Centralize warp shuffle functions.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-22 15:19:06 -07:00
edgchen1
bb9b0ba5b3
Merge pull request #3607 from microsoft/edgchen1/merge_from_master
Merge from master to ort_training
2020-04-22 13:22:32 -07:00
Wei-Sheng Chin
ab70625b29
Add Lamb shape inference (#3634) 2020-04-22 11:32:28 -07:00
Edward Chen
8d09cefafc Merge remote-tracking branch 'origin/ort_training' into edgchen1/merge_from_master 2020-04-22 16:56:15 +00:00
edgchen1
b518cb2a7a
Clean up OPTIONAL name conflict workarounds in ort_training. (#3622)
* Clean up OPTIONAL name conflict workarounds.

* Cleanup unnecessory header files onnx_protobuf.h

Co-authored-by: Sherlock Huang
2020-04-22 09:07:55 -07:00
Vincent Wang
d3a2ac5c5c
Eliminate Useless Cast during Transformer. (#3606)
* Remove Useless Cast during Transformer.

* Resolve comments.

* Check if graph can remove the node.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-22 16:36:46 +08:00
Sherlock
d66d5bb86a
Update Optimizer Domain and Opset (#3602)
* Update Domain and Opset for SGD

* Update Adam Domain and Opset

* Update Lamb Domain and Opset
2020-04-21 15:06:02 -07:00
Edward Chen
2e4b9b1d0e Disable CudaKernelTest.SoftmaxCrossEntropyLoss_LargeSizeTensor because it's flaky. 2020-04-21 20:30:45 +00:00
Edward Chen
d50c3e7a71 Fix GraphTransformationTests tests. 2020-04-21 18:43:49 +00:00
Edward Chen
daa14b64e3 Merge remote-tracking branch 'origin/master' into edgchen1/merge_from_master 2020-04-21 03:31:32 +00:00
liqunfu
781e1c36be
Add front-end MNIST test (#3231)
* add frontend minst test

* to use torch nightly with torchvision

* remove incorrect comment per reviewer's comment

* experiment torchvision import failure

* experiment install_deps.sh

* more experiment install_deps.sh

* experiment install_deps.sh with --upgrade

* Experiment with install_deps.sh.

* Experiment with install_ubuntu.sh.

* Use Ubuntu 18.04 and Python 3.6 for CI.

* Update cmake version for CI.

* Install MPI on Ubuntu 18.04 for CI.

* Increase tolerance for MNIST test.

* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.

* Clean-up.

* Update ort_trainer.py from ort_training.

* Get default Ubuntu Python ver back to 3.5.

* Add underscore to opset_version parameter name in ORTTrainer constructor.

* Move loss/model wrap before the call for sample output.

* Update expected values for MNIST test.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
2020-04-20 11:19:31 -07:00
edgchen1
811bd67872
Clean up docs. (#3579)
* Fix orttraining/README.md formatting.

* Delete ORT_TRAINING_BUILDS.md.

* Fix typo.
2020-04-17 22:13:11 -07:00
edgchen1
2cb8cb816f
Disable or update flaky tests, improve test random seed accessibility. (#3495)
- Add output of test random seed
- Allow setting of test random seed with environment variable
- Disable / relax tolerance for flaky tests
2020-04-17 15:57:32 -07:00