Commit graph

372 commits

Author SHA1 Message Date
Thiago Crepaldi
77cefcd6c2 Perform forward pass using training graph with intermediate outputs 2020-12-15 09:03:07 -08:00
Thiago Crepaldi
11b69f141e Forward pass using InferenceSession on exported ONNX
Although the forward pass works, this approach cannot support the
backward pass because the intermediate tensors needed to compute
gradients are not available.

The next step is to export a training graph and split it manually.
2020-12-15 09:03:07 -08:00
Edward Chen
9810b9e02b
Reduce amount of compiled CUDA device code (#6118)
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.

Make corresponding changes for ROCM execution provider code.

Other minor cleanup.
2020-12-14 15:27:40 -08:00
liqunfu
cde723a136
Liqun/move nightly pl to linux multi gpu v100 (#6024)
* move e2e nightly pipeline to Azure DevOps
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-12-14 12:43:41 -08:00
baijumeswani
dd2e5a1a05
state_dict and load_state_dict for ORTTrainer (#6095)
* add functions state_dict and load_state_dict to ORTTrainer

* unit tests for state_dict and load_state_dict for ORTTrainer
2020-12-14 11:55:52 -08:00
Suffian Khan
6cb5d3ac09
Fix multi-tensor LAMB reduction to be deterministic (#6028)
* define ordering of reduction across blocks

* save state

* remove debug code

* remove debug code

* review comments

* significant correction for reduction only over blocks on same tensor

* addressing comments

* update rocm/lamb.cc to build as well

* remove times 2048*size in multitensor test until threshold error in rocm resolved

* convert tuple => struct as per recommendation

* update comment

* apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer

* remove excess template arguments from rocm lamb.cc launch_multitensor as well

* fixes for AMD build

* pr comments

* run formatter from vscode

* formatter on cuda files
2020-12-11 13:13:05 -08:00
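The determinism fix above rests on a general property: floating-point addition is not associative, so the order in which per-block partial sums are combined can change the result. The following is a minimal Python sketch of that idea, not the CUDA kernel from the PR; `reduce_in_order` and the sample values are hypothetical and chosen only to make the effect visible.

```python
def reduce_in_order(partials, order):
    """Sum per-block partial results, visiting blocks in the given order."""
    total = 0.0
    for block_index in order:
        total += partials[block_index]
    return total

# Hypothetical partial sums produced by four blocks working on one tensor.
partials = [1e16, 1.0, -1e16, 1.0]

# A fixed, ascending block order gives the same answer on every run.
fixed = reduce_in_order(partials, [0, 1, 2, 3])

# If blocks are combined in whatever order they finish, the answer can differ.
other = reduce_in_order(partials, [0, 2, 1, 3])

print(fixed, other)
```

Defining the reduction order up front trades a little scheduling freedom for bitwise reproducibility, which is the property the commit title asks for.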
Sherlock
a53f4dd379
Introduce VariadicAlias, remove hardcoded alias limits (#6106)
* Introduce VariadicAlias, remove hardcoded alias limits

* Include optional-lite in winml build

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-12-11 10:47:08 -08:00
Jesse Benson
38c49c2483 Make ROCM and CUDA reduction_all code more similar. 2020-12-11 09:35:07 -08:00
Vincent Wang
7ddeafdfcc
Add ReduceL2Grad and ClipGrad (#5970)
* ReduceL2Grad and ClipGrad.

* fix win build and amd ci pipeline

* resolve comments.

Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-12-10 11:03:26 +08:00
Sergii Dymchenko
9e26e59a37
Deprecate opsets <12 for training. (#6027) 2020-12-09 00:15:27 -08:00
Weixing Zhang
d95fc5e849
clean un-used code. (#6059)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-12-08 23:15:30 -08:00
Weixing Zhang
2705115732
add dockerfile for ROCm3.10 and update BUILD.md for ROCm EP (#5821)
* add HSA_NO_SCRATCH_RECLAIM=1 to dockerfile

This works around an issue in the AMD compiler, which generates poor GPU ISA when a kernel parameter is a structure passed by value

* update BUILD.md

* add dockerfile for rocm3.10
2020-12-08 23:14:56 -08:00
ashbhandare
b1a75d0e98
Enable passing initial optimizer state while creating training session (#5869)
* Support to pass initial optimizer states to optimizer graph builder

* Changes for passing init optim state to training session config

* Pass optimizer state through cpp and python frontend

* Cleanup

* Review comments

* Fix windows and mac CI

* Review comments

* review comments

* Review comments

* Frontend review changes

* Fix CI
2020-12-08 21:20:51 -05:00
Sherlock
7a43fa0028
Fix AllReduce kernel for contiguous buffer (#6064) 2020-12-08 15:55:13 -08:00
baijumeswani
523d187193
save data to and load data from an hdf5 file for checkpointing (#5975)
* save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary

* unit tests for saving data to and loading data from hdf5 file
2020-12-08 11:40:57 -08:00
ashbhandare
7cebf76a46
Improve checkpointing for Zero stage 1 (#5478)
* Initial running changes

* Checkpointing aggregation changes

* compare with older version

* initial cleanup

* Add zero test, minor fix

* Fix zero test, transform, formatting

* Review comments

* add more unit tests

* review comments

* Try fix CI

* Add additional check on just aggregation code

* Try fix ckpt gen

* Add pregenerated ckpt for CI, enable zero test in e2e

* Moving test to nightly, removing ckpt files

* Add tests to dist GPU CI

* Fix dist test

* Review comments

* Fix test
2020-12-07 09:16:01 -08:00
Jesse Benson
14f6eb14b1 Use __launch_bounds__ workaround, rather than limiting threads to 256 on AMD. 2020-12-03 13:06:34 -08:00
Jesse Benson
245d43615d Fix AMD multi-tensor implementation. 2020-12-03 13:06:34 -08:00
Sherlock
c86a1e5c13
Fix Flaky orttraining tests (#5977)
* Fix flaky orttraining tests
2020-12-03 10:24:25 -08:00
Alberto Magni
fb310fba0c
Avoid adding non-existent inputs to new Event nodes (#5915)
During graph resolve non-existent nodes cause shape-inference failures.
2020-12-01 08:21:05 -08:00
Jesse Benson
45966d878a Code review feedback 2020-11-30 09:24:22 -08:00
Jesse Benson
86e30a2db6 Update CUDA IsAllFinite kernel 2020-11-30 09:24:22 -08:00
Jesse Benson
bd96f60888 Use CUDA's IsAllFinite kernel for ROCm 2020-11-30 09:24:22 -08:00
baijumeswani
69b9368c93
Add unit tests to identify configuration migration scenarios for checkpointing (#5678) 2020-11-25 09:40:26 -08:00
baijumeswani
208f4c1d3c
Azure ci pipeline for distributed environment tests (#5881) 2020-11-23 14:01:00 -08:00
Vincent Wang
47185b9513
ReduceAllL2 CPU kernel (#5833)
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-11-19 10:20:05 +08:00
Tracy Sharpe
f964bb94ba
Add QLinearConv NHWC transformer (#5824)
The implementation of QLinearConv internally does a transpose(NHWC)->im2col+GEMM->transpose(NCHW). This adds a graph transformer to change a model to use a com.microsoft.QLinearConv that supports NHWC natively to avoid unnecessary transposes.
2020-11-17 20:51:02 -08:00
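To make the layout change in the QLinearConv commit above concrete, here is a small pure-Python sketch of the NCHW to NHWC transpose and its inverse on nested lists. It only illustrates the two layouts; it is not the graph transformer or the kernel, and the helper names `nchw_to_nhwc` and `nhwc_to_nchw` are hypothetical.

```python
def nchw_to_nhwc(x):
    """Reorder a nested list from [N][C][H][W] to [N][H][W][C]."""
    n, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[i][k][j][l] for k in range(c)]
              for l in range(w)]
             for j in range(h)]
            for i in range(n)]

def nhwc_to_nchw(y):
    """Reorder a nested list from [N][H][W][C] back to [N][C][H][W]."""
    n, h, w, c = len(y), len(y[0]), len(y[0][0]), len(y[0][0][0])
    return [[[[y[i][j][l][k] for l in range(w)]
              for j in range(h)]
             for k in range(c)]
            for i in range(n)]

# One image, two channels, height 1, width 2.
x = [[[[1, 2]], [[3, 4]]]]
y = nchw_to_nhwc(x)
print(y)
```

An operator that consumes and produces NHWC directly lets consecutive nodes skip this pair of transposes, which is the saving the commit message describes.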
Edward Chen
71e7c2b423
Cache build docker images in container registry. (#5811)
This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry.

Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date and building images every time wastes build agent resources.

With this change, a given build image can be looked up in a cache container registry and if present, pulled, and otherwise, built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository.

The cache container registry will need to be cleaned up periodically. This is not automated yet.
2020-11-17 17:02:24 -08:00
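The cache key described in the commit above (a hash digest of the dockerfile, the build context directory, and certain "docker build" options) could be computed roughly as follows. This is a sketch under assumptions, not the script from the PR: the function name `build_image_digest`, the choice of SHA-256, and the exact byte layout fed to the hash are all hypothetical.

```python
import hashlib
import os

def build_image_digest(dockerfile_path, context_dir, build_options):
    """Return a hex digest that changes when any build input changes."""
    h = hashlib.sha256()

    # 1. The dockerfile contents.
    with open(dockerfile_path, "rb") as f:
        h.update(f.read())
    h.update(b"\0")

    # 2. Every file in the build context, visited in a deterministic order.
    for root, dirs, files in os.walk(context_dir):
        dirs.sort()  # fix the traversal order of subdirectories
        for name in sorted(files):
            path = os.path.join(root, name)
            h.update(os.path.relpath(path, context_dir).encode("utf-8"))
            h.update(b"\0")
            with open(path, "rb") as f:
                h.update(f.read())
            h.update(b"\0")

    # 3. The build options that affect the resulting image.
    for option in build_options:
        h.update(option.encode("utf-8"))
        h.update(b"\0")

    return h.hexdigest()
```

Using the digest as part of the image tag means a lookup in the cache registry is a plain tag query: if the tag exists the image is pulled, otherwise it is built and pushed under that tag.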
zhijxu
89e5b3a24f resolve review comments 2020-11-16 11:23:01 +08:00
zhijxu
89902c2519 fix frontend bug.
An old ORT session may still exist when a new ORT session is created, which may cause an OOM error.
2020-11-16 11:23:01 +08:00
Jesse Benson
ced5b66306 Re-enable multi-tensor-apply for LAMB optimizer 2020-11-15 09:35:00 -08:00
Weixing Zhang
fc614ad050 revert the code change which was based on b4869926
The change b4869926, which removed the per-thread allocator, causes a
segfault in distributed training.

In addition, add dockerfile for ROCm3.9
2020-11-15 00:24:32 -08:00
Vincent Wang
0c8902cbbe
Update Gradient Builder of Some Ops for OpSet13 (#5748)
* gradient builder for opset13

* code clean.

* resolve comments

* stop grad for axes input

* add split to stop grad list.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-13 16:20:34 +08:00
Alberto Magni
88c3704257
Add shape inference for additional ops
This commit adds shape inference support for the following ops:

SoftmaxCrossEntropy
SoftmaxCrossEntropyLossGrad
SoftmaxCrossEntropyGrad
LayerNormalizationGrad
2020-11-12 20:18:54 +00:00
pengwa
49288de17c
Fix memory planning issues (#5752)
* Fix memory planning issues

* fix build

* fix the wrong line...
2020-11-13 03:07:59 +08:00
Vincent Wang
2a87108431
SoftmaxCrossEntropyLoss OpSet13. (#5777)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-11-12 15:50:34 +08:00
Sherlock
07dc25e939
Compute global gradient norm according to 'enable_grad_norm_clip' (#5728)
* Introduce PassThrough op to wait for all gradient ready before weight update

* Compute gradient norm for fp32 runs

* Update FE UT expected value

* Respect enable_grad_norm_clip
2020-11-11 21:10:34 -08:00
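The commit above computes a global gradient norm and applies clipping only when `enable_grad_norm_clip` is set. A common way to do this, shown here as a plain-Python sketch rather than the ORT graph implementation, is to take the L2 norm over all gradient tensors together and rescale them when it exceeds a threshold. The function names and the `max_norm` parameter are hypothetical.

```python
import math

def global_grad_norm(grads):
    """L2 norm over every element of every gradient tensor (flat lists here)."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm, enable_grad_norm_clip=True):
    """Return (possibly rescaled gradients, global norm before clipping)."""
    norm = global_grad_norm(grads)
    if not enable_grad_norm_clip or norm <= max_norm:
        # Clipping disabled or not needed: return unchanged copies.
        return [list(tensor) for tensor in grads], norm
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads], norm

# Two single-element gradient "tensors" with a global norm of 5.
grads = [[3.0], [4.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
unclipped, _ = clip_by_global_norm(grads, max_norm=1.0, enable_grad_norm_clip=False)
```

Because the norm depends on every gradient, all gradients must be ready before any weight update can use it, which is consistent with the PassThrough op mentioned in the first bullet.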
ashbhandare
5aec34500d
Add megatron transforms for BART (#5521)
* Large model export and run ORT Python support

* Megatron change

refine a bit

workaround self attention issue

use partitioned name for weights when megatron model parallel is enabled

Fix Megatron Transformer Issue (caused by the renaming)

Add UTs for T5 model parallel

Fix megatron seed issue

fix log a bit

checkpointing changes + rebase

Unintended reshape transform change

t5 layer norm changes

add t5 layer norm kernel

use template for t5 layer norm

template definition changes

no build error

add CPU cuda kernel

first unit test

other forward unit tests

add T5LayerNormGrad

Add c++ transform and test for T5 LN

minor fix

BART MLP Megatron transform

Add concat slice transform + test

Cosmetic improvements in concat slice transform

Constant folding bug fix + megatron attention transform for BART

Undo unnecessary changes

* Cleanup

* Remove unnecessary changes

* Cleanup megatron

* Windows build

* Add self attention test graph

* Correcting transforms + cleanup

* review comments

* review comments

* fix build and test failures

* Fix CI

* fix windows CI

Co-authored-by: Peng Wang <pengwa@microsoft.com>
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-11 16:21:36 -08:00
Xueyun Zhu
d8ace07ad7
Add CPU send/recv for pipeline (#5315)
* cpu send/recv

* clean up send/recv

* remove unused code

* assert and nccl option for mnist

* add a build option to enable building with CPU only. Without it, NCCL is always enabled, which breaks the build on machines that only have a CPU

* Add USE_MPI distinct from USE_NCCL/USE_HOROVOD

* fix

* fix

* exclude cpu send/recv for machines without mpi

Co-authored-by: Tim Harris <tiharr@microsoft.com>
2020-11-11 12:41:39 -08:00
Derek Murray
bc1768c7f1
Stop gradient flowing to the k input of TopK (#5762) 2020-11-11 10:24:44 -08:00
liqunfu
1416d12f0b
Liqun/merge e2e pipelines (#5702)
* Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Still keep cpp e2e pipeline until this new pipeline is stable.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-11 09:42:08 -08:00
edgchen1
2acdc3cd82
Move GetUseDeterministicCompute() to OpKernelContext to avoid need to downcast to OpKernelContextInternal. (#5729) 2020-11-09 11:37:06 -08:00
Weixing Zhang
bb1af718b5
fix build failures due to recent change(858040fa) in CUDA EP (#5736)
Part of the reduction kernel code was changed in 858040fa, which
causes failures in the ROCm build because the ROCm EP shares some code
with the CUDA EP. This PR is a quick fix that stops sharing two files
for now, to unblock enabling CI on the ROCm EP. Another PR that
leverages 858040fa for the ROCm EP will follow later.
2020-11-09 08:41:30 -08:00
Weixing Zhang
fff85a6a35
Add GPU kernels for ROCm EP (#5655)
* Add kernels for AMD GPU.

This PR is mostly about GPU kernels for the ROCm EP. Because CUDA and HIP are similar GPU programming languages with similar math library calls, one principle of the ROCm EP design is to share CUDA kernels with ROCm as much as possible. The script amd_hipify.py was therefore created to convert CUDA kernels to ROCm HIP kernels automatically during the compilation phase. However, for reasons such as performance issues and syntax differences, some converted kernels need manual intervention. These kernels are checked into the repo for now. To avoid manual intervention, the plan is to refactor the CUDA kernels so that they are as portable as possible between the CUDA EP and the ROCm EP.

Please refer to "HIP Porting Guide" for details.

* Like LAMB, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2, because the current AMD GPU compiler has a perf issue when a kernel parameter is a structure passed by value.

* Use hipMemsetAsync and add checks on HIP calls.

* move the generated files to build folder.

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2020-11-06 16:11:06 -08:00
edgchen1
858040faaa
Implement reduce_matrix_columns() to optimize ReduceSum (#5639)
Implement reduce_matrix_columns() to optimize ReduceSum.
2020-11-05 10:25:00 -08:00
ashbhandare
6d8e81cb08
Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5691)
* Split  change

* ReduceSum and Split change

* Other op changes, Grad builder, tests, registering required opset 13 ops

* Rebase fixes

* Fix tests, add some more

* Review changes, rebase

* Fix windows build

* Disable new tests for TensorRT EP

* Disable unsupported for OpenVINO

Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-04 20:00:27 -08:00
wezuo
62a99824cb
Wezuo/priority in nodedef (#5692)
* set the priority in nodedef

* remove debugging stmts

* revoke zero builder

* remove unnecessary namespace comment

Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-04 12:40:37 -08:00
edgchen1
28f1e32898
Loosen tolerance of CudaKernelTest.ReduceSum_MidTensor, allow test random seed to be regenerated within a test run. (#5675) 2020-11-03 10:37:00 -08:00
Changming Sun
87e1063e19
Revert "Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5488)" (#5668)
This reverts commit db63c5d10f.
2020-11-02 16:09:22 -08:00
Jesse Benson
1495f737ca Use cudaMemsetAsync and add checks on CUDA calls. 2020-11-02 11:25:13 -08:00