Commit graph

345 commits

Author SHA1 Message Date
Edward Chen
71e7c2b423
Cache build docker images in container registry. (#5811)
This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry.

Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date, and building images every time wastes build agent resources.

With this change, a given build image is looked up in a cache container registry: if present, it is pulled; otherwise, it is built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, the docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository.

The cache container registry will need to be cleaned up periodically. This is not automated yet.
2020-11-17 17:02:24 -08:00
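The cache lookup described in 71e7c2b423 can be illustrated with a small sketch. It is not the PR's script: the function names, the choice of SHA-256, the 16-character tag length, and the callback-based registry interface are all assumptions made for illustration. It only shows how a digest over the dockerfile, the build context, and the build options can drive a pull-or-build decision.

```python
import hashlib
import os


def build_image_digest(dockerfile_path, context_dir, build_options):
    """Hash the dockerfile, every file in the build context, and the options.

    Hypothetical helper: the real PR may hash different inputs or use a
    different algorithm.
    """
    h = hashlib.sha256()
    with open(dockerfile_path, "rb") as f:
        h.update(f.read())
    h.update(b"\0")
    # Walk the context in a stable order so the digest is reproducible.
    for root, dirs, files in os.walk(context_dir):
        dirs.sort()
        for name in sorted(files):
            path = os.path.join(root, name)
            h.update(os.path.relpath(path, context_dir).encode())
            h.update(b"\0")
            with open(path, "rb") as f:
                h.update(f.read())
            h.update(b"\0")
    for option in build_options:
        h.update(option.encode())
        h.update(b"\0")
    return h.hexdigest()


def cached_image_tag(repository, digest):
    """Embed a digest prefix in the image tag (tag layout is an assumption)."""
    return f"{repository}:{digest[:16]}"


def pull_or_build(tag, exists, pull, build_and_push):
    """Pull the image if the cache registry has it, else build and push it.

    `exists`, `pull`, and `build_and_push` are caller-supplied callables
    standing in for registry and docker operations.
    """
    if exists(tag):
        pull(tag)
        return "pulled"
    build_and_push(tag)
    return "built"
```

Any change to the dockerfile, a context file, or an option yields a different tag, so a stale image is never reused. The old tags are what accumulate in the registry and need the periodic cleanup the commit mentions.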
zhijxu
89e5b3a24f resolve review comments 2020-11-16 11:23:01 +08:00
zhijxu
89902c2519 fix frontend bug.
An old ORT session may already exist when creating a new ORT session; this may cause an OOM error.
2020-11-16 11:23:01 +08:00
Jesse Benson
ced5b66306 Re-enable multi-tensor-apply for LAMB optimizer 2020-11-15 09:35:00 -08:00
Weixing Zhang
fc614ad050 revert the code change which was based on b4869926
The change b4869926, which removed the per-thread allocator, would cause a
segfault in distributed training.

In addition, add dockerfile for ROCm3.9
2020-11-15 00:24:32 -08:00
Vincent Wang
0c8902cbbe
Update Gradient Builder of Some Ops for OpSet13 (#5748)
* gradient builder for opset13

* code clean.

* resolve comments

* stop grad for axes input

* add split to stop grad list.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-13 16:20:34 +08:00
Alberto Magni
88c3704257
Add shape inference for additional ops
This commit adds shape inference support for the following ops:

SoftmaxCrossEntropy
SoftmaxCrossEntropyLossGrad
SoftmaxCrossEntropyGrad
LayerNormalizationGrad
2020-11-12 20:18:54 +00:00
pengwa
49288de17c
Fix memory planning issues (#5752)
* Fix memory planning issues

* fix build

* fix the wrong line...
2020-11-13 03:07:59 +08:00
Vincent Wang
2a87108431
SoftmaxCrossEntropyLoss OpSet13. (#5777)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-11-12 15:50:34 +08:00
Sherlock
07dc25e939
Compute global gradient norm according to 'enable_grad_norm_clip' (#5728)
* Introduce PassThrough op to wait for all gradient ready before weight update

* Compute gradient norm for fp32 runs

* Update FE UT expected value

* Respect enable_grad_norm_clip
2020-11-11 21:10:34 -08:00
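As a rough illustration of 07dc25e939, the sketch below computes a global gradient norm over all gradient tensors and applies clipping only when a flag is set. The flag name comes from the commit title. Everything else is assumed for illustration: the function names, the `max_norm` parameter, and the use of flat Python lists in place of tensors. The commit does not show how ORT wires this into the training graph.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every element of every gradient tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads, max_norm, enable_grad_norm_clip=True):
    """Scale all gradients by the same factor if the global norm is too large.

    `max_norm` is a hypothetical parameter; the flag is assumed to simply
    gate the clipping step.
    """
    norm = global_grad_norm(grads)
    if not enable_grad_norm_clip or norm <= max_norm or norm == 0.0:
        return grads, norm
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads], norm
```

The norm depends on every gradient, so no weight update can start until all gradients are ready. That is consistent with the commit's note about waiting for all gradients before the weight update.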
ashbhandare
5aec34500d
Add megatron transforms for BART (#5521)
* Large model export and run ORT Python support

* Megatron change

refine a bit

workaround self attention issue

use partitioned name for weights when megatron model parallel is enabled

Fix Megatron Transformer Issue (caused by the renaming)

Add UTs for T5 model parallel

Fix megatron seed issue

fix log a bit

checkpointing changes + rebase

Unintended reshape transform change

t5 layer norm changes

add t5 layer norm kernel

use template for t5 layer norm

template definition changes

no build error

add CPU cuda kernel

first unit test

other forward unit tests

add T5LayerNormGrad

Add c++ transform and test for T5 LN

minor fix

BART MLP Megatron transform

Add concat slice transform + test

Cosmetic improvements in concat slice transform

Constant folding bug fix + megatron attention transform for BART

Undo unnecessary changes

* Cleanup

* Remove unnecessary changes

* Cleanup megatron

* Windows build

* Add self attention test graph

* Correcting transforms + cleanup

* review comments

* review comments

* fix build and test failures

* Fix CI

* fix windows CI

Co-authored-by: Peng Wang <pengwa@microsoft.com>
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-11 16:21:36 -08:00
Xueyun Zhu
d8ace07ad7
Add CPU send/recv for pipeline (#5315)
* cpu send/recv

* clean up send/recv

* remove unused code

* assert and nccl option for mnist

* add build option to enable a CPU-only build. Without this, NCCL is always enabled, which breaks the build on machines that only have a CPU

* Add USE_MPI distinct from USE_NCCL/USE_HOROVOD

* fix

* fix

* exclude cpu send/recv for machines without mpi

Co-authored-by: Tim Harris <tiharr@microsoft.com>
2020-11-11 12:41:39 -08:00
Derek Murray
bc1768c7f1
Stop gradient flowing to the k input of TopK (#5762) 2020-11-11 10:24:44 -08:00
liqunfu
1416d12f0b
Liqun/merge e2e pipelines (#5702)
* Create an Azure Pipeline to merge the cpp and python e2e pipelines into one. Keep the cpp e2e pipeline until this new pipeline is stable.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-11 09:42:08 -08:00
edgchen1
2acdc3cd82
Move GetUseDeterministicCompute() to OpKernelContext to avoid need to downcast to OpKernelContextInternal. (#5729) 2020-11-09 11:37:06 -08:00
Weixing Zhang
bb1af718b5
fix build failures due to recent change (858040fa) in CUDA EP (#5736)
Part of the code for reduction kernels was changed in 858040fa,
which causes failures in the ROCm build since the ROCm EP shares some code
with the CUDA EP. This PR is a quick fix that stops sharing two files
for now, to unblock enabling CI on the ROCm EP. Another PR to leverage
858040fa for the ROCm EP will follow later.
2020-11-09 08:41:30 -08:00
Weixing Zhang
fff85a6a35
Add GPU kernels for ROCm EP (#5655)
* Add kernels for AMD GPU.

This PR is mostly about GPU kernels for the ROCm EP. Because the GPU programming languages (CUDA and HIP) and the math library calls are similar, one principle in the ROCm EP design is to share CUDA kernels with ROCm as much as possible. Thus, the script amd_hipify.py was created to convert CUDA kernels to ROCm HIP kernels automatically during the compilation phase. However, for reasons such as perf issues and syntax differences, some converted kernels need manual intervention. These kernels will be checked into the repo for now. To avoid manual intervention, the plan is to refactor the CUDA kernels to make them as portable as possible between the CUDA EP and the ROCm EP.

Please refer to "HIP Porting Guide" for details.

* Like LAMB, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2; the current AMD GPU compiler has a perf issue with kernel parameters that are structures passed by value.

* Use hipMemsetAsync and add checks on HIP calls.

* move the generated files to build folder.

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2020-11-06 16:11:06 -08:00
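The idea of converting CUDA sources to HIP automatically, as fff85a6a35 describes, can be sketched as an identifier rewrite. This is not the contents of amd_hipify.py, which the log does not show. The mapping table below is a small, assumed subset of CUDA runtime names and their HIP counterparts, and the function name `hipify` is hypothetical.

```python
import re

# A few CUDA runtime identifiers and their HIP equivalents (illustrative subset).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpyAsync": "hipMemcpyAsync",
    "cudaMemsetAsync": "hipMemsetAsync",
    "cudaStream_t": "hipStream_t",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Match whole identifiers only, so unrelated names containing these strings
# as a substring are left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def hipify(source):
    """Rewrite known CUDA identifiers in `source` to their HIP names."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)
```

A purely textual rewrite like this cannot address the perf or syntax differences the commit mentions, which is why some converted kernels still needed manual changes.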
edgchen1
858040faaa
Implement reduce_matrix_columns() to optimize ReduceSum (#5639)
Implement reduce_matrix_columns() to optimize ReduceSum.
2020-11-05 10:25:00 -08:00
ashbhandare
6d8e81cb08
Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5691)
* Split change

* ReduceSum and Split change

* Other op changes, Grad builder, tests, registering required opset 13 ops

* Rebase fixes

* Fix tests, add some more

* Review changes, rebase

* Fix windows build

* Disable new tests for TensorRT EP

* Disable unsupported for OpenVINO

Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-04 20:00:27 -08:00
wezuo
62a99824cb
Wezuo/priority in nodedef (#5692)
* set the priority in nodedef

* remove debugging stmts

* revoke zero builder

* remove unnecessary namespace comment

Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-04 12:40:37 -08:00
edgchen1
28f1e32898
Loosen tolerance of CudaKernelTest.ReduceSum_MidTensor, allow test random seed to be regenerated within a test run. (#5675) 2020-11-03 10:37:00 -08:00
Changming Sun
87e1063e19
Revert "Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5488)" (#5668)
This reverts commit db63c5d10f.
2020-11-02 16:09:22 -08:00
Jesse Benson
1495f737ca Use cudaMemsetAsync and add checks on CUDA calls. 2020-11-02 11:25:13 -08:00
ashbhandare
db63c5d10f
Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5488)
* Split change

* ReduceSum and Split change

* Other op changes, Grad builder, tests, registering required opset 13 ops

* Rebase fixes

* Fix tests, add some more

* Review changes, rebase

* Fix windows build

Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-02 10:51:48 -08:00
M. Zeeshan Siddiqui
f2168cef29
Misc. cleanup. (#5659)
Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-02 07:05:28 -08:00
M. Zeeshan Siddiqui
9af0d48524
Memory planner and pattern generation enhancements. (#4443)
* static allocation.

* changes.

* contiguous dynamic allocation.

* contiguous dynamic allocation.

* fix bugs.

* fix bug.

* build errors.

* PR feedback.

* PR feedback.

* Update Graph builder for nccl_allreduce, mps.

* misc.

* fix windows build break.

* changes.

* fine-grained memory-time scheduling.

* merge.

* fix misc stuff.

* fix windows build.

* fix windows build.

* fix merge bug.

* merge conflicts.

* revert onnx-tensorrt submodule commit.

* fix submodule commit.

* misc.

* merge conflicts.

* Revert "merge conflicts."

This reverts commit 319a071a6e.

* merge conflict.

* merge conflict.

* merge conflicts.

* fixes.

* PR feedback.

* build break.

* build break.

* Add asserts.

* Add asserts.

* asserts.

* asserts.

* asserts.

* asserts.

* asserts.

* fixes.

* fixes.

Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: root <root@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-01 23:05:46 -08:00
Zhang Lei
17bce6f07e
Implement Im2colNd NHWC and related qlinearconv logic for u8s8. (#5612)
Implement Im2colNd NHWC and related qlinearconv logic for u8s8, and training.
2020-10-30 15:28:30 -07:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be run on AMD GPU with data parallelism.

Next:
1. Make more GPU code shareable between the ROCm EP and the CUDA EP, since the HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The ROCm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big (~180 files), it was suggested to split it into two parts: the ROCm kernels and the non-ROCm-kernel changes.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00
Vincent Wang
1fa1c51544
bug fix for name of gradient constant (#5626)
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-10-30 07:08:19 +08:00
Sergii Dymchenko
2e1fa3ccb7
Fix GeluRecompute for 2 inputs case. (#5573)
* Add test for FastGelu + GeluRecompute.

* Fix GeluRecompute for 2 inputs case.

* Fix test for BiasGelu + GeluRecompute.

* Copy all inputs to Gelu, not just 2.

* Move GeluRecompute test to training-specific file.
2020-10-29 00:07:13 -07:00
liqunfu
5129b4d5bc
batch size tests (#5508)
* batch size tests

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-28 15:55:40 -07:00
Tim Harris
5e8952ef89
ThreadPool clean up: mm_pause in loops, correctly spin-then-wait, and adopt static methods consistently in the API (#5590)
Description: This change makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses mm_pause intrinsics in spin loops, helping avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than to start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases.

Motivation and Context
The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.
2020-10-28 09:49:18 +00:00
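The spin-first ordering from 5e8952ef89 can be illustrated with a toy work queue in which a worker polls for a bounded number of iterations before falling back to a blocking wait. It is a sketch under assumptions, not the ORT ThreadPool: the class name and `spin_count` are hypothetical, there is no work stealing, and Python has no pause intrinsic, so the point where native code would issue one is only marked with a comment.

```python
import threading
from collections import deque


class SpinThenWaitQueue:
    """Toy queue whose consumers spin briefly before blocking."""

    def __init__(self, spin_count=1000):
        self._items = deque()
        self._cond = threading.Condition()
        self._spin_count = spin_count

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def get(self):
        # Phase 1: spin. Poll without blocking for a bounded number of tries.
        for _ in range(self._spin_count):
            try:
                return self._items.popleft()
            except IndexError:
                pass  # native code would execute a pause instruction here
        # Phase 2: wait. Block until a producer signals new work.
        with self._cond:
            while not self._items:
                self._cond.wait()
            return self._items.popleft()
```

Spinning first keeps latency low when work arrives quickly. The blocking fallback avoids burning a core when the queue stays empty.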
liqunfu
92662659ba
Liqun/remove number matching (#5606)
replace number matching with relaxed comparison in frontend tests
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-27 21:27:37 -07:00
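One way to read "relaxed comparison" in 92662659ba is tolerance-based matching of expected values instead of exact number matching. The helper below is a hypothetical sketch of that idea using the standard library. The function name and default tolerances are assumptions, not what the frontend tests actually use.

```python
import math


def assert_values_close(actual, expected, rel_tol=1e-4, abs_tol=1e-6):
    """Fail only if a value differs from its expectation beyond the tolerances."""
    if len(actual) != len(expected):
        raise AssertionError(f"length mismatch: {len(actual)} != {len(expected)}")
    for i, (a, e) in enumerate(zip(actual, expected)):
        if not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol):
            raise AssertionError(f"value {i}: {a} is not close to {e}")
```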
Ryan Hill
e90b6f06d1
Factor out IAllocator so that it can be shared with shared providers (#5567)
* Factor out IAllocator so shared providers can use it directly.
2020-10-27 17:28:17 -07:00
Weixing Zhang
b851973f22
pipeline_worker_pool_.JoinAll() should be called in pipeline code path (#5604)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-10-27 11:57:46 -07:00
ytaous
6f824c25e5
Dropout op elimination - enable for ORT training (#5588)
* dropout elimination

* per comments

* fix build

* fix build

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-27 11:51:23 -07:00
Dmitri Smirnov
3433576fd3
Support for Sparse Initializers (#5540)
Introduce sparse_initializers support.
  Convert them to dense on model load and prune graph_proto_
  so they don't consume space. Convert back to sparse on ORT Format model save.
  Implement serializing sparse initializers to OrtFormat.
  Fix Model::ToProto() to return original sparse initializers
  Set a flag that graph_sync is needed when loading a simple ORT Format model;
  otherwise nothing is resolved.
  Add ORT Format history to README.md
  ifdef MINIMAL build for DenseToSparseTensorInitializer
  Allow duplicate initializers to support existing models.
  Issue a warning instead of aborting.

* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.



Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-27 10:32:06 -07:00
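The dense/sparse round trip described in 3433576fd3 can be sketched with flat lists standing in for tensors. The representation is a set of non-zero values plus their linearized indices. The function names are hypothetical, and the real conversion operates on ONNX protobuf messages rather than Python lists.

```python
def sparse_to_dense(values, indices, size):
    """Expand (values, linearized indices) into a dense list of length `size`."""
    dense = [0.0] * size
    for value, index in zip(values, indices):
        dense[index] = value
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries and remember where they were."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return values, indices
```

Converting on load lets the rest of the runtime see ordinary dense initializers, while converting back on save preserves the compact form in the stored model.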
Sherlock
694a4d6413
Add more loggings for GradientBuilder (#5556)
* Add more loggings for GradientBuilder

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-26 15:15:52 -07:00
edgchen1
68fe722691
GatherGrad optimization (#5524)
The existing implementation of the GatherGrad CUDA kernel does not parallelize the work well for certain inputs, which can lead to poor performance.

The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output.

Previously, each sum was computed by a single thread. If there is an instance of a summation of a large number of values, it can significantly impact the overall kernel execution time.

The updated version has an alternate implementation which splits the sums into partial sums which get accumulated together later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead.

The implementation was adapted from PyTorch (b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
2020-10-26 12:53:53 -07:00
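The two strategies described in 68fe722691 can be mimicked on the CPU to show the arithmetic. This is a sequential sketch, not the CUDA kernel: the function names, `segment_size`, and `threshold` are assumptions. In the real kernel each partial sum would be computed by a separate thread, whereas here they are computed one after another.

```python
from collections import Counter


def gather_grad_direct(grad_rows, indices, num_rows):
    """One running sum per output row (the 'single thread per sum' shape)."""
    width = len(grad_rows[0]) if grad_rows else 0
    out = [[0.0] * width for _ in range(num_rows)]
    for row, idx in zip(grad_rows, indices):
        for j, value in enumerate(row):
            out[idx][j] += value
    return out


def gather_grad_partial_sums(grad_rows, indices, num_rows, segment_size=2):
    """Split each sum into segments, reduce them, then accumulate the partials."""
    width = len(grad_rows[0]) if grad_rows else 0
    # Group positions that target the same output row.
    order = sorted(range(len(indices)), key=lambda p: indices[p])
    partials = []
    start = 0
    while start < len(order):
        idx = indices[order[start]]
        end = start
        while (end < len(order) and indices[order[end]] == idx
               and end - start < segment_size):
            end += 1
        segment = [grad_rows[p] for p in order[start:end]]
        partials.append((idx, [sum(col) for col in zip(*segment)]))
        start = end
    out = [[0.0] * width for _ in range(num_rows)]
    for idx, partial in partials:
        for j, value in enumerate(partial):
            out[idx][j] += value
    return out


def gather_grad(grad_rows, indices, num_rows, threshold=2):
    """Pick a strategy from the largest number of values in any one sum."""
    largest_sum = max(Counter(indices).values(), default=0)
    if largest_sum > threshold:
        return gather_grad_partial_sums(grad_rows, indices, num_rows)
    return gather_grad_direct(grad_rows, indices, num_rows)
```

Counting how often each index occurs is the input analysis step. In this sketch it is a cheap dictionary pass, but on a GPU it is the part that needs results back on the CPU before a strategy can be chosen.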
Sergii Dymchenko
8224718f8f
Enable CommonSubexpressionElimination in training. (#5504)
* Add test for CommonSubexpressionElimination in training.

* Enable CommonSubexpressionElimination in training.

* Add CommonSubexpressionEliminationApplyOnce for training.
2020-10-26 11:25:15 -07:00
ashbhandare
0a9b83a313
Add zero test (#5476) 2020-10-21 17:12:00 -07:00
Vincent Wang
b48f596a91
GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel (#5511)
* TopKGrad CPU kernel

* use Scatter for GatherElementsGrad and TopKGrad.

* rollback convgrad change.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-10-21 09:29:29 +08:00
Xavier Dupré
66c8a441e0
Improves ReduceSum performance by removing transposition. (#5370)
* Improves ReduceSum performance
* Add min, max, L1, L2, logsum, sumsquare
* remove all reduce implementation including transpose
2020-10-20 10:36:31 +02:00
Juliana Franco
0298b9734e
Save in EndTraining only if in last rank (#5500)
* Only save partition of graph with loss (during EndTraining)

* fix comments

Co-authored-by: Juliana <jufranc@microsoft.com>
2020-10-19 14:16:48 -07:00
Derek Murray
0b59004666
Add fallback function implementation for DivGrad (#5518)
* Add fallback function implementation for DivGrad.

* Add shape inference for DivGrad.

* Add missing argument.

Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-19 10:47:47 -07:00
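For context on 0b59004666, the gradient of an elementwise division Y = A / B follows from calculus: dA = dY / B and dB = -dY * A / B^2. The sketch below expresses that with only elementwise negate, multiply, and divide, which is the kind of composition a function-body fallback would use. The actual op decomposition in the PR is not shown in the log. The function name is hypothetical, and broadcasting is ignored.

```python
def div_grad(dy, a, b):
    """Gradients of Y = A / B with respect to A and B (same-shape inputs only)."""
    da = [g / y for g, y in zip(dy, b)]
    db = [-g * x / (y * y) for g, x, y in zip(dy, a, b)]
    return da, db
```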
Derek Murray
6f65e2ad2c
Mark the dX and dB outputs of ConvGrad as OpSchema::Optional. (#5462)
* Mark the dB output of ConvGrad as OpSchema::Optional.

* Also mark dX as optional

Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-15 16:54:17 -07:00
Derek Murray
64f6d856e4
Add FlattenGrad and test. (#5461)
Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-15 16:11:57 -07:00
Derek Murray
88f6523baf
Add type inference for BroadcastGradientArgs (#5501)
* Add type inference for BroadcastGradientArgs

This change enables the ONNX shape and type inference to work on a function body containing a BroadcastGradientArgs op. Without this change, the dummy inference function is used, and no types are inferred for the output here:

531e6dd459/onnx/shape_inference/implementation.cc (L467-L469)

* Handle optional outputs.
2020-10-15 16:11:24 -07:00
Scott McKay
7da7e07909
Cleanup some test infrastructure (#5484)
* Created shared version of InferenceSession wrapper class and update relevant tests to use it.
Include domain in the ops counting helper so it's more general and we don't need to duplicate it in the nchwc tests. Update tests to include domain in key being checked.

* Fix some training tests

* Fix prefixing of contrib op names in test
2020-10-16 06:44:01 +10:00
KeDengMS
c444b9d76a
Add CUDA option to run copy in default stream (#5445)
* Add CUDA option to run copy in default stream

This change fixes #4829. Thanks @maherzog for providing the repro!

The bug is caused by memory reuse in the BFC arena, where the copy and
compute streams in CUDA have a race condition.

The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that
reduces the cost of syncing CPU and GPU on alloc/free. It means that
when the CPU allocs/frees memory, the GPU might not have finished its
previous work on that memory, so the CPU and GPU can run asynchronously.

This is OK if there's only one stream, where the execution order
on the CPU and GPU is consistent. For example, if we have two kernels
A and B, and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory, since computeA and computeB
will not race as long as they run in the same GPU compute
stream.

However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB,
the GPU could execute copyA after computeB
if copy and compute happen in different GPU streams.

This change makes copies run in the default compute stream, while adding
an option to fall back to the previous behavior if there's a perf hit. This
is a short-term fix until the BFC arena can support multiple streams.

Users may use the following options to revert to the previous behavior:
C API:
  struct OrtCUDAProviderOptions cudaProviderOpt;
  cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
  CUDAExecutionProviderInfo cudaEPInfo;
  cudaEPInfo.do_copy_in_default_stream = false;
C# API:
  pending...
Python:
  import onnxruntime
  onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)

* Confirmed the test fails in CI when doing copy in a separate stream

Revert the test to get CI to pass for now

* Fix Windows test

* Address CR
2020-10-12 22:12:05 -07:00
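The ordering argument in c444b9d76a can be checked with a toy model. Each stream is a list of operations that must run in order, and operations in different streams may interleave freely. The code below involves no CUDA. The function names are hypothetical, and the operation labels mirror the commit's allocA/copyA/computeB example, with the CPU-side alloc/free steps left out because they do not constrain GPU ordering.

```python
def interleavings(streams):
    """Yield every global order that preserves each stream's own order."""
    if all(not s for s in streams):
        yield []
        return
    for i, s in enumerate(streams):
        if s:
            rest = streams[:i] + [s[1:]] + streams[i + 1:]
            for tail in interleavings(rest):
                yield [s[0]] + tail


def has_race(streams):
    """True if some order lets computeB use the reused block before copyA ends."""
    return any(
        order.index("computeB") < order.index("copyA")
        for order in interleavings(streams)
    )


# One stream: copyA always precedes computeB, so reusing the block is safe.
single_stream = [["copyA", "computeB"]]
# Separate copy and compute streams: computeB may run first.
two_streams = [["copyA"], ["computeB"]]
```

With `single_stream`, `has_race` finds no bad ordering. With `two_streams` it does, which is the situation the default-stream option avoids.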