This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry.
Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date, and building images every time wastes build agent resources.
With this change, a given build image is looked up in a cache container registry: if it is present, it is pulled; otherwise, it is built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, the docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository.
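A minimal sketch of the lookup described above, not the script added by this PR. The function names, the field labels, and the tag layout are assumptions made for illustration; the source only states that the digest covers the dockerfile, the build context directory, and certain build options, and that it is part of the image tag.

```python
import hashlib
from pathlib import Path
from typing import Callable, Sequence


def image_cache_digest(dockerfile: Path, context_dir: Path,
                       build_options: Sequence[str]) -> str:
    """Hash the inputs that determine the contents of a build image."""
    h = hashlib.sha256()

    def update(label: str, data: bytes) -> None:
        # Length-prefix every field so adjacent fields cannot be confused.
        h.update(f"{label}:{len(data)}:".encode())
        h.update(data)

    update("dockerfile", dockerfile.read_bytes())
    # Walk the context in sorted order so the digest does not depend on
    # filesystem enumeration order.
    for path in sorted(p for p in context_dir.rglob("*") if p.is_file()):
        update("path", path.relative_to(context_dir).as_posix().encode())
        update("file", path.read_bytes())
    for option in build_options:
        update("option", option.encode())
    return h.hexdigest()


def cache_image_tag(repository: str, digest: str) -> str:
    # Hypothetical tag layout: the digest is used directly as the tag.
    return f"{repository}:{digest}"


def ensure_image(tag: str,
                 pull: Callable[[str], bool],
                 build_and_push: Callable[[str], None]) -> str:
    """Pull the image if the cache has it, otherwise build and push it.

    `pull` and `build_and_push` are injected so that the decision logic
    can be exercised without a docker daemon.
    """
    if pull(tag):
        return "pulled"
    build_and_push(tag)
    return "built"
```

Because any change to the dockerfile, a context file, or a build option yields a different digest, a stale image is never reused; it is simply left behind in the registry, which is why periodic cleanup is needed.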
The cache container registry will need to be cleaned up periodically. This is not automated yet.
* gradient builder for opset13
* code clean.
* resolve comments
* stop grad for axes input
* add split to stop grad list.
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
This commit adds shape inference support for the following ops:
SoftmaxCrossEntropy
SoftmaxCrossEntropyLossGrad
SoftmaxCrossEntropyGrad
LayerNormalizationGrad
* Introduce PassThrough op to wait for all gradient ready before weight update
* Compute gradient norm for fp32 runs
* Update FE UT expected value
* Respect enable_grad_norm_clip
* Large model export and run ORT Python support
* Megatron change
refine a bit
workaround self attention issue
use partitioned name for weights when megatron model parallel is enabled
Fix Megatron Transformer Issue (caused by the renaming)
Add UTs for T5 model parallel
Fix megatron seed issue
fix log a bit
checkpointing changes + rebase
Unintended reshape transform change
t5 layer norm changes
add t5 layer norm kernel
use template for t5 layer norm
template definition changes
no build error
add CPU cuda kernel
first unit test
other forward unit tests
add T5LayerNormGrad
Add c++ transform and test for T5 LN
minor fix
BART MLP Megatron transform
Add concat slice transform + test
Cosmetic improvements in concat slice transform
Constant folding bug fix + megatron attention transform for BART
Undo unnecessary changes
* Cleanup
* Remove unnecessary changes
* Cleanup megatron
* Windows build
* Add self attention test graph
* Correcting transforms + cleanup
* review comments
* review comments
* fix build and test failures
* Fix CI
* fix windows CI
Co-authored-by: Peng Wang <pengwa@microsoft.com>
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* cpu send/recv
* clean up send/recv
* remove unused code
* assert and nccl option for mnist
* add build option to enable building with only CPU. Without this, NCCL is always enabled, which breaks the build on machines that only have a CPU
* Add USE_MPI distinct from USE_NCCL/USE_HOROVOD
* fix
* fix
* exclude cpu send/recv for machines without mpi
Co-authored-by: Tim Harris <tiharr@microsoft.com>
* Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Keep the cpp e2e pipeline until this new pipeline is stable.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Some of the code for the reduction kernels was changed in 858040fa,
which causes failures in the ROCm build because the ROCm EP shares some
code with the CUDA EP. This PR is a quick fix for the failure: it stops
sharing two files for now to unblock enabling CI on the ROCm EP. Another
PR to leverage 858040fa for the ROCm EP will follow later.
* Add kernels for AMD GPU.
This PR is mostly about GPU kernels for the ROCm EP. Because the GPU programming languages (CUDA and HIP) and the math library calls are similar, one principle in the ROCm EP design is to share CUDA kernels with ROCm as much as possible. Thus, the script amd_hipify.py has been created to convert CUDA kernels to ROCm HIP kernels automatically during the compilation phase. However, for reasons such as perf issues and syntax differences, some converted kernels need manual intervention. These kernels are checked into the repo for now. To avoid manual intervention, the plan is to refactor the CUDA kernels to make them as portable as possible between the CUDA EP and the ROCm EP.
Please refer to "HIP Porting Guide" for details.
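As an illustration of the kind of conversion described above, here is a minimal sketch of a source-to-source rename pass. It is not the contents of amd_hipify.py; the function name `hipify` and the small mapping table are assumptions for this example, although each listed HIP name is the HIP counterpart of the CUDA name next to it.

```python
import re

# A few CUDA -> HIP renames. A real conversion script needs far more
# entries and special cases than this illustrative table.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaStream_t": "hipStream_t",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpyAsync": "hipMemcpyAsync",
    "cudaMemsetAsync": "hipMemsetAsync",
}

# Match whole identifiers only, so that e.g. cudaMallocManaged is not
# partially rewritten by the cudaMalloc rule.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name)
               for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def hipify(source: str) -> str:
    """Return `source` with the known CUDA names replaced by HIP names."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)
```

A pure rename pass like this cannot handle the perf issues or syntax differences mentioned above, which is why some converted kernels still need manual changes.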
* Like LAMB, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2: the current AMD GPU compiler has a perf issue with kernel parameters that are structures passed by value.
* Use hipMemsetAsync and add checks on HIP calls.
* move the generated files to build folder.
Co-authored-by: Jesse Benson <jesseb@microsoft.com>
* Split change
* ReduceSum and Split change
* Other op changes, Grad builder, tests, registering required opset 13 ops
* Rebase fixes
* Fix tests, add some more
* Review changes, rebase
* Fix windows build
* Disable new tests for TensorRT EP
* Disable unsupported for OpenVINO
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
The ROCm EP is designed and implemented on top of ROCm, the AMD GPU software stack. For details about ROCm, see: https://rocmdocs.amd.com/en/latest/
ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: MIOpen
5. Collective Communication library: RCCL
6. cub: hipCUB
7. …
Current status:
BERT-L and GPT2 training can be run on AMD GPUs with data parallelism.
Next:
1. Make more GPU code shareable between the ROCm EP and the CUDA EP, since the HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The ROCm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big (~180 files), it was suggested to split it into two parts: one for the ROCm kernels and the other for everything else.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* Add test for FastGelu + GeluRecompute.
* Fix GeluRecompute for 2 inputs case.
* Fix test for BiasGelu + GeluRecompute.
* Copy all inputs to Gelu, not just 2.
* Move GeluRecompute test to training-specific file.
Description: This PR makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses the _mm_pause intrinsic in spin loops, which helps avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases.
Motivation and Context
The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.
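The point about static methods can be illustrated with a small sketch. This is a hypothetical Python stand-in, not the C++ ThreadPool class from this change; the class layout and the name `try_parallel_for` are assumptions. It shows why a static entry point that takes the pool as an argument can also cover the case where no pool exists.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class ThreadPool:
    """Hypothetical stand-in for a thread pool wrapper."""

    def __init__(self, num_threads: int) -> None:
        self._executor = ThreadPoolExecutor(max_workers=num_threads)

    def shutdown(self) -> None:
        self._executor.shutdown()

    @staticmethod
    def try_parallel_for(pool: Optional["ThreadPool"], total: int,
                         fn: Callable[[int], None]) -> None:
        """Run fn(0), ..., fn(total - 1), in parallel when a pool is given."""
        if pool is None:
            # No threading configured: run the work sequentially.
            for i in range(total):
                fn(i)
            return
        # Consume the iterator so the call waits for every task to finish
        # and any exception raised by fn is propagated.
        list(pool._executor.map(fn, range(total)))
```

An instance method could not be called when the pool is absent, so callers would have to branch on that themselves. With the static form, call sites look the same whichever threading implementation is in use.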
replace number matching with relaxed comparison in frontend tests
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Introduce sparse_initializers support.
Convert them to dense on model load and prune graph_proto_
so they don't consume space. Convert back to sparse on ORT Format model save.
Implement serializing sparse initializers to OrtFormat.
Fix Model::ToProto() to return original sparse initializers
Set a flag that graph_sync is needed when loading a simple ORT Format model;
otherwise nothing is resolved.
Add ORT Format history to README.md
ifdef MINIMAL build for DenseToSparseTensorInitializer
Allow duplicate initializers to support existing models.
Issue a warning instead of aborting.
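The dense/sparse round trip for initializers described above can be sketched as follows. This is a simplified illustration, not the ORT implementation: it assumes a flat (1-D) tensor of floats and linearized indices, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def sparse_to_dense(values: Sequence[float], indices: Sequence[int],
                    size: int) -> List[float]:
    """Expand (values, linear indices) into a dense list of length `size`."""
    dense = [0.0] * size
    for value, index in zip(values, indices):
        dense[index] = value
    return dense


def dense_to_sparse(dense: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Keep only the non-zero entries and remember where they were."""
    values: List[float] = []
    indices: List[int] = []
    for index, value in enumerate(dense):
        if value != 0.0:
            values.append(value)
            indices.append(index)
    return values, indices
```

Usage, converting to dense on load and back to sparse on save:

```python
dense = sparse_to_dense([1.5, -2.0], [1, 4], size=6)
values, indices = dense_to_sparse(dense)
```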
* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
The existing implementation of the GatherGrad CUDA kernel does not parallelize its work well for certain inputs, which can lead to poor performance.
The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output.
Previously, each sum was computed by a single thread. If a single sum covers a large number of values, it can significantly increase the overall kernel execution time.
The updated version has an alternate implementation which splits the sums into partial sums which get accumulated together later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead.
The implementation was adapted from PyTorch (b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
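The difference between the two strategies can be sketched on the CPU. This is an illustration of the partial-sum idea only, not the CUDA kernel: it uses scalar gradients, sequential loops in place of GPU threads, and hypothetical function names and a hypothetical `segment_size` parameter.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def gather_grad_direct(indices: Sequence[int], grads: Sequence[float],
                       num_rows: int) -> List[float]:
    """Baseline: each output row is one long sum (one 'thread' per sum)."""
    out = [0.0] * num_rows
    for row, grad in zip(indices, grads):
        out[row] += grad
    return out


def gather_grad_partial_sums(indices: Sequence[int], grads: Sequence[float],
                             num_rows: int,
                             segment_size: int = 2) -> List[float]:
    """Alternate: split each sum into segments, then combine the partials."""
    # Group the positions of the contributions by destination row.
    positions: Dict[int, List[int]] = defaultdict(list)
    for pos, row in enumerate(indices):
        positions[row].append(pos)

    # Phase 1: every segment is an independent unit of work, so a long sum
    # no longer has to be handled by a single worker.
    partials: List[Tuple[int, float]] = []
    for row, pos_list in positions.items():
        for start in range(0, len(pos_list), segment_size):
            segment = pos_list[start:start + segment_size]
            partials.append((row, sum(grads[p] for p in segment)))

    # Phase 2: accumulate the partial sums into the output rows.
    out = [0.0] * num_rows
    for row, partial in partials:
        out[row] += partial
    return out
```

Building the per-row segments corresponds to the input analysis mentioned above; it is extra work that only pays off when some row receives many contributions.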
* Add test for CommonSubexpressionElimination in training.
* Enable CommonSubexpressionElimination in training.
* Add CommonSubexpressionEliminationApplyOnce for training.
* TopKGrad CPU kernel
* use Scatter for GatherElementsGrad and TopKGrad.
* rollback convgrad change.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Add type inference for BroadcastGradientArgs
This change enables the ONNX shape and type inference to work on a function body containing a BroadcastGradientArgs op. Without this change, the dummy inference function is used, and no types are inferred for the output here:
531e6dd459/onnx/shape_inference/implementation.cc (L467-L469)
* Handle optional outputs.
* Create a shared version of the InferenceSession wrapper class and update relevant tests to use it.
Include domain in the ops counting helper so it's more general and we don't need to duplicate it in the nchwc tests. Update tests to include domain in key being checked.
* Fix some training tests
* Fix prefixing of contrib op names in test
* Add CUDA option to run copy in default stream
This change fixes #4829. Thanks @maherzog for providing the repro!
The bug is caused by memory reuse in the BFC arena, where the copy and
compute streams in CUDA have a race condition.
The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that
reduces the cost of syncing CPU and GPU on alloc/free. This means that
when the CPU allocs/frees memory, the GPU might not have finished its
previous work on that memory, so the CPU and GPU can run asynchronously.
This is OK if there is only one stream, where the execution order
on the CPU and GPU is consistent. For example, if we have two kernels
A and B and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory, since computeA and computeB
will not race as long as they run in the same GPU compute
stream.
However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB,
the order of execution on the GPU could have copyA happen after computeB
if copy and compute happen in different GPU streams.
This change makes copy run in the default compute stream, while adding
an option to fall back to the previous behavior if there is a perf hit.
This is a short-term fix until the BFC arena can support multiple streams.
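The ordering argument above can be checked with a small simulation. This is a sketch under simplifying assumptions, not ORT or CUDA code: each stream is modeled as a queue whose internal order the GPU preserves, and streams are assumed to be unordered relative to each other. The function names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def interleavings(queues: List[List[str]]) -> Iterator[Tuple[str, ...]]:
    """Yield every global order that keeps the order within each queue."""
    if all(not queue for queue in queues):
        yield ()
        return
    for i, queue in enumerate(queues):
        if queue:
            rest = queues[:i] + [queue[1:]] + queues[i + 1:]
            for tail in interleavings(rest):
                yield (queue[0],) + tail


def can_race(streams: Dict[str, List[str]]) -> bool:
    """True if some valid GPU order runs copyA after computeB.

    On the CPU, freeA happens before allocB, so the arena may hand A's
    block to B. If copyA can still execute after computeB on the GPU, the
    two operations touch the same memory in the wrong order.
    """
    return any(order.index("copyA") > order.index("computeB")
               for order in interleavings(list(streams.values())))


# One stream: the GPU order matches the CPU order, so reuse is safe.
single_stream = {"compute": ["copyA", "computeB"]}
# Separate copy and compute streams: copyA may run after computeB.
two_streams = {"copy": ["copyA"], "compute": ["computeB"]}
```

Here `can_race(single_stream)` is False and `can_race(two_streams)` is True, which matches the reasoning for moving the copy into the default compute stream.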
Users may use the following options to revert to the previous behavior:
C API:
struct OrtCUDAProviderOptions cudaProviderOpt;
cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
CUDAExecutionProviderInfo cudaEPInfo;
cudaEPInfo.do_copy_in_default_stream = false;
C# API:
pending...
Python:
import onnxruntime
onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)
* Confirmed the test fails in CI when doing copy in a separate stream
Revert the test to get CI to pass for now
* Fix Windows test
* Address CR