* Add CUDA option to run copy in default stream
This change fixes #4829. Thanks @maherzog for providing the repro!
The bug is caused by memory reuse in the BFC arena, where the copy and
compute streams in CUDA have a race condition.
The BFC arena is an arena allocator on top of cudaMalloc/Free that
reduces the cost of syncing CPU and GPU on alloc/free. This means
that when the CPU allocs/frees memory, the GPU might not have finished
its previous work on that memory, so the CPU and GPU can run asynchronously.
This is OK if there's only one stream, where the execution order
on the CPU and the GPU is consistent. For example, if we have two kernels
A and B, and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory, since computeA and computeB
will not race as long as they run in the same GPU compute
stream.
However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB,
the GPU may execute copyA after computeB on the same memory
if copy and compute happen in different GPU streams.
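The ordering problem described above can be sketched with a toy model in
plain Python. This is not the BFC arena or the CUDA runtime: Arena,
interleavings and gpu_orders are hypothetical names made up for this
illustration, and each stream is modeled only as a FIFO queue whose
execution may interleave freely with other queues.

```python
class Arena:
    """Toy arena: free() makes the block reusable at once, with no GPU sync."""

    def __init__(self):
        self.free_blocks = ["block0"]

    def alloc(self):
        return self.free_blocks.pop()

    def free(self, block):
        self.free_blocks.append(block)


def interleavings(streams):
    """All GPU execution orders that keep FIFO order within each stream."""
    streams = [s for s in streams if s]
    if not streams:
        return [[]]
    orders = []
    for i, s in enumerate(streams):
        rest = streams[:i] + [s[1:]] + streams[i + 1:]
        for tail in interleavings(rest):
            orders.append([s[0]] + tail)
    return orders


def gpu_orders(separate_copy_stream):
    """CPU runs allocA->copyA->freeA->allocB->computeB->freeB."""
    arena = Arena()
    compute = []
    # With a single stream, the copy is enqueued on the compute stream.
    copy = [] if separate_copy_stream else compute

    a = arena.alloc()
    copy.append(("copyA", a))
    arena.free(a)  # the CPU frees without waiting for the GPU

    b = arena.alloc()  # reuses the block A just released
    compute.append(("computeB", b))
    arena.free(b)

    streams = [copy, compute] if separate_copy_stream else [compute]
    return interleavings(streams)


if __name__ == "__main__":
    print("one stream: ", gpu_orders(False))
    print("two streams:", gpu_orders(True))
```

With one stream the only possible order is copyA then computeB. With two
streams the model also admits computeB before copyA on the same block,
which is the race.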
This change makes the copy run in the default compute stream, and adds
an option to fall back to the previous behavior if there's a perf hit. This
is a short-term fix until the BFC arena can support multiple streams.
Users may use the following options to revert to the previous behavior:
C API:
    struct OrtCUDAProviderOptions cudaProviderOpt;
    cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
    CUDAExecutionProviderInfo cudaEPInfo;
    cudaEPInfo.do_copy_in_default_stream = false;
C# API:
    pending...
Python:
    import onnxruntime
    onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)
* Confirmed the test fails in CI when doing the copy in a separate stream
Revert the test to get CI to pass for now
* Fix Windows test
* Address CR
* Update MaxBatchSize and include recompute mode
* Minor fix for frontend test
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
**Description**: Add missing gradient registration for the `Exp` op.
**Motivation and Context**
* Adding support for training a model that uses the `Exp` op.
Co-authored-by: Derek Murray <demurra@microsoft.com>
* t5 layer norm changes
* add t5 layer norm kernel
* use template for t5 layer norm
* template definition changes
* no build error
* add CPU cuda kernel
* first unit test
* other forward unit tests
* add T5LayerNormGrad
* Add c++ transform and test for T5 LN
* fix and some debug prints
* fix cuda error
* rename from t5 to simplified
* PR comments
* revert change on invertible LM code path
* remove duplicate forward computation
* add GradientCheckerTest.SimplifiedLayerNormGrad
* change back macro
* Fix SimplifiedLayerNorm Gradient
* merge with Sherlock's changes
* changed cuda kernel
* reapply cpu kernel changes
Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: aishwarya bhandare <aibhanda@microsoft.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Avoid inserting other CUDA calls in-between NCCL Send's and Recv's
* Add a comment
* Place CUDA EP on the right device
* Fix a warning
* Address a comment
* use run_orttraining_test_orttrainer_frontend_separately to work around a sporadic segfault.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* - Link with libatomic if needed
- Install pip differently so it doesn't clash with the system pip which may involve a wrapper script
- Remove ability to specify offset when Tensor allocates the data. The data prior to offset isn't accessible by anything.
- Fix use of offset in TensorOpTest to work on armv7 where it must be aligned to the type it points to.
- Fix ActivationOpNoInfTest.Softsign to allow for armv7 behavior
- Fix ReductionOpTest.ReduceMean_*keepdims to allow for armv7 floating point inaccuracy
* Address PR comments