- make size_ and data_ data members private
- rename GetCapacity() to Capacity() to be consistent (e.g., with Size())
- add static_assert for trivially copyable T because it is copied with memcpy
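
The three changes above can be sketched as follows. This is a minimal illustration, not the class touched by this commit: `SimpleBuffer`, `Append()` and `Data()` are hypothetical names, and only `Size()`, `Capacity()`, `size_`, `data_` and the `memcpy`-based copy come from the notes above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Hypothetical buffer type used only to illustrate the notes above.
template <typename T>
class SimpleBuffer {
  // memcpy is only well-defined for trivially copyable types,
  // so reject anything else at compile time.
  static_assert(std::is_trivially_copyable<T>::value,
                "T is copied with memcpy and must be trivially copyable");

 public:
  explicit SimpleBuffer(std::size_t capacity)
      : capacity_(capacity), data_(new T[capacity]) {}

  // Accessor names are consistent with each other: Size() and Capacity().
  std::size_t Size() const { return size_; }
  std::size_t Capacity() const { return capacity_; }
  const T* Data() const { return data_.get(); }

  // Appends `count` elements from `src`; returns false if they do not fit.
  bool Append(const T* src, std::size_t count) {
    if (count > capacity_ - size_) return false;
    if (count == 0) return true;
    assert(src != nullptr);
    std::memcpy(data_.get() + size_, src, count * sizeof(T));
    size_ += count;
    return true;
  }

 private:
  // Private data members: callers go through the accessors above.
  std::size_t capacity_;
  std::size_t size_ = 0;
  std::unique_ptr<T[]> data_;
};
```

With this sketch, instantiating `SimpleBuffer<std::string>` fails to compile with the `static_assert` message instead of silently corrupting memory at run time.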
* Draft for LayerNorm Optimization
* Modify LayernormGrad kernel based on new backward graph.
* keep two LayernormGrad implementations.
One is implemented based on input X, mean. The other is based on output Y, scale, bias. The first one is enabled by default. The second one can be enabled by --use_invertible_layernorm_grad
* expose use_invertible_layernorm_grad to frontend.
* add fp16 tests.
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
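
A sketch of why a LayerNorm gradient can work from the output Y, scale and bias instead of the input X and mean. It assumes the usual definition Y = x_hat * scale + bias, so the normalized input can be recovered as x_hat = (Y - bias) / scale when every scale element is nonzero. The function names are hypothetical, and this is CPU reference code, not the CUDA kernel from this change.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: x_hat = (x - mean) / sqrt(var + eps) for one row.
std::vector<float> Normalize(const std::vector<float>& x, float eps) {
  assert(!x.empty());
  float mean = 0.0f;
  for (float v : x) mean += v;
  mean /= static_cast<float>(x.size());
  float var = 0.0f;
  for (float v : x) var += (v - mean) * (v - mean);
  var /= static_cast<float>(x.size());
  const float inv_std = 1.0f / std::sqrt(var + eps);
  std::vector<float> x_hat(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) x_hat[i] = (x[i] - mean) * inv_std;
  return x_hat;
}

// Forward output under the assumed definition: Y = x_hat * scale + bias.
std::vector<float> ScaleAndShift(const std::vector<float>& x_hat,
                                 const std::vector<float>& scale,
                                 const std::vector<float>& bias) {
  assert(x_hat.size() == scale.size() && x_hat.size() == bias.size());
  std::vector<float> y(x_hat.size());
  for (std::size_t i = 0; i < y.size(); ++i) y[i] = x_hat[i] * scale[i] + bias[i];
  return y;
}

// Inverts the affine step, so X and mean need not be kept for the backward pass.
// Assumes scale[i] != 0 for every i.
std::vector<float> RecoverNormalized(const std::vector<float>& y,
                                     const std::vector<float>& scale,
                                     const std::vector<float>& bias) {
  assert(y.size() == scale.size() && y.size() == bias.size());
  std::vector<float> x_hat(y.size());
  for (std::size_t i = 0; i < y.size(); ++i) x_hat[i] = (y[i] - bias[i]) / scale[i];
  return x_hat;
}
```

The trade-off in this sketch is that the Y-based path avoids keeping X alive for the backward pass but breaks down when a scale element is zero or very small, which may be one reason to keep the X-based implementation as the default.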
For BiasGeluGradDxKernel:
- Implement optimization to first load from global memory into registers as suggested by Weixing.
- Support larger bias sizes which were previously limited by the number of threads per block.
- Address flaky unit test by increasing the error tolerance to the default value.
* Keep loss subgraph as FP32 during mixed-precision training.
* Fix case where there is no white-list loss op.
* Get nodes from loss_scale instead of whitelist.
* rename const variables.
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Implement BiasDropout Fusion and Kernel
Dropout kernel for residual input
BiasDropout Fusion to take residual input
Fix BiasDropout Kernel
Optimize DropoutGrad with 4 elements per thread
* Add graph transformer UT
* MLTypeCallDispatcher for RatioData
* Use MLTypeDispatcher for ratio tensor
* Handle training_mode input for BiasDropout fusion
* Add test case for missing ratio input
* Replace using FinalizeNodeFusion
* Make BiasDropout kernel template-less
* Make DropoutGrad template-less
* Make Dropout and TrainableDropout template-less
* Regenerate onnx file for UT
* Minor fix on divmod in BiasDropoutKernel
* Adjust pt frontend test due to dropout randomness
* Make dropout kernel operate in fp32
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Update function body initialization
* minor fix
* changes per review comments
* minor fix
* format fix
* add function initialization in mixed precision transformer
* more updates
* more fixes
* Move allocators to SessionState so they're decoupled from ExecutionProviders
- when looking up an allocator it's based on OrtMemoryInfo, not the EP, so SessionState is a more natural place for that information to be stored
- add device based lookup
- simplifies logic for copying feeds/fetches across devices
Cleanup SessionState and SessionStateInitializer
- provide more things to SessionState at construction time so we don't construct an instance and then immediately call a bunch of setters
- simplify SessionStateInitializer
- reduced down to FinalizeSessionState method
* Rename partition_optimizer -> deepspeed_zero
* Use ZeROConfig in orttraining_pybind_state.cc
* deepspeed_zero -> deepspeed_zero_stage for clarity
* Expose as deepspeed_zero_stage in pybind
* Avoid signed/unsigned warning on loops
* Report sizes when distributed world configuration is inconsistent
* Add DistributedRunContextTest for pipeline stage configuration
* ReduceMean/Sum gradient without shape dependency.
* optimize expand and use it to replace add.
* Adjust test.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Fix a bug and add code to profile memory
1. Compile Send/Recv again (currently broken because of
HOROVOD refactor).
2. Add code to print out initializer allocation size and
activation memory size.
* Address comments
* Split memory counts per locations
* Fix a metric
* Add ONNX postpasses
* add flag + add bert test from onnx file
* address PR comments
* fix typo
* fix rebase
* address comments
* Fix test failures
* add new pass for expand for new pt version, add comments
* fix rebase
Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ORT on CUDA 11
1. Separate HOROVOD and MPI
2. Separate NCCL from HOROVOD in CMakeLists.txt
3. Remove dependency on external cub
4. cudnnSetRNNDescriptor is changed in cuDNN 8.0
* polish the code about MPI/NCCL in CMakeLists.txt and build.py
* check CUDA version
* ${MPI_INCLUDE_DIRS} should be PUBLIC
* sm30, sm50 are deprecated in CUDA 11 Toolkit
* update change based on code review feedback.
* add sm_52
* improve MPI/NCCL build path
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
the symbolic shapes can be resolved.
* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
This is done by removing all but one gradient tensor from the last
RecordEvent in the backward pass.
* Address a comment
* Fix Windows build
Fix memory leak when a Python list is passed as a feed.
Create a custom allocator that can take ownership of python
arrays that are created inside pybind.
Allow direct memory use if the contiguous array is a copy because
we now can take ownership of it by the allocator.
* bug fix for models not using wrapper
* add test case for no wrapper case
* update test case to use internal learning rate
* fix bug with frozen weight update
Dropout op was recently changed to accept a new input named
'training_mode', which is passed in to DropoutGrad automatically.
This PR updates the DropoutGrad schema to accommodate the new input.
Tests were also updated to reflect the API change.
Co-authored-by: Thiago Crepaldi <thiag.crepaldi@microsoft.com>
* fixed seg fault when using concrete shape
disable gradient as output
* fix evaluation hang issue for multiple gpu run
* Remove dead code, ORTModel and improve docstrings (#3814)
* Refine ORTTrainer docstring descriptions (#3907)