* Avoid signed/unsigned warning on loops
* Report sizes when distributed world configuration is inconsistent
* Add DistributedRunContextTest for pipeline stage configuration
* ReduceMean/Sum gradient without shape dependency.
* optimize expand and use it to replace add.
* Adjust test.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Fix a bug and add code to profile memory
1. Compile Send/Recv again (currently broken because of
HOROVOD refactor).
2. Add code to print out initializer allocation size and
activation memory size.
* Address comments
* Split memory counts per location
* Fix a metric
* Add ONNX postpasses
* add flag + add bert test from onnx file
* address PR comments
* fix typo
* fix rebase
* address comments
* Fix test failures
* add new pass for expand for new pt version, add comments
* fix rebase
Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ORT on CUDA 11
1. Separate HOROVOD and MPI
2. Separate NCCL from HOROVOD in CMakeLists.txt
3. Remove dependency on external cub
4. cudnnSetRNNDescriptor is changed in cuDNN 8.0
* polish the code about MPI/NCCL in CMakeLists.txt and build.py
* check CUDA version
* ${MPI_INCLUDE_DIRS} should be PUBLIC
* sm30, sm50 are deprecated in CUDA 11 Toolkit
* update change based on code review feedback.
* add sm_52
* improve MPI/NCCL build path
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
the symbolic shapes can be resolved.
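A minimal sketch of why every stage needs the original inputs: a symbolic dimension can only be turned into a concrete size once the input shape that binds it is known. The function names, the dimension names (`batch`, `seq`), and the shape representation are hypothetical illustrations, not ONNX Runtime's implementation.

```python
from math import prod


def bind_symbols(symbolic_shapes, concrete_shapes):
    """Map each symbolic dimension name to the size observed in the inputs."""
    bindings = {}
    for name, sym_shape in symbolic_shapes.items():
        for dim, size in zip(sym_shape, concrete_shapes[name]):
            if isinstance(dim, str):
                # setdefault keeps the first binding; a later mismatch is an error.
                if bindings.setdefault(dim, size) != size:
                    raise ValueError(f"conflicting sizes for '{dim}'")
    return bindings


def resolve_element_count(sym_shape, bindings):
    """Return the element count of a shape after resolving symbolic dims."""
    dims = [bindings[d] if isinstance(d, str) else d for d in sym_shape]
    # prod of an empty sequence is 1, so a scalar (shape []) is one element.
    return prod(dims)


bindings = bind_symbols({"input_ids": ["batch", "seq"]}, {"input_ids": (8, 128)})
activation_elems = resolve_element_count(["batch", "seq", 768], bindings)
scalar_elems = resolve_element_count([], bindings)
```

In this sketch a stage that never sees `input_ids` would have no binding for `batch` or `seq`, so its buffer sizes could not be computed ahead of time. The scalar case is handled by treating an empty shape as a single element.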
* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
This is done by removing all but one gradient tensor from the last
RecordEvent in the backward pass.
* Address a comment
* Fix Windows build
Fix memory leak when a Python list is passed as a feed.
Create a custom allocator that can take ownership of Python
arrays that are created inside pybind.
Allow direct memory use if the contiguous array is a copy, because
the allocator can now take ownership of it.
* bug fix for models not using wrapper
* add test case for no wrapper case
* update test case to use internal learning rate
* fix bug with frozen weight update
Dropout op was recently changed to accept a new input named
'training_mode', which is passed in to DropoutGrad automatically.
This PR updates the DropoutGrad schema to accommodate the new input.
Tests were also updated to reflect the API change.
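To illustrate why the gradient needs to know about 'training_mode', here is a conceptual sketch of a dropout gradient over flat lists. The function name, argument order, and defaults are assumptions for illustration only and do not reflect the actual DropoutGrad schema.

```python
def dropout_grad(dy, mask, ratio=0.5, training_mode=True):
    """Gradient of dropout with respect to its input (conceptual sketch)."""
    if not training_mode or ratio == 0.0:
        # Outside training mode dropout is the identity, so dy passes through.
        return list(dy)
    if ratio >= 1.0:
        # Everything was dropped, so no gradient flows back.
        return [0.0 for _ in dy]
    scale = 1.0 / (1.0 - ratio)
    # Kept elements were scaled by 1 / (1 - ratio) in the forward pass.
    return [g * scale if keep else 0.0 for g, keep in zip(dy, mask)]


train_grad = dropout_grad([1.0, 2.0], [True, False], ratio=0.5)
eval_grad = dropout_grad([1.0, 2.0], [True, False], ratio=0.5, training_mode=False)
```

Without the flag, the gradient would apply the mask and scaling even when the forward pass had behaved as an identity.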
Co-authored-by: Thiago Crepaldi <thiag.crepaldi@microsoft.com>
* fixed seg fault when using concrete shape
disable gradient as output
* fix evaluation hang issue for multiple gpu run
* Remove dead code, ORTModel and improve docstrings (#3814)
* Refine ORTTrainer docstring descriptions (#3907)
Add transformer glue test example to show how to use ORTTrainer to fine-tune a transformer model
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* online partition
* fix when multiple consumer nodes are in cut info
* fix windows build
* address feedback
* adding test
* feedback
* address feedback
* add parser for cut edge
* windows build
In this PR, we
1. create some APIs for creating NVTX objects
2. apply those APIs in pipeline-related operators and sequential executor.
As a result, we can explicitly see how a pipeline schedule is run by GPUs in
Nvidia's visual profiler. Note that these APIs are Linux only due to Nvidia's
limited support.
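The idea of wrapping operator execution in named profiler ranges can be sketched as below. `RangeRecorder` and `profile_range` are hypothetical stand-ins: the recorder only logs events to a list, where a real backend would issue NVTX range push and pop calls. The labels are made up for the example.

```python
from contextlib import contextmanager


class RangeRecorder:
    """Stand-in profiler backend that logs range events instead of calling NVTX."""

    def __init__(self):
        self.events = []
        self._depth = 0

    def push(self, label):
        self.events.append(("push", self._depth, label))
        self._depth += 1

    def pop(self):
        self._depth -= 1
        self.events.append(("pop", self._depth, None))


@contextmanager
def profile_range(recorder, label):
    """Open a named range and always close it, even if the body raises."""
    recorder.push(label)
    try:
        yield
    finally:
        recorder.pop()


recorder = RangeRecorder()
with profile_range(recorder, "stage0/forward"):
    with profile_range(recorder, "Send"):
        pass  # the operator's work would run here
```

Closing the range in a `finally` block keeps pushes and pops balanced, which matters because an unbalanced range would distort the timeline shown in the profiler.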
* Remove 'model_.' prefix for onnx model initializers in training
* fix test case, remove redundant device test
* rename
* Fix state_dict/load_state_dict with frozen_weight
* nit
* Add monkey patch for pt opset 10
* remove pt patch in CI
* nit: newline
* gpt2 training perf
* gpt2 training perf
* debug
* debug
* debug
* fix bug
* minor
* on comments
* dynamic sql
* fix build
* minor
* linked hash
* on comments
* minor
* mem
* minor
Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Initial update of readme
* Readme updates
* Review of consolidated README (#3930)
* Proposed updates for readme (#3953)
I found some of the information was duplicated within the doc, so attempted to streamline
* Fix links
* More updates
- fix build instructions
- nodejs doc reorganization
- roadmap update
- version fixes
* Update ORT Server build instructions
* More doc cleanup
* fix python dev notes name
* Update nodejs and some links
* sync eigen version back to master
* Minor fixes
* add nodejs to sample table of contents
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* address PR feedback
* address PR feedback
* nodejs build instruction
* Update Java instructions to include gradle
* Roadmap refresh
Reformat some data, fix link, minor rewording
* Clarify Visual C++ runtime req
Co-authored-by: Nat Kershaw (MSFT) <nakersha@microsoft.com>
Co-authored-by: Prasanth Pulavarthi <prasantp@microsoft.com>
Co-authored-by: manashgoswami <magoswam@microsoft.com>
* dashboard integration - first phase
* change a field
* perf scripts
* addressing PR comments
* address comments and fix build
* minor
* make GetConfigFromData() const
* more update for comments
* addressing comments
* more on addressing comments
* minor
* fix build
* add condition check
* more on comments
* return status
* remove batch size
* on comments
* rename pkg path
* rename pkg path
* additional comments
Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Do not register Dropout(12) as training ONLY kernel.
* Move Dropout forward implementation in inference project.
* fix inference build test failures.
* remove fp16 test since its support is absent on CPU.
* build break.
* Fold Shape node in constant folding.
* bugfix
* Fix test failure.
* Bugfix for C++ frontend.
* Bugfix for C++ frontend.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>