Commit graph

171 commits

Author SHA1 Message Date
liqunfu
0bff55512e
updated expected values for frontend test to pass frontend e2e pipeline. raise tolerance to reduce future risk of failure (#4497)
* updated expected values for frontend test, raise tol
2020-07-13 19:25:54 -07:00
edgchen1
c71c49aaa0
Make TArray safer to use and update method name for consistency. (#4483)
- make size_ and data_ data members private
- rename GetCapacity() to Capacity() to be consistent (e.g., with Size())
- add static_assert for trivially copyable T because it is copied with memcpy
2020-07-13 09:59:56 -07:00
Vincent Wang
7fb194d03d
Update convergence baseline for ci_test. (#4465)
Co-authored-by: Vincent Wang <weicwang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-07-09 15:29:36 +08:00
Tixxx
b156ae4448
Support training_mode flag in eval (#4324)
* add training_mode feed for evaluation to support opset12
2020-07-08 10:38:54 -07:00
Hariharan Seshadri
6d6b6b54a5
Support binding a graph output to a specific device via the Python binding (#4439) 2020-07-07 21:09:37 -07:00
Ashwini Khade
dd73e8c016
add function initialization back to graph resolve (#4434) 2020-07-06 15:17:27 -07:00
liqunfu
0fdb1e9f60
Liqun/roberta (#4408)
add GLUE Roberta example, fix unused initializer issue at backend. Bert GLUE expected out updated due to graph changes between June29 to July1st
2020-07-06 09:19:30 -07:00
pengwa
8bcdefc9c1
Optimize GatherND (#4097)
* Optimize GatherND
* Refine the code, Fix few comments
2020-07-03 19:42:32 +08:00
Weixing Zhang
bd11ab6816
Optimize LayernormGrad (#4156)
* Draft for LayerNorm Optimization

* Modify LayernormGrad kernel based on new backward graph.

* keep two LayernormGrad implementations.

One is implemented based on input X, mean. The other is based on output Y, scale, bias. The first one is enabled by default. The second one can be enabled by --use_invertible_layernorm_grad

* expose use_invertible_layernorm_grad to frontend.

* add fp16 tests.

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-07-02 22:09:30 -07:00
edgchen1
dba22b17b4
Update BiasGeluGradDxKernel and tests. (#4400)
For BiasGeluGradDxKernel:
- Implement optimization to first load from global memory into registers as suggested by Weixing.
- Support larger bias sizes which were previously limited by the number of threads per block.
- Address flaky unit test by increasing the error tolerance to the default value.
2020-07-02 18:55:44 -07:00
Vincent Wang
28e4c0edf5
Keep loss_scale and Whole Loss Subgraph in FP32 during Mixed Precision Training (#4268)
* Keep loss subgraph as FP32 when mixed-p training.

* Fix case where there is no white-list loss op.

* Get nodes from loss_scale instead of whitelist.

* rename const variables.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-07-03 06:54:56 +08:00
Sherlock
2d54c89d77
Update filename and Cleanup unused cudnn kernels (#4387)
* Update filename and Cleanup unused cudnn kernels

* Cleanup unnecessary dependency
2020-07-01 17:19:49 -07:00
Bowen Bao
7ec9a73202
deprecate frontend layernorm postpass (#4372) 2020-07-01 13:06:03 -07:00
liqunfu
5dcb9b4858
Liqun/backprop deterministic graph (#4315)
make gradient graph deterministic
add to session option use_deterministic_compute.
2020-07-01 12:39:10 -07:00
Sherlock
6365760906
BiasDropoutFusion (#4167)
* Implement BiasDropout Fusion and Kernel

Dropout kernel for residual input

BiasDropout Fusion to take residual input

Fix BiasDropout Kernel

Optimize DropoutGrad with 4 elements per thread

* Add graph transformer UT

* MLTypeCallDispatcher for RatioData

* Use MLTypeDispatcher for ratio tensor

* Handle traing_mode input for BiasDropout fusion

* Add test case for missing ratio input

* Replace using FinalizeNodeFusion

* Make BiasDropout kernel template-less

* Make DropoutGrad template-less

* Make Dropout and TrainableDropout template-less

* Regenerate onnx file for UT

* Minior fix on divmod in BiasDropoutKernel

* Adjust pt frontend test due to dropout randomnesss

* Make dropout kernel opeartion in fp32

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-30 15:43:14 -07:00
Ashwini Khade
0404763f23
Update function body initialization for ONNX functions (#4332)
* Update function body initialization

* minor fix

* changes per review comments

* minor fix

* format fix

* add function initialization in mixed precision transformer

* more updates

* more fixes
2020-06-30 14:30:59 -07:00
ytaous
4380b8ba68
adjust bs size (#4375)
Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-30 10:29:48 -07:00
Scott McKay
274e6b4153
Cleanup SessionState. Move allocator lookup to SessionState. (#4194)
* Move allocators to SessionState so they're decoupled from ExecutionProviders
  - when looking up an allocator it's based on OrtMemoryInfo not the EP so SessionState is a more natural place for that infromation to be stored
  - add device based lookup
    - simplifies logic for copying feeds/fetches across devices
Cleanup SessionState and SessionStateInitializer
  - provide more things to SessionState at construction time so we don't construct and instance and immediately after call a bunch of setters
  - simplify SessionStateInitializer
    - reduced down to FinalizeSessionState method
2020-06-28 14:55:42 +10:00
liqunfu
c3c4ce5ceb
refactor prototypes into headers (#4337)
* refactor prototypes into headers
2020-06-26 12:02:14 -07:00
edgchen1
0b450dcd9f
Enable BiasGelu fusion for training (#4146)
Add gradient for BiasGelu and FastGelu with bias.
Enable BiasGeluFusion and GeluApproximation transformers in TrainingSession.
2020-06-25 17:48:12 -07:00
edgchen1
a6d10376df
Fix build error when USE_NCCL is defined. (#4334) 2020-06-24 23:32:31 -07:00
Tim Harris
a241eb0bbe
Renaming --partition_optimizer to --deepspeed_zero_stage (#4312)
* Rename partition_optimizer -> deepspeed_zero

* Use ZeROConfig in orttraining_pybind_state.cc

* deepspeed_zero -> deepspeed_zero_stage for clarity

* Expose as deepspeed_zero_stage in pybind
2020-06-24 22:05:03 +01:00
Tim Harris
5c6a27408a
Remove signed/unsigned compiler warnings, add additional pipeline test case (#4314)
* Avoid signed/unsigned warning on loops

* Report sizes when distributed world configuration is inconsistent

* Add DistributedRunContextTest for pipeline stage configuration
2020-06-24 11:36:18 +01:00
Vincent Wang
f26c149d7d
Set NonZero Output Shape for Gradient Building. (#4246)
* Set NonZero output shape for gradient building.

* Resolve comments.

Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-06-24 13:43:22 +08:00
Vincent Wang
3374733783
Refactor ReduceMean/Sum Gradient without Shape Dependency. (#4261)
* ReduceMean/Sum gradient without shape dependency.

* optimize expand and use it to replace add.

* Adjust test.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-06-24 11:36:53 +08:00
Bowen Bao
15cb4b3023
Fix session load state & run extra_postpasses only once (#4255)
* Fix session load state & run extra_postpasses only once

* add testcase for onnx model as well
2020-06-23 11:45:26 -07:00
Vincent Wang
b41fcf1570
Bugfix for shape inference and GetShape. (#4243)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-06-17 15:11:02 +08:00
Wei-Sheng Chin
189fb60ef9
Fix a bug and add code to profile memory (#4241)
* Fix a bug and add code to profile memory

1. Compile Send/Recv again (currently broken because of
   HOROVOD refactor).
2. Add code to print out initializer allocation size and
   activation memory size.

* Address comments

* Split memory counts per locations

* Fix a metric
2020-06-16 10:17:27 -07:00
edgchen1
63bf587623
Use azcopy to download test data (#4221)
Use azcopy from download_e2e_test_data.py, add helper function for downloading azcopy.
Update download_test_data.py to use helper function.
2020-06-16 10:14:34 -07:00
ytaous
5d28efd434
opset12 code cleanup (#4242)
* opset12 code cleanup

* opset12 code cleanup

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-15 19:45:35 -07:00
ytaous
e0334f177c
Opset12 upgrade for existing models used by perf/e2e pipelines (#4238)
* opset12 support

* opset12 support

* on comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-15 14:26:53 -07:00
Bowen Bao
b08771f00e
Add ONNX Training Post-Passes to Front-End - Cont (#4041)
* Add ONNX postpasses

* add flag + add bert test from onnx file

* address PR comments

* fix typo

* fix rebase

* address comments

* Fix test failures

* add new pass for expand for new pt version, add comments

* fix rebase

Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-15 10:33:26 -07:00
Weixing Zhang
b4b1c6440a
Enable ORT with CUDA 11 toolkit (#4168)
* ORT on CUDA 11

1. Seperate HOROVOD and MPI
2. Seperate NCCL from HOROVOD in CMakeLists.txt
2. Remove dependency on external cub
3. cudnnSetRNNDescriptor is changed in cuDNN 8.0

* polish the code about MPI/NCCL in CMakeLists.txt and build.py

* check CUDA version

* ${MPI_INCLUDE_DIRS} should be PUBLIC

* sm30, sm50 are deprecated in CUDA 11 Toolkit

* update change based on code review feedback.

* add sm_52

* improve MPI/NCCL build path

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-06-15 08:47:03 -07:00
Wei-Sheng Chin
ecc901717e
Use subset to release gradient tensors earlier (#4222) 2020-06-14 22:52:54 -07:00
Wei-Sheng Chin
de9da123cf
Enable static memory planning for pipeline. (#4204)
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
   the symbolic shapes can be resolved.

* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
   This is done by removing all but one gradient tensor from the last
   RecordEvent in the backward pass.

* Address a comment

* Fix Windows build
2020-06-12 21:43:50 -07:00
Edward Chen
6b4f652017 Clean up status checks in gradient_graph_builder_test.cc. 2020-06-12 14:28:39 -07:00
Edward Chen
7096e6f5ef Reduce severity of GraphAugmenter logging statement. 2020-06-12 14:28:39 -07:00
pengwa
e6ccb1ac28
GatherNDGrad for CPU (#4123)
* GatherNDGrad on CPU

* Remove __CUDA_ARCH__ check in .cc files
2020-06-12 02:43:49 +08:00
Xueyun Zhu
65a682354b
enable pipeline to run with mixed precision (#4113)
* enable pipeline to run with mixed precision

* address feedback

* address feedback

* test log

* pipe infomation if test fails

* ci failure
2020-06-10 22:16:24 -07:00
suffiank
7f5339505e
Discover trainable parameters using reverse DFS from loss node (#4116)
Discover trainable parameters using reverse DFS from loss node, omitting recursion along untrainable inputs.

Co-authored-by: suffian khan <sukha@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: suffian khan <sukha@microsoft.com>
2020-06-08 14:16:10 -07:00
Sergii Dymchenko
653417ae4b
Fix scaler->scalar typo. (#4142) 2020-06-08 13:02:12 -07:00
Dmitri Smirnov
4e1dac67cd
Address memory leak and improve memory handling (#4124)
Fix memory leak when a Python list passed as a feed.
  Create a custom allocator that can take ownership of python
  arrays that are created inside pybind.
  Allow direct memory use if continuous array is a copy because
  we now can take ownership of it by the allocator.
2020-06-08 09:29:46 -07:00
liqunfu
ffed43e9b8
handle loss and name marching wrappers (#4066)
* handle loss and name marching wrappers

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-05 23:34:26 -07:00
Bowen Bao
1e5307d458
Bug fix for parameter names of models not using wrapper (#4061)
* bug fix for models not using wrapper

* add test case for no wrapper case

* update test case to use internal learning rate

* fix bug with frozen weight update
2020-06-05 12:03:38 -07:00
Thiago Crepaldi
81101c9efd
Fix DropoutGrad op (#4052)
Dropout op was recently changed to accept a new input named
'training_mode', which is passed in to DropoutGrad automatically.

This PR updates the DropoutGrad schema to accommodate the new input.
Tests were also update to reflect the API change

Co-authored-by: Thiago Crepaldi <thiag.crepaldi@microsoft.com>
2020-06-04 15:00:02 -07:00
liqunfu
905c535626
still need to make the test stable. Lower the acc number a bit to make the test pass for now (#4117)
Co-authored-by: liqun fu <liqun@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-02 21:37:48 -07:00
ashbhandare
f18a99b245
Exclude non-trainable torch buffers from trainable weights (#4099)
* Initial changes

* Removed redundant fix

* Revert unintended formatting change.

* Add unit test
2020-06-02 14:05:44 -07:00
edgchen1
ba74914c5a
Remove evaluation output from training e2e test baseline data. (#4092) 2020-06-01 15:06:21 -07:00
ytaous
72d508b7a0
New perf metric - e2e throughput (#4085)
* new metric

* on comments

* tab to spaces

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-01 12:11:34 -07:00
Tixxx
6404aba5ae
Orttraining rc1 master merge (#4080)
* fixed seg fault when using concrete shape
disable gradient as output

* fix evaluation hang issue for multiple gpu run

* Remove dead code, ORTModel and improve docstrings (#3814)

* Refine ORTTrainer docstring descriptions (#3907)
2020-05-29 12:28:12 -07:00