Commit graph

149 commits

Author SHA1 Message Date
Tim Harris
5c6a27408a
Remove signed/unsigned compiler warnings, add additional pipeline test case (#4314)
* Avoid signed/unsigned warning on loops

* Report sizes when distributed world configuration is inconsistent

* Add DistributedRunContextTest for pipeline stage configuration
2020-06-24 11:36:18 +01:00
Vincent Wang
f26c149d7d
Set NonZero Output Shape for Gradient Building. (#4246)
* Set NonZero output shape for gradient building.

* Resolve comments.

Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-06-24 13:43:22 +08:00
Vincent Wang
3374733783
Refactor ReduceMean/Sum Gradient without Shape Dependency. (#4261)
* ReduceMean/Sum gradient without shape dependency.

* optimize expand and use it to replace add.

* Adjust test.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-06-24 11:36:53 +08:00
Bowen Bao
15cb4b3023
Fix session load state & run extra_postpasses only once (#4255)
* Fix session load state & run extra_postpasses only once

* add testcase for onnx model as well
2020-06-23 11:45:26 -07:00
Vincent Wang
b41fcf1570
Bugfix for shape inference and GetShape. (#4243)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-06-17 15:11:02 +08:00
Wei-Sheng Chin
189fb60ef9
Fix a bug and add code to profile memory (#4241)
* Fix a bug and add code to profile memory

1. Compile Send/Recv again (currently broken because of
   HOROVOD refactor).
2. Add code to print out initializer allocation size and
   activation memory size.

* Address comments

* Split memory counts per locations

* Fix a metric
2020-06-16 10:17:27 -07:00
edgchen1
63bf587623
Use azcopy to download test data (#4221)
Use azcopy from download_e2e_test_data.py, add helper function for downloading azcopy.
Update download_test_data.py to use helper function.
2020-06-16 10:14:34 -07:00
ytaous
5d28efd434
opset12 code cleanup (#4242)
* opset12 code cleanup

* opset12 code cleanup

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-15 19:45:35 -07:00
ytaous
e0334f177c
Opset12 upgrade for existing models used by perf/e2e pipelines (#4238)
* opset12 support

* opset12 support

* on comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-15 14:26:53 -07:00
Bowen Bao
b08771f00e
Add ONNX Training Post-Passes to Front-End - Cont (#4041)
* Add ONNX postpasses

* add flag + add bert test from onnx file

* address PR comments

* fix typo

* fix rebase

* address comments

* Fix test failures

* add new pass for expand for new pt version, add comments

* fix rebase

Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-15 10:33:26 -07:00
Weixing Zhang
b4b1c6440a
Enable ORT with CUDA 11 toolkit (#4168)
* ORT on CUDA 11

1. Separate HOROVOD and MPI
2. Separate NCCL from HOROVOD in CMakeLists.txt
3. Remove dependency on external cub
4. cudnnSetRNNDescriptor is changed in cuDNN 8.0

* polish the code about MPI/NCCL in CMakeLists.txt and build.py

* check CUDA version

* ${MPI_INCLUDE_DIRS} should be PUBLIC

* sm30, sm50 are deprecated in CUDA 11 Toolkit

* update change based on code review feedback.

* add sm_52

* improve MPI/NCCL build path

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-06-15 08:47:03 -07:00
Wei-Sheng Chin
ecc901717e
Use subset to release gradient tensors earlier (#4222)
2020-06-14 22:52:54 -07:00
Wei-Sheng Chin
de9da123cf
Enable static memory planning for pipeline. (#4204)
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
   the symbolic shapes can be resolved.

* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
   This is done by removing all but one gradient tensor from the last
   RecordEvent in the backward pass.

* Address a comment

* Fix Windows build
2020-06-12 21:43:50 -07:00
Edward Chen
6b4f652017
Clean up status checks in gradient_graph_builder_test.cc.
2020-06-12 14:28:39 -07:00
Edward Chen
7096e6f5ef
Reduce severity of GraphAugmenter logging statement.
2020-06-12 14:28:39 -07:00
pengwa
e6ccb1ac28
GatherNDGrad for CPU (#4123)
* GatherNDGrad on CPU

* Remove __CUDA_ARCH__ check in .cc files
2020-06-12 02:43:49 +08:00
Xueyun Zhu
65a682354b
enable pipeline to run with mixed precision (#4113)
* enable pipeline to run with mixed precision

* address feedback

* address feedback

* test log

* pipe information if test fails

* ci failure
2020-06-10 22:16:24 -07:00
suffiank
7f5339505e
Discover trainable parameters using reverse DFS from loss node (#4116)
Discover trainable parameters using reverse DFS from loss node, omitting recursion along untrainable inputs.
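
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the graph representation (a dict mapping each tensor name to the inputs of the node that produces it), the function name `find_trainable_params`, and all tensor names are hypothetical.

```python
from typing import Dict, List, Set


def find_trainable_params(
    producers: Dict[str, List[str]],
    initializers: Set[str],
    untrainable: Set[str],
    loss: str,
) -> Set[str]:
    """Collect initializers reachable by walking backward from `loss`.

    `producers` maps a tensor name to the input tensor names of the node
    that produces it (hypothetical representation). Tensors listed in
    `untrainable` are skipped, so nothing behind them is visited.
    """
    found: Set[str] = set()
    visited: Set[str] = set()
    stack = [loss]  # explicit stack instead of recursion
    while stack:
        name = stack.pop()
        if name in visited or name in untrainable:
            continue
        visited.add(name)
        if name in initializers:
            found.add(name)  # a weight the loss depends on
            continue
        # Walk to the inputs of the node that produced this tensor.
        stack.extend(producers.get(name, []))
    return found


# Toy graph: loss <- (logits, labels), logits <- (hidden, w2), ...
graph = {
    "loss": ["logits", "labels"],
    "logits": ["hidden", "w2"],
    "hidden": ["x", "w1"],
    "labels": ["w_label"],
}
result = find_trainable_params(
    producers=graph,
    initializers={"w1", "w2", "w_label", "w_unused"},
    untrainable={"labels"},
    loss="loss",
)
print(sorted(result))
```

In this toy graph, `w_unused` is never reached from the loss and `w_label` sits behind an input marked untrainable, so neither is reported as trainable.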

Co-authored-by: suffian khan <sukha@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: suffian khan <sukha@microsoft.com>
2020-06-08 14:16:10 -07:00
Sergii Dymchenko
653417ae4b
Fix scaler->scalar typo. (#4142)
2020-06-08 13:02:12 -07:00
Dmitri Smirnov
4e1dac67cd
Address memory leak and improve memory handling (#4124)
Fix memory leak when a Python list is passed as a feed.
  Create a custom allocator that can take ownership of Python
  arrays that are created inside pybind.
  Allow direct memory use if the contiguous array is a copy because
  the allocator can now take ownership of it.
2020-06-08 09:29:46 -07:00
liqunfu
ffed43e9b8
handle loss and name matching wrappers (#4066)
* handle loss and name matching wrappers

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-05 23:34:26 -07:00
Bowen Bao
1e5307d458
Bug fix for parameter names of models not using wrapper (#4061)
* bug fix for models not using wrapper

* add test case for no wrapper case

* update test case to use internal learning rate

* fix bug with frozen weight update
2020-06-05 12:03:38 -07:00
Thiago Crepaldi
81101c9efd
Fix DropoutGrad op (#4052)
Dropout op was recently changed to accept a new input named
'training_mode', which is passed in to DropoutGrad automatically.

This PR updates the DropoutGrad schema to accommodate the new input.
Tests were also updated to reflect the API change.

Co-authored-by: Thiago Crepaldi <thiag.crepaldi@microsoft.com>
2020-06-04 15:00:02 -07:00
liqunfu
905c535626
still need to make the test stable. Lower the acc number a bit to make the test pass for now (#4117)
Co-authored-by: liqun fu <liqun@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-06-02 21:37:48 -07:00
ashbhandare
f18a99b245
Exclude non-trainable torch buffers from trainable weights (#4099)
* Initial changes

* Removed redundant fix

* Revert unintended formatting change.

* Add unit test
2020-06-02 14:05:44 -07:00
edgchen1
ba74914c5a
Remove evaluation output from training e2e test baseline data. (#4092)
2020-06-01 15:06:21 -07:00
ytaous
72d508b7a0
New perf metric - e2e throughput (#4085)
* new metric

* on comments

* tab to spaces

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-06-01 12:11:34 -07:00
Tixxx
6404aba5ae
Orttraining rc1 master merge (#4080)
* fixed seg fault when using concrete shape
disable gradient as output

* fix evaluation hang issue for multiple gpu run

* Remove dead code, ORTModel and improve docstrings (#3814)

* Refine ORTTrainer docstring descriptions (#3907)
2020-05-29 12:28:12 -07:00
Wei-Sheng Chin
e951b29a0b
Fix a macro and memory regression (#4068)
onnxruntime_training_bert can run the following command again.

./onnxruntime_training_bert --model_name=bert-large-uncased_L_24_H_1024_A_16_V_30528_S_512_Dp_0.1_optimized_layer_norm --num_train_steps=16 --train_batch_size=52 --mode=train --train_data_dir=/bert_data/128/books_wiki_en_corpus/train --test_data_dir=/bert_data/128/books_wiki_en_corpus/test --gradient_accumulation_steps=16 --optimizer=Lamb --learning_rate=3e-3 --max_seq_length=128 --max_predictions_per_seq=20 --warmup_ratio=0.2843 --warmup_mode=Poly --display_loss_steps=100  --use_mixed_precision=True --allreduce_in_fp16 --use_nccl
2020-05-29 09:24:40 -07:00
edgchen1
38d76cc904
Clean up training E2E test (#4078)
Update training E2E build to not go through CTest and call test scripts directly.
2020-05-29 09:20:47 -07:00
pengwa
6d03470587
Add e2e measurement for training (#4049)
* add e2e measurement
2020-05-29 10:08:29 +08:00
liqunfu
6665d5e2bc
Liqun/a transformer example (#3845)
Add transformer glue test example to show how to use ORTTrainer to fine-tune a transformer model

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-05-27 15:21:35 -07:00
Xueyun Zhu
633008b5ef
Add pipeline online partition logic for pipeline (#3996)
* online partition

* fix when multiple consumer nodes are in cut info

* fix windows build

* address feedback

* adding test

* feedback

* address feedback

* add parser for cut edge

* windows build
2020-05-26 17:44:09 -07:00
Wei-Sheng Chin
24eda3df33
Create Utils for Adding Range and Marker (#4013)
In this PR, we
  1. create some APIs for creating NVTX objects
  2. apply those APIs in pipeline-related operators and sequential executor.
As a result, we can explicitly see how a pipeline schedule is run by GPUs in 
Nvidia's visual profiler. Note that these APIs are Linux only due to Nvidia's
limited support.
2020-05-24 22:55:24 -07:00
Bowen Bao
0a5395bb78
Remove 'model_.' prefix from onnx model initializers in training (#3881)
* Remove 'model_.' prefix for onnx model initializers in training

* fix test case remove redundant device test

* rename

* Fix state_dict/load_state_dict with frozen_weight

* nit

* Add monkey patch for pt opset 10

* remove pt patch in CI

* nit: newline
2020-05-20 10:06:31 -07:00
ytaous
fb4efafc8e
GPT-2 training perf scripts (#3974)
* gpt2 training perf

* gpt2 training perf

* debug

* debug

* debug

* fix bug

* minor

* on comments

* dynamic sql

* fix build

* minor

* linked hash

* on comments

* minor

* mem

* minor

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-05-19 10:21:40 -07:00
Faith Xu
b8a255e1b5
Doc Updates for Build (#3976)
* Initial update of readme

* Readme updates

* Review of consolidated README (#3930)

* Proposed updates for readme (#3953)

I found some of the information was duplicated within the doc, so attempted to streamline

* Fix links

* More updates

- fix build instructions
- nodejs doc reorganization
- roadmap update
- version fixes

* Update ORT Server build instructions

* More doc cleanup

* fix python dev notes name

* Update nodejs and some links

* sync eigen version back to master

* Minor fixes

* add nodejs to sample table of contents

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* address PR feedback

* address PR feedback

* nodejs build instruction

* Update Java instructions to include gradle

* Roadmap refresh

Reformat some data, fix link, minor rewording

* Clarify Visual C++ runtime req

Co-authored-by: Nat Kershaw (MSFT) <nakersha@microsoft.com>
Co-authored-by: Prasanth Pulavarthi <prasantp@microsoft.com>
Co-authored-by: manashgoswami <magoswam@microsoft.com>
2020-05-18 20:08:36 -07:00
M. Zeeshan Siddiqui
44731e88bb
Add comments for zero valued normalization factor in SoftmaxCrossEntropyLossGrad CUDA kernel. (#3972)
2020-05-18 09:08:09 -07:00
Wei-Sheng Chin
0d11649bb3
Address comments from #3823 and polish code (#3964)
* Address comments from #3823 and polish code

* One line
2020-05-17 14:08:33 -07:00
M. Zeeshan Siddiqui
a296b16719
Prevent divide by zero in CUDA implementation of SoftmaxCrossEntropyLossGrad. (#3962)
2020-05-16 00:33:25 -07:00
Wei-Sheng Chin
33208c9f6b
Modify Pipeline Facilities to Fix PipeDream Deadlock (#3823)
* Prepare utils for adding Wait's and Record's

* Have a running PipeDream

* Add comments

* Polish comments

* Clean code

* Fix test

* Polish names

* Polish names

* Remove debug headers

* Fix a shape inference bug (not related to pipeline code)

* Fix a warning

* Address some comments

* Address comments

* Only touch consumers of outputs when re-wire edges
2020-05-15 18:27:19 -07:00
ytaous
bc441b7e5c
Add cpu/mem usage for perf metrics (#3947)
* add cpu/mem usage

* on comments

* on comments

* renaming

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-05-15 12:29:40 -07:00
ytaous
93eb9bcfde
Add yaml/perf scripts for new perf test pipeline (#3909)
* yaml/perf scripts for new pipeline

* yaml/perf scripts for new pipeline

* remove unused imports

* testing some comments change

* testing some comments change

* testing jdbc

* testing jdbc

* testing jdbc

* exclude pwd from jdbc properties

* exclude pwd from jdbc properties

* namedtuple

* on comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-05-13 14:15:17 -07:00
Bowen Bao
0f82b42fed
Ensure pt model is set to cpu in ort_trainer (#3867)
* Ensure pt model is set to cpu in ort_trainer

* add note comment
2020-05-12 13:32:27 -07:00
Thiago Crepaldi
70abb120b3
Remove ORTModel from frontend API (#3825)
* Resolve conflict

* Address review
2020-05-11 18:20:33 -07:00
M. Zeeshan Siddiqui
c46a9e8d65
Add numerical stability to SoftmaxGrad test inputs. (#3857)
* Increase the tolerance for SoftmaxGrad CPU-GPU compare tests.

* Increase the tolerance for SoftmaxGrad CPU-GPU compare tests.

* Add 1e-2 to Y for numerical stability.

* build break.

* comments.

* PR feedback.

* PR feedback.
2020-05-11 17:59:24 -07:00
ytaous
96030fdcbc
dashboard integration - output training perf metrics as json (#3809)
* dashboard integration - first phase

* change a field

* perf scripts

* addressing PR comments

* address comments and fix build

* minor

* make GetConfigFromData() const

* more update for comments

* addressing comments

* more on addressing comments

* minor

* fix build

* add condition check

* more on comments

* return status

* remove batch size

* on comments

* rename pkg path

* rename pkg path

* additional comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-05-10 10:29:38 -07:00
M. Zeeshan Siddiqui
eb33d5eda9
Do not register Dropout(12) as training ONLY kernel. (#3859)
* Do not register Dropout(12) as training ONLY kernel.

* Move Dropout forward implementation in inference project.

* fix inference build test failures.

* remove fp16 test since its support is absent on CPU.

* build break.
2020-05-09 21:38:17 -07:00
Vincent Wang
3c24841569
Fold Shape Node During Constant Folding (#3748)
* Fold Shape node in constant folding.

* bugfix

* Fix test failure.

* Bugfix for C++ frontend.

* Bugfix for C++ frontend.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-05-09 20:15:03 +08:00
ashbhandare
424a00bf04
Fix enabling gradient as output for easy mode. (#3866)
2020-05-07 15:07:14 -07:00