Commit graph

50 commits

Author SHA1 Message Date
edgchen1
2cb8cb816f
Disable or update flaky tests, improve test random seed accessibility. (#3495)
- Add output of test random seed
- Allow setting of test random seed with environment variable
- Disable / relax tolerance for flaky tests
2020-04-17 15:57:32 -07:00
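
Note on the commit above: printing the test seed and allowing it to be overridden from the environment is a common way to make a flaky, randomized test reproducible. A minimal sketch of that pattern follows. The variable name EXAMPLE_TEST_RANDOM_SEED and both helper functions are hypothetical and chosen only for illustration; they are not taken from the commit.

```python
import os
import random
import time

# Hypothetical variable name, used only in this sketch.
SEED_ENV_VAR = "EXAMPLE_TEST_RANDOM_SEED"


def get_test_random_seed(environ=os.environ):
    """Return the seed from the environment if set, else a time-based seed."""
    raw = environ.get(SEED_ENV_VAR)
    if raw is not None:
        return int(raw)
    # Fall back to a 32-bit value derived from the current time.
    return int(time.time()) & 0xFFFFFFFF


def make_test_rng(environ=os.environ):
    """Create a random generator and print its seed so a run can be replayed."""
    seed = get_test_random_seed(environ)
    print(f"test random seed: {seed}")
    return random.Random(seed)
```

Rerunning a failed test with the printed value exported in the environment would then replay the same random sequence.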
manashgoswami
9fc2b6482b
Ort training README (#3404)
Added README for ORT Training
2020-04-16 14:51:33 -07:00
M. Zeeshan Siddiqui
6c1ccb659f
SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. (#3465)
* Update ONNX submodule commit to the latest.

* build break.

* SoftmaxCrossEntropyLoss: Forward and backward kernel implementation.

* Revert "build break."

This reverts commit 847cb50d294efbe6c09fa760e7cacf25bfb6146d.

* Add more tests and misc clean up.

* revert unintended changes.

* PR feedback.

* cleanup.

* PR feedback.
2020-04-16 12:27:07 -07:00
Jesse Benson
644bc05830 Add Python API to set random seed: onnxruntime.seed(<seed>) 2020-04-15 09:44:48 -07:00
pengwa
2c7c45076b
MaxBatchSize E2E Test (#3454)
* max batch size e2e test

* update test data snapshot
2020-04-15 09:50:44 +08:00
edgchen1
4fa88a0a23
Remove cast to OpKernelContextInternal to get threadpool and directly use OpKernelContext. (#3523) 2020-04-14 14:30:26 -07:00
Tixxx
06b63975c0
Fix fp16 type mismatch when graph output is an fp32-only node (#3411)
* verify output node before changing its type in mixed precision mode
2020-04-14 09:35:19 -07:00
edgchen1
ba7225f986
Update Graph SetInputs and SetOutputs for training (#3446)
Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction a bit ambiguous - i.e., even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.
2020-04-13 19:10:44 -07:00
M. Zeeshan Siddiqui
5d99f179b9
Merge pull request #3486 from microsoft/sedymche/merge_master_ort_training
Merge from master into ort_training
2020-04-13 10:55:36 -07:00
Tixxx
f5ba9c922d
fix internal loss scale (#3483)
* Changed internal loss scale to 1-D

* added test

Co-authored-by: root <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
2020-04-10 14:13:48 -07:00
edgchen1
20c7dd9f5c
Remove orttraining/docker directory. (#3476)
The docker images are not publicly available yet.
Addressing PR comment: https://github.com/microsoft/onnxruntime/pull/3174#discussion_r390761308
2020-04-10 09:41:22 -07:00
Vincent Wang
03996c7c08
Fixes for Where, ConcatGrad and ReduceSumGrad (#3415)
* Fixes for Expand, Where, ConcatGrad, ReduceSumGrad.

* Roll back expand, fix, add tests for reduce grad.

* Roll back CPU Expand change.

* Fix after merge.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-04-10 19:35:32 +08:00
Sergii Dymchenko
84773c61c6 Rename ONNX OPTIONAL to OPTIONAL_VALUE. 2020-04-09 16:22:30 -07:00
liqunfu
1ddfe1249b
frontend test to use random seed (#3209)
frontend test to use random seed
2020-04-08 10:03:07 -07:00
ytaous
b35468289a
View Op - new unit tests and add support for tensor memcpy by offset/size (#3439)
* view ops UTs

* update per comments

* PR comments - code clean up

* code clean up per comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-04-07 13:07:11 -07:00
Thiago Crepaldi
15e32b44fd
Merge pull request #3383
Merge from master into ort_training
2020-04-06 19:05:01 -07:00
Edward Chen
95707d22a5 Disable gradient clipping for E2E test. 2020-04-06 23:07:28 +00:00
Sherlock
a3ab2ba036
Reapply commit 131c65d; Fix memory regression issue. (#3423)
* Reapply commit 131c65d

* fix merge error
2020-04-06 10:29:31 -07:00
edgchen1
82c1e1b3db
Enable loss scale input from Python frontend (#3327)
Made some fixes to enable loss scale to be wired up to ORT from the Python frontend. In particular, now addition of loss scaling is done unconditionally if mixed precision is enabled. The generated loss scale input name is passed back to the frontend.

Also fixed how inputs were added during the training graph configuration. Graph::SetInputs() was causing some issues - it seems to not be working correctly.

Also added some mixed precision Python frontend tests.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-03 16:02:14 -07:00
Sherlock
f437665360
Revert "Addressing PR comments (#3334)" (#3412)
This reverts commit 131c65d23d.
2020-04-03 11:59:47 -07:00
Thiago Crepaldi
d89e5d91a6 Disable GradientCheckerTest tests for GPU/Debug build (#3407) 2020-04-03 01:01:58 +00:00
Thiago Crepaldi
675035b1a8
Disable GradientCheckerTest tests for GPU/Debug build (#3407) 2020-04-02 18:00:54 -07:00
Sherlock
614eb438ae
Update Op's Domain and Version (#3356)
* Update Nccl ops domain opset

* Update ZeroGradient Domain OpSet

* Update InPlaceAccumulator Domain OpSet

* Update SoftmaxGrad Domain and OpSet

* Update LayerNormalizationGrad Domain and OpSet

* Update BatchNormGrad Domain and Opset

* Update IsAllFinite Domain and Opset

* Update DivGrad Domain and Opset

* Update GatherGrad Domain and Opset

* Update IsFinite Domain and OpSet

* Update ReduceAllL2 Domain and Opset

* Update MixedPrecisionScale Domain and Opset

* Update AllOp Domain and Opset

* Update GroupOp Domain and OpSet

* Update ViewOp Domain and OpSet
2020-04-01 10:10:38 -07:00
Xueyun Zhu
efc8bd738f
add pipeline graph split script (#3275)
* pipeline graph cut

* add element type

* add input wait event and shape info

* shape inference

* support multiple cuts

* format script

* address feedback

* address feedback
2020-03-31 19:30:18 -07:00
Thiago Crepaldi
83c3da3fc0 Fix code-base after breaking API changes 2020-03-31 17:59:20 -07:00
Weixing Zhang
1bbc421884
Don't cast to fp16 in LayernormGrad (#3328)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-03-28 19:07:32 -07:00
Sherlock
ffb2a3359e
Implement WhereGrad (#3343) 2020-03-27 19:10:40 -07:00
Tixxx
49e6043d07
support Huggingface's adamw (#3318)
* add weight decay mode to support both pytorch and huggingface's adamw
2020-03-27 08:04:27 -07:00
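
Note on the commit above: the message does not say what distinguishes the two weight decay modes. One commonly described difference between the PyTorch and Hugging Face AdamW implementations is the order of operations: PyTorch shrinks the weight before applying the Adam update, while Hugging Face applies the decay after it. The scalar sketch below illustrates only that ordering, under the assumption that this is what the mode selects. The function name and the decay_before_update flag are hypothetical, and bias correction is omitted for brevity.

```python
import math


def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, decay_before_update=True):
    """One scalar AdamW step; the flag selects when weight decay is applied."""
    # Standard Adam moment estimates (no bias correction in this sketch).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    update = lr * m / (math.sqrt(v) + eps)

    if decay_before_update:
        # Assumed PyTorch-style ordering: decay first, then the Adam update.
        w = w * (1.0 - lr * weight_decay)
        w = w - update
    else:
        # Assumed Hugging Face-style ordering: Adam update first, then decay.
        w = w - update
        w = w * (1.0 - lr * weight_decay)
    return w, m, v
```

With weight_decay set to 0 the two orderings coincide; otherwise they differ by a term proportional to lr * weight_decay * update.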
ytaous
131c65d23d
Addressing PR comments (#3334)
* PR comments

* PR comments

* PR comments

* error out bad shape

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-03-26 18:43:30 -07:00
Xueyun Zhu
0a6ec0df56
Merge pull request #3285 from microsoft/xuzhu/merge_from_master
Merge from master to ort_training
2020-03-26 12:10:13 -07:00
Sherlock
d143b41b81
Expose frozen_weights in PyTorch Frontend (#3317) 2020-03-26 11:26:54 -07:00
Wei-Sheng Chin
b38fc0d541
Add bias correction in Adam & Lamb for C++ frontend & python frontend (#3301) 2020-03-25 09:46:44 -07:00
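
Note on the commit above: in the standard Adam formulation, the moving averages start at zero and are therefore biased toward zero in early steps; bias correction divides each average by 1 - beta^t to compensate. The sketch below shows that textbook correction on scalar gradients. The function name is hypothetical, and it is not meant to reproduce the C++ or Python frontend code touched by the commit.

```python
def bias_corrected_moments(grads, beta1=0.9, beta2=0.999):
    """Return bias-corrected (first, second) moment estimates after each step."""
    m = 0.0
    v = 0.0
    corrected = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving averages, initialized at zero.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Dividing by (1 - beta^t) removes the zero-initialization bias.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        corrected.append((m_hat, v_hat))
    return corrected
```

For a constant gradient g, the corrected estimates equal g and g squared at every step, whereas the uncorrected averages only approach those values gradually.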
Xueyun Zhu
e9877850a4 fix python error 2020-03-25 01:59:37 +00:00
Bowen Bao
6474801ceb
Update ort_trainer.py with lazy onnx export (#3244)
* Delay onnx export to avoid extra info

* handle cases where onnx model is provided at initialization

* address comments

* fix rebase error
2020-03-24 13:34:15 -07:00
Li-Wen Chang
98c28060b0
Aggregated Send/Recv (#3232)
* Aggregated Send/Recv

* fix typos

* CR refine

* CR refine

* CR refine

* Add scalar check.

* typo

* reformat

* CR refine

* Forgot to swap order in the implementation after spec changed

* CR refine

* CR refine

* add Send's input type checking
2020-03-24 10:20:11 -07:00
KeDengMS
d15c74e713
Implement pipeline event generator (#3206)
Implement a pipeline event generator with a OneFWOneBW schedule in the timeline. Each pipeline stage contains the FW and BW of a subset of the model and is scheduled in one worker thread for each microbatch.
2020-03-23 17:32:54 -07:00
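
Note on the commit above: the message does not give the exact event ordering. A commonly described one-forward-one-backward schedule has each stage run a few warm-up forward passes, then alternate one forward with one backward, then drain the remaining backward passes. The sketch below generates that per-stage ordering; the function name and the tuple representation are hypothetical, and whether the commit's generator matches this ordering is an assumption.

```python
def one_fw_one_bw_schedule(num_stages, num_microbatches):
    """Return, per stage, an ordered list of ("FW" | "BW", microbatch) events."""
    schedule = []
    for stage in range(num_stages):
        # Earlier stages run more warm-up forwards before their first backward.
        warmup = min(num_stages - stage - 1, num_microbatches)
        events = [("FW", mb) for mb in range(warmup)]
        next_fw = warmup
        next_bw = 0
        while next_bw < num_microbatches:
            # Steady state: one forward (if any remain), then one backward.
            if next_fw < num_microbatches:
                events.append(("FW", next_fw))
                next_fw += 1
            events.append(("BW", next_bw))
            next_bw += 1
        schedule.append(events)
    return schedule
```

For example, with 2 stages and 3 microbatches the last stage strictly alternates forward and backward, while the first stage runs one extra forward up front.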
Xueyun Zhu
8f7bd51f7a fix pybind issue introduced by merge 2020-03-23 23:23:34 +00:00
Tixxx
7f610caca0
Make gradient clipping configurable. (#3243)
* Make gradient clipping configurable.
add control flag to c++ and python frontend
2020-03-23 12:21:48 -07:00
Xueyun Zhu
9dbc50c438 fix build break 2020-03-21 02:16:00 +00:00
liqunfu
d521efd904
refactor frontend (#3235)
* refactor frontend

* remove training python files from inferencing build

* update according to reviewer's comments

* merge pybind_state.cc

* refactor pybind_state.cc

* code clean up

* missed a forward declaration in ort_pybind_state.cc

* passed pytest

* move training_session.py into a subfolder per reviewer's comment

* add copyright

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-03-19 20:59:41 -07:00
edgchen1
d9f628cb1d
Remove orttraining/tools/scripts/profile directory. (#3268) 2020-03-19 14:13:05 -07:00
edgchen1
61e8a24340
Address PR comments (#3255)
* Added comment for ntfw_remove().

* Rewrite WindowsEnv::DeleteFolder(), some other clean up.
2020-03-18 17:57:57 -07:00
edgchen1
c5576d70a6
Fix build issues (#3214)
* Fixed issues with Python and inference-only build.

* Handle ImportError for training imports.

* fix windows build

* fix compile error

* fix centos build

* fix windows build

* fix compile error

* Use SafeInt for allocation calculation, fix typo.

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-03-17 16:10:23 -07:00
Sherlock
4b2c8e884e
Update License Header (#3212) 2020-03-16 10:24:31 -07:00
Jesse Benson
3a7539e071 Update bert-base convergence values 2020-03-13 23:03:34 -07:00
Jesse Benson
dc11b82956 Tweak the dropout calculation. 2020-03-13 23:03:34 -07:00
Ethan Tao
2f1e997e5b Merged PR 5686: fix P100/fp16 issues
1. misaligned address in atomic_add()
2. GatherNDGradKernel to use atomic_add
3. enable/add UTs for GatherNDGrad and reduction_ops using half
- __CUDA_ARCH__ won't take effect on .cc code, leverage HasCudaEnvironment() instead
4. verified convergence graph and perf test
- p100 is much slower than v100 on fp16
- fp16/128 need to reduce batch size from 66 to 64 to avoid OOM issue
5. verify convergence test on Dev3/v100

TBD - broken UTs related to MatmulIntegerOpTest (works on v100/windows, though)
2020-03-12 16:51:45 -07:00
Ke Deng
75025461e2 Initial implementation of graph cut and pipeline
This is a draft of graph cut and wait/record to demonstrate cut and Wait/Record design. You may find sub models and profiling json under onnxruntime/test if you run "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline"
2020-03-12 16:51:45 -07:00
Edward Chen
80dd62a240 Enable CI for training. 2020-03-11 14:41:32 -07:00
Edward Chen
e542cfd0e0 Introduce training changes. 2020-03-11 14:39:03 -07:00