Commit graph

256 commits

Author SHA1 Message Date
Wei-Sheng Chin
4ccca20def
Replace MPI Send and Recv with NCCL Send and Recv (#5054)
* Prototype NCCL P2P

* Clean code

* Fix NCCL path and some minor bugs

* Add path

* Fix path

* Try fix path

* Add missed files

* Address some comments

* Clean code

* Rename files

* Add MPI path back and fix a path

* Put MPI path under USE_NCCL flag

* not to build Send and Recv when MPI is not installed
2020-09-09 09:39:56 -07:00
Vincent Wang
07bf8b968e
Register BiasGelu and BiasDropout for CUDA only. (#5060)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-09-09 11:46:55 +08:00
Sherlock
38453acae3
Further populate Stop Gradient list (#5021)
* Add to Stop Gradient list

* Improve Stop gradient
2020-09-08 12:49:09 -07:00
liqunfu
de58720a97
Liqun/transformer test and e2e golden numbers (#5064)
* match new/old api numbers

* new golden numbers for Roberta and MC

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-04 18:11:37 -07:00
Vincent Wang
84de14a833
Register OpSet13 CUDA Kernels for BERT/UniLMv2 (#4856)
* opset13 cuda kernels for BERT.

* add opset13 SoftmaxCrossEntropyLoss.

* opset13 size.

* fix argmax/min for ut.

* fix ut failure for argmax/min.

* OrtMemTypeCPUInput

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-09-05 08:09:52 +08:00
Bowen Bao
6dd4af3936
Fix initializer name only when wrapper is applied (#4920)
* Fix initializer name only when wrapper is applied

* fix inspect import
2020-09-04 12:08:07 -07:00
Thiago Crepaldi
0fc9c504fe
Re-enable CI tests for the new PyTorch frontend (#5017)
This PR includes:

* Re-enable CI tests for new PyTorch frontend
* Re-enable fp16 and adjust tolerances for number matching
2020-09-04 09:36:24 -07:00
liqunfu
bb13b52291
to allow parallel training with mpi4py (#4942)
to allow parallel training with mpi4py
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-03 12:47:12 -07:00
Thiago Crepaldi
9388d49c0d
Add warning to non-picklable models (#5037) 2020-09-03 11:53:56 -07:00
Thiago Crepaldi
9d1bdef195
Update CODEOWNERS and minor docstring fix (#5002)
This PR includes:

* Previous CODEOWNERS was encompassing more files than just training files
* Polynomial optimizer config is missing part of its docstring
2020-09-03 11:52:38 -07:00
Suffian Khan
546965c2da
Add deterministic path for AllReduceL2 (used to compute gradient norm) (#5027)
* add deterministic path for reduce l2

* add unit tests

* memset zero size off by one

* eliminate windows warning as error

Co-authored-by: suffian khan <sukha@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-03 10:02:41 -07:00
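A note on why a deterministic path for an L2 reduction can matter: floating-point addition is not associative, so summing the same squared gradients in a different order may give a slightly different norm from run to run. The sketch below only illustrates that general point. It is not the AllReduceL2 kernel from #5027, and `l2_norm_fixed_order` and the shard layout are hypothetical names chosen for illustration.

```python
import math


def l2_norm_fixed_order(shards):
    """Return the L2 norm over all shards, always summing in the same order.

    `shards` is a list of lists of floats (a hypothetical layout, for example
    one list per gradient tensor). Visiting shards and elements in a fixed
    order makes the result reproducible across runs.
    """
    total = 0.0
    for shard in shards:      # fixed shard order: 0, 1, 2, ...
        for g in shard:       # fixed element order within each shard
            total += g * g
    return math.sqrt(total)


# Floating-point addition is order dependent:
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

norm = l2_norm_fixed_order([[3.0], [4.0]])
```

A non-deterministic reduction (for example, one built on atomic adds) may visit the terms in any order, which is typically faster but not bit-for-bit reproducible.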
Bowen Bao
22ba266bd6
Add flag to _internal_use to control export of contrib ops in ort trainer (#4968) 2020-09-03 09:11:47 -07:00
Scott McKay
28445c88f9
Changes to enable saving and loading an ORT format model (#4995)
* Changes to enable saving and loading an ORT format model via the public APIs.
Cleanup session.py to try and make slightly more understandable. More refactoring is needed here.
Couple of bug fixes

* Fix bug in handling NodeArg serialization for optional inputs which has a name and no type info.

* Address PR comments
  - tweak SessionOptions config to avoid double lookup
  - merge duplicated functionality in python binding around registering an EP with optional options

Fix a couple of build issues.

* Update C API to be consistent with python API
  - only load model in InferenceSession ctor if required
  - support loading ORT model in minimal build

* Fix nodejs test.
We get an invalid path error from LoadInterOp first now

* Another attempt at fixing nodejs test.
Error message depends on whether ENABLE_LANGUAGE_INTEROP_OPS is defined. Make the output consistent.

The interop implementation looks suspicious given it appears to be internal code that is going via the public api. TBD if that should be fixed.

* Fix couple of build issues.

* Disable test temporarily so PR can be checked in.
Will fix in separate PR that adds final pieces for minimal build as the test is required there.

* Give up on nodejs test and make the match simpler.
Fix init call in TrainingSession python to not pass through sess. It wasn't being used in Session anyway, so passing it through just adds confusion.

* Fix call to Session.__init__ in TrainingSession.
Session now initializes Session._sess to None to make it clearer where the 'ownership' of that member is, and that needs to happen before TrainingSession sets it.
2020-09-03 09:10:48 -07:00
Sherlock
a935731bd3
Neg Gradient (#5022)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-02 15:54:17 -07:00
Thiago Crepaldi
aabed34d5c
Fix checkpoint API and improve loss scaler handling (#4950)
This PR also includes:
* More LossScaler tests
* Minor LossScaler improvement
* Check model after extra post processing
* Improve basic training tests to include all optimizers
* Set rtol=1e-7 tolerance for Legacy vs Experimental frontend API tests
* Increase number of training tests for Legacy vs Experimental tests
* Minor refactoring on existing tests
* Fix Checkpoint API for Gradient Accumulation / fp16 scenarios
2020-09-02 09:38:02 -07:00
Thiago Crepaldi
eebc2cccce
Fix fetches when eval_step's input is a subset of train_step's input (#4966)
This PR also includes an MNIST sample using the new frontend
2020-09-02 08:57:44 -07:00
Thiago Crepaldi
f38f2d5b54
Port #4920 into the new pytorch frontend (#4965) 2020-09-01 19:00:49 -07:00
Hariharan Seshadri
d30dd41c0e
Remove public default ctor in PyInferenceSession and replace it with a protected ctor (#4990) 2020-09-01 17:10:36 -07:00
liqunfu
d79af260bb
Liqun/new api orttraining test transformers (#4982)
* matching transformer model test with Lamb
* increase epochs
* use atol 1e-6 to pass full precision test
2020-09-01 13:11:06 -07:00
Xueyun Zhu
1e1f5a9c79
support data parallel + pipeline parallel (#4648)
* enable data + pipeline parallel

* distributed group calculation

* fix typo

* fix test and minor changes
2020-08-31 17:32:03 -07:00
Thiago Crepaldi
9817b8c8a7
Fix state_dict/checkpoint issue introduced by #4639 (#4984)
https://github.com/microsoft/onnxruntime/pull/4639 changed the default
behavior by removing optimizer state from state_dict/checkpoint APIs.
The reason for the previous change was to allow models trained on ORT to
be used for inference on PyTorch, which is an important feature.

Due to the aforementioned change, when resuming training from a checkpoint,
the optimizer would start with random weights, leading to poor performance.
This behavior would also cause reproducibility issues, as the optimizer
wouldn't be able to resume from its previous state.

This PR adds a boolean flag to the state_dict/save_checkpoint APIs.
When True (default), both model and optimizer state are saved.
When False, only the model state is kept.
2020-08-31 17:00:14 -07:00
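The flag described in commit 9817b8c8a7 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ORTTrainer implementation: `ToyTrainer` and the parameter name `include_optimizer_state` are hypothetical, and the commit message says only that a boolean flag exists and defaults to True.

```python
import copy


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the ORTTrainer API."""

    def __init__(self, model_state, optimizer_state):
        self._model_state = model_state
        self._optimizer_state = optimizer_state

    def state_dict(self, include_optimizer_state=True):
        """Return model state, plus optimizer state unless the flag is False."""
        state = {"model": copy.deepcopy(self._model_state)}
        if include_optimizer_state:
            # Needed to resume training from where the optimizer left off.
            state["optimizer"] = copy.deepcopy(self._optimizer_state)
        return state


trainer = ToyTrainer(
    model_state={"linear.weight": [0.1, -0.2]},
    optimizer_state={"linear.weight": {"step": 10, "moment_1": [0.01, 0.02]}},
)

full = trainer.state_dict()  # default: model and optimizer state
inference_only = trainer.state_dict(include_optimizer_state=False)
```

The default keeps resumed training reproducible, while the model-only form suits exporting weights for inference elsewhere.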
Sherlock
50c610e70a
Stop Gradient at Shape op (#4983) 2020-08-31 13:13:17 -07:00
M. Zeeshan Siddiqui
6d9d252bc3
Disable NegativeLogLikelihoodLoss_LargeSizeTensor test (#4979)
Disabling this test until its intermittent failure is root-caused. This is a function and does not have a dedicated op by itself. To the best of my knowledge, this op is not used in any known model, so disabling this test for the sanity of CI until the investigation is over is probably reasonable.
2020-08-31 11:02:07 -07:00
Sherlock
98f7fdd7da
Handle MatmulGradient with 2D Weight at B (#4977) 2020-08-30 22:56:33 -07:00
Hariharan Seshadri
7045910d10
Support RegisterCustomOpsLibrary via the Python API (#4764) 2020-08-28 13:24:29 -07:00
Wei-Sheng Chin
1281ff6462
Put operators in-between Wait and Record (#4916) 2020-08-28 11:44:54 -07:00
Tang, Cheng
efdd96595f
bfloat16 and opset13 related fix (#4913)
* register part of opset13 cpu kernels; fix a bug in func impl; adjust reshapefusion order

* remove useless function

Co-authored-by: Cheng Tang <chenta@microsoft.com>
2020-08-27 16:10:53 -07:00
Sherlock
9f5d4918dc
MatMul Gradient optimization for dB when B's is 2D tensor (#4899)
* Optimized MatMulGrad for dB when B's shape is 2D

* Refactor for ConstantScalarNode

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-08-27 11:33:20 -07:00
harshithapv
00fe718264
Fix divide-by-zero for SSCE kernel when normalize factor is zero. (#4911)
* Changes in SSCE for all tokens ignored case.
2020-08-26 17:12:17 -07:00
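The failure mode named in commit 00fe718264 above can be illustrated with a small pure-Python sketch. When every label equals the ignore index, the count used to average the loss is zero, so a naive mean divides by zero. The function below is a hypothetical reference implementation, not the SSCE kernel, and returning 0.0 in the all-ignored case is an assumption made for illustration.

```python
import math


def sparse_softmax_cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over rows whose label is not `ignore_index`.

    `logits` is a list of rows of floats, `labels` a list of class indices.
    """
    total = 0.0
    counted = 0  # the normalize factor: number of non-ignored rows
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue
        # Numerically stable log-sum-exp.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
        counted += 1
    if counted == 0:
        # All tokens ignored: avoid total / 0 (assumed behaviour: zero loss).
        return 0.0
    return total / counted


all_ignored = sparse_softmax_cross_entropy([[1.0, 2.0], [0.5, 0.5]], [-1, -1])
uniform = sparse_softmax_cross_entropy([[0.0, 0.0]], [0])
```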
Thiago Crepaldi
cac25751bd
Fix mnist example (#4926) 2020-08-26 15:28:39 -07:00
liqunfu
b3783a9f85
matching multiple choice between new and old apis (#4918)
* matching multiple choice between new and old apis

* update according to reviewer's comments

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-08-26 12:36:10 -07:00
Bowen Bao
db6a821869
Enable example transformer test with dynamic size inputs (#4888)
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-24 14:31:08 -07:00
Rayan-Krishnan
eb05db5a2a
Fix OptimizerConfig params groups (#4877)
* Copy samples to build folder and load models from there. Fix CI
* This PR also includes a fix to path validation for save_as_onnx API
* Add torchtext to CI for GPU training
* Remove new frontend tests from CI

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-22 22:04:17 -07:00
Pranav Sharma
29dcfb24ab
Allow multiple sessions to share an allocator, optimize constant folding memory usage, expose arena configs. (#4813)
* Add support for sharing allocators

* Incremental update

* Address some PR comments, add unit tests, add documentation.

* Address PR comments, add tests and some documentation.

* Fix build and test issues

* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
2020-08-22 10:03:17 -07:00
jingyanwangms
fa68bbc82e
Relu grad kernel (#4864)
* create branch for debug

* move unit test

* more changes

* move relu to activations_grad*

* Fix ReluGrad Domain and opset version

* added unit test, CudaKernelTest.Relu_basic doesn't work yet

* remove CudaKernelTest.Relu_basic

* PR comment

* add unit test ReluGradTest_Basic

Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-08-22 01:03:44 -07:00
Thiago Crepaldi
dce2ce7a4f
Fix checkpoint API and copy samples into build dir (#4887)
* Fix state_dict APIs
* Copy samples to build folder and fix CI
2020-08-22 00:09:48 -07:00
liqunfu
6260d073b3
Glue parallel training (#4550)
add mpi size, rank python API

add single node parallel training example
2020-08-21 21:24:27 -07:00
Thiago Crepaldi
acbf6d15c6
Improve LRScheduler tests (#4885)
* LRScheduler tests added to the Transformer model
* Refactored LRScheduler tests for the BERT Toy onnx example
* Removed dead code
2020-08-21 16:18:30 -07:00
Thiago Crepaldi
5427a7e9af
Update LRScheduler to use scheduling similar to HuggingFace (#4880) 2020-08-21 10:24:04 -07:00
Rayan-Krishnan
7589445e6e
Add ONNX BERT Frozen Weights and Save as ONNX Tests (#4859)
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-19 21:31:38 -07:00
liqunfu
25cc6158a8
update golden numbers (#4865)
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-08-19 20:52:10 -07:00
liqunfu
d7233c7c97
Fix training for models with dict input (#4842)
This PR also includes:
* Remove defaults from named tuples to support python 3.6
* Allows model which takes dicts as input
* Adapts BERT finetuning example to run on the new frontend
* Match numbers for BERT fine tuning model

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-19 18:36:36 -07:00
Thiago Crepaldi
7cc88ef7ed
Port legacy checkpoint API into new front-end (#4855)
* Port legacy checkpoint API into new front-end

This PR also fixes:
* Warnings on ORTTrainer for improper tensor copies
* Inaccurate LRScheduler tests using wrong LR
* Stale DeepSpeed documentation
* Minor code refactoring for Toy BERT tests
* Move experimental state_dict() and load_state_dict() into checkpoint ns
2020-08-19 14:27:28 -07:00
Vincent Wang
5eaac31faa
support opset13 on transformers. (#4837)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-08-19 11:13:37 +08:00
gwang-msft
dee7596724
Add a generic collection of session configurations to the SessionOptions (#4718)
* adding generic configurations for session options

* fix a build break on linux

* fix training ci build break

* fix training ci build break

* addressed CR comments

* fix training ci build break

* move config_key from enum to string

* add c# api

* add python api

* fix build break

* move prepacking from 2 new api entries to session options configs

* fix training ci build break

* add python test, update some comments, move const key definition to avoid build break

* addressed comments

* move definitions of keys to common.h

* move api to version 5

* remove accidental change in build.py

* remove pragma to avoid build break

* addressed CR comments

* fix the python build break, and move location of config keys definition

* small typo changes
2020-08-18 13:40:40 -07:00
ytaous
2605af9a0b
Fix for mainz model (#4744)
* fix for mainz model

* fix build

* on comments

* revert the extra check

* on comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-08-18 11:47:19 -07:00
Thiago Crepaldi
f3b0c93a45
Fix issue preventing loss scaler to run due (#4833)
`LossScaler.update()` was not being properly called due to the incorrect TrainStepInfo.all_finite assignment.

In addition to this fix, _ORTTrainerModelDesc.is_finite was renamed to _ORTTrainerModelDesc.all_finite to make it more uniform with TrainStepInfo
2020-08-18 10:03:02 -07:00
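To show why a wrong `all_finite` value stops a loss scaler from doing its job, here is a minimal sketch of a dynamic loss scaler. It is not the `LossScaler` class from the repository: `ToyDynamicLossScaler`, its constructor parameters, and the doubling/halving policy are assumptions for illustration. Only the idea that `update()` depends on an `all_finite` signal comes from the commit message.

```python
class ToyDynamicLossScaler:
    """Hypothetical dynamic loss scaler for mixed-precision training."""

    def __init__(self, loss_scale=2.0 ** 16, up_scale_window=2000,
                 min_scale=1.0, max_scale=2.0 ** 24):
        self.loss_scale = loss_scale
        self.up_scale_window = up_scale_window
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.stable_steps = 0  # consecutive steps with finite gradients

    def update(self, all_finite):
        """Adjust the scale after a training step and return the new value."""
        if all_finite:
            self.stable_steps += 1
            if self.stable_steps >= self.up_scale_window:
                # Gradients have been stable for a while: try a larger scale.
                self.loss_scale = min(self.max_scale, self.loss_scale * 2.0)
                self.stable_steps = 0
        else:
            # Overflow (inf/nan) detected: back off and restart the window.
            self.loss_scale = max(self.min_scale, self.loss_scale / 2.0)
            self.stable_steps = 0
        return self.loss_scale


scaler = ToyDynamicLossScaler(loss_scale=8.0, up_scale_window=2)
history = [scaler.update(True), scaler.update(True), scaler.update(False)]
```

If the caller always passes the same value for `all_finite`, the scale either never backs off on overflow or never grows, which matches the kind of bug the commit describes.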
Hariharan Seshadri
a3c95374c3
Support asymmetric paddings in CUDA Conv kernel (#4627) 2020-08-18 02:09:30 -07:00
Rayan-Krishnan
24d9f4e0c3
Add More Extensive ONNX BERT Tests (#4827)
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-17 19:54:22 -07:00
Thiago Crepaldi
f933910ea3
Update LambConfig defaults to match backend (#4826) 2020-08-17 16:58:14 -07:00