Commit graph

305 commits

Author SHA1 Message Date
ashbhandare
0a9b83a313
Add zero test (#5476) 2020-10-21 17:12:00 -07:00
Vincent Wang
b48f596a91
GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel (#5511)
* TopKGrad CPU kernel

* use Scatter for GatherElementsGrad and TopKGrad.

* rollback convgrad change.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-10-21 09:29:29 +08:00
Xavier Dupré
66c8a441e0
Improves ReduceSum performance by removing transposition. (#5370)
* Improves ReduceSum performance
* Add min, max, L1, L2, logsum, sumsquare
* remove all reduce implementation including transpose
2020-10-20 10:36:31 +02:00
Juliana Franco
0298b9734e
Save in EndTraining only if in last rank (#5500)
* Only save partition of graph with loss (during EndTraining)

* fix comments

Co-authored-by: Juliana <jufranc@microsoft.com>
2020-10-19 14:16:48 -07:00
Derek Murray
0b59004666
Add fallback function implementation for DivGrad (#5518)
* Add fallback function implementation for DivGrad.

* Add shape inference for DivGrad.

* Add missing argument.

Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-19 10:47:47 -07:00
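For readers unfamiliar with what a DivGrad op has to produce, here is a minimal sketch of the elementwise math, assuming no broadcasting. It is not the function body added by the commit above; `div_grad` is a hypothetical name, and the formulas are the standard derivatives of `a / b`.

```python
def div_grad(dy, a, b):
    """Gradients of y = a / b (elementwise, same-length lists).

    dA = dY / B
    dB = -dY * A / B^2
    """
    da = [g / y for g, y in zip(dy, b)]
    db = [-g * x / (y * y) for g, x, y in zip(dy, a, b)]
    return da, db


# Small worked example with values that are exact in floating point.
da, db = div_grad(dy=[1.0, 3.0], a=[1.0, 4.0], b=[2.0, -0.5])
```

When the inputs are broadcast against each other, each gradient would additionally need to be summed over the broadcast axes; that step is omitted here.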
Derek Murray
6f65e2ad2c
Mark the dX and dB outputs of ConvGrad as OpSchema::Optional. (#5462)
* Mark the dB output of ConvGrad as OpSchema::Optional.

* Also mark dX as optional

Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-15 16:54:17 -07:00
Derek Murray
64f6d856e4
Add FlattenGrad and test. (#5461)
Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-15 16:11:57 -07:00
Derek Murray
88f6523baf
Add type inference for BroadcastGradientArgs (#5501)
* Add type inference for BroadcastGradientArgs

This change enables the ONNX shape and type inference to work on a function body containing a BroadcastGradientArgs op. Without this change, the dummy inference function is used, and no types are inferred for the output here:

531e6dd459/onnx/shape_inference/implementation.cc (L467-L469)

* Handle optional outputs.
2020-10-15 16:11:24 -07:00
Scott McKay
7da7e07909
Cleanup some test infrastructure (#5484)
* Created shared version of InferenceSession wrapper class and update relevant tests to use it.
Include domain in the ops counting helper so it's more general and we don't need to duplicate it in the nchwc tests. Update tests to include domain in key being checked.

* Fix some training tests

* Fix prefixing of contrib op names in test
2020-10-16 06:44:01 +10:00
KeDengMS
c444b9d76a
Add CUDA option to run copy in default stream (#5445)
* Add CUDA option to run copy in default stream

This change fixes #4829. Thanks @maherzog for providing the repro!

The bug is caused by memory reuse in the BFC arena, where the copy and
compute streams in CUDA have a race condition.

The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that
reduces the cost of syncing CPU and GPU on alloc/free. This means that
when the CPU allocs/frees memory, the GPU might not have finished its
previous work on that memory, so that CPU and GPU can run asynchronously.

This is OK if there's only one stream, where the execution order
on CPU and GPU is consistent. For example, if we have two kernels
A and B, and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory since computeA and computeB
will not race as long as they run in the same GPU compute
stream.

However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB,
the order of execution on the GPU could have copyA happen after computeB
if copy and compute happen in different GPU streams.

This change makes copies run in the default compute stream, while adding
an option to fall back to the previous behavior if there's a perf hit. This
is a short-term fix until the BFC arena can support multiple streams.

Users may use the following options to revert to the previous behavior:
C API:
  struct OrtCUDAProviderOptions cudaProviderOpt;
  cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
  CUDAExecutionProviderInfo cudaEPInfo;
  cudaEPInfo.do_copy_in_default_stream = false;
C# API:
  pending...
Python:
  import onnxruntime
  onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)

* Confirmed the test fails in CI when doing copy in a separate stream

Revert the test to get CI passing for now

* Fix Windows test

* Address CR
2020-10-12 22:12:05 -07:00
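The race described in the commit above can be illustrated with a toy model. The sketch below is not ONNX Runtime or CUDA code. It assumes each GPU stream is a FIFO queue with no ordering between streams, and that the arena hands the block freed by A straight to B. All names (`simulate`, `issue`, `block0`) are hypothetical.

```python
from collections import deque


def simulate(ops, stream_order):
    """Run queued GPU ops. Each stream is FIFO; streams are drained in
    stream_order, which stands for one legal cross-stream schedule."""
    queues = {}
    for stream, name, fn in ops:
        queues.setdefault(stream, deque()).append((name, fn))
    memory = {}
    executed = []
    for stream in stream_order:
        for name, fn in queues.get(stream, ()):
            fn(memory)
            executed.append(name)
    return memory, executed


def copy_a(memory):
    memory["block0"] = "A input"   # copyA writes A's data into the block


def compute_b(memory):
    memory["block0"] = "B result"  # B reuses the block that A freed


def issue(copy_stream):
    """CPU issue order is always copyA, then computeB."""
    return [(copy_stream, "copyA", copy_a), ("compute", "computeB", compute_b)]


# Separate copy stream: the GPU may drain "compute" before "copy".
racy_memory, racy_order = simulate(issue("copy"), ["compute", "copy"])

# Copy issued on the compute stream: FIFO order matches CPU issue order.
safe_memory, safe_order = simulate(issue("compute"), ["compute", "copy"])
```

In the racy schedule the late copyA overwrites B's result in the shared block; with a single stream the block ends up holding B's result, as intended.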
Sherlock
60dbd8a1e5
Update maximum batch size for UT; Include recompute modes (#5444)
* Update MaxBatchSize and include recompute mode
* Minor fix for frontend test

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-12 14:50:43 -07:00
Derek Murray
dbc626dcbe
Add ExpGrad registration and test. (#5438)
**Description**: Add missing gradient registration for the `Exp` op.

**Motivation and Context**
* Adding support for training a model that uses the `Exp` op.

Co-authored-by: Derek Murray <demurra@microsoft.com>
2020-10-12 13:56:08 -07:00
jingyanwangms
20c47ce91c
Simplified layer norm changes (#5028)
* t5 layer norm changes

* add t5 layer norm kernel

* use template for t5 layer norm

* template definition changes

* no build error

* add CPU cuda kernel

* first unit test

* other forward unit tests

* add T5LayerNormGrad

* Add c++ transform and test for T5 LN

* fix and some debug prints

* fix cuda error

* rename from t5 to simplified

* PR comments

* revert change on invertible LM code path

* remove duplicate forward computation

* add GradientCheckerTest.SimplifiedLayerNormGrad

* change back macro

* Fix SimplifiedLayerNorm Gradient

* merge with Sherlock's changes

* changed cuda kernel

* reapply cpu kernel changes

Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: aishwarya bhandare <aibhanda@microsoft.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-12 11:22:12 -07:00
Wei-Sheng Chin
6cba42e942
Avoid inserting other CUDA calls in-between NCCL Send's and Recv's (#5430)
* Avoid inserting other CUDA calls in-between NCCL Send's and Recv's

* Add a comment

* Place CUDA EP on the right device

* Fix a warning

* Address a comment
2020-10-09 15:34:46 -07:00
Sergii Dymchenko
3a9a1a4ef1
Fix registration for GatherGrad (#5382)
* Fix registration for GatherGrad to fix GatherGradOpTest.GatherGrad_axis0_indices2d_half.

* Fix GatherGrad registration for CUDA also.
2020-10-09 11:57:50 -07:00
Suffian Khan
498f94668d
Keep all_finite tensor on CPU when using PyTorch Frontend (#5371) 2020-10-08 15:47:18 -07:00
Hariharan Seshadri
6f54113a1b
Support OrtValue binding in Python to enable interesting IOBinding scenarios in Python (#5248) 2020-10-06 21:14:41 -07:00
liqunfu
773992c7d4
Liqun/bert pretrain tb (#5377)
* add TensorBoard; remove torch.distributed.launch because ORT NCCL depends on MPI. Use MPI to launch parallel training.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-06 16:28:31 -07:00
Du Li
323c4dfe02
Adding an option for cudnn conv algorithms. (#5159)
* adding cudnn conv algorithm selection options.

* adding cudnn conv algorithm selection options.

* export the api

* adding the perf test option.

* accommodating PR comments.

* Move OrtSessionOptionsAppendExecutionProvider_CUDA to onnxruntime_c_api.h

* Accommodating PR comments.
2020-10-05 16:53:52 -07:00
Ashwini Khade
668ab04917
rename all TransposeMatMul nodes to FusedMatMul (#5373) 2020-10-05 12:41:05 -07:00
Wei-Sheng Chin
4e3a420aa7
Use single thread when pipeline is not enabled in TrainingRunner (#4265)
* Use single thread when pipeline is not enabled in TrainingRunner

* Remove macro indents

* Format file and remove state variable
2020-10-05 10:42:09 -07:00
Sherlock
e71668f92c
Expose recompute configs to the frontend (#5318)
* Expose recompute configs to the frontend

* Add frontend test

* Ensure recompute graph transformer is only applied once

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-02 09:49:47 -07:00
liqunfu
fe50213491
Liqun/bert pretrain2 (#5327)
* bert single node multi GPU pretrain w/o checkpoint

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-01 11:01:26 -07:00
Sherlock
37445d1198
Update Bert Perf Script (#5339)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-30 14:30:20 -07:00
Sherlock
9ec1ed42a8
Enable BiasDropoutFusion for CUDA EP only (#5324)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-29 14:00:15 -07:00
Sherlock
11c194ce29
Minor fix for ComputeBroadcastBackwardAxesDynamic; Fix for GradientGraphBuilder logging (#5313)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-29 09:49:05 -07:00
Vincent Wang
eae2473dc1
Scale Op for ReduceMeanGrad. (#5191)
* Scale Op for ReduceMeanGrad

* fix Windows build error

* resolve PR comments.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-09-29 09:30:49 +08:00
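The idea behind using a scale for ReduceMeanGrad can be sketched as follows: the gradient of a mean over N elements is the incoming gradient broadcast to every element and scaled by 1/N. This is a plain-Python illustration over a 1-D list, not the kernel from the commit above; `reduce_mean_grad` and `numeric_grad` are hypothetical helper names.

```python
def reduce_mean(xs):
    return sum(xs) / len(xs)


def reduce_mean_grad(dy, n):
    """Gradient of a mean over n elements: broadcast dy, scale by 1/n."""
    scale = 1.0 / n
    return [dy * scale for _ in range(n)]


def numeric_grad(f, xs, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(xs)):
        hi = xs[:i] + [xs[i] + eps] + xs[i + 1:]
        lo = xs[:i] + [xs[i] - eps] + xs[i + 1:]
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


xs = [1.0, 2.0, 4.0, 8.0]
analytic = reduce_mean_grad(1.0, len(xs))
numeric = numeric_grad(reduce_mean, xs)
```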
Tang, Cheng
d9ecc0cebf
add bert loss legacy back (#5224) 2020-09-27 13:41:16 -07:00
Guoyu Wang
3a3f26f38e
Move ort flatbuffers helper functions and value info r/w functions into separated lib (#5276)
* Move fbs include from header to cc

* add initial cmake for flatbuffers

* Move most flatbuffers util to ort_flatbuffers

* move code around

* fix

* move test/perf runner to use flatbuffer directly instead of model

* minor update

* Fix build break

* Clean up includes and forward decl

* Fix training CI build breaks

* Addressed PR comment, replaced some include with forward decls

* Remove ORT_MUST_USE_RESULT temporarily
2020-09-25 05:36:29 -07:00
Changming Sun
17f1178c2e
Downgrade GCC (#5269)
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-09-24 21:14:54 -07:00
Sherlock
b03fb82ab7
Transformer layer-wise Recompute (#4526)
* Build Recomputation Graph

* Make topological sort to run FW nodes first

* Pattern match start and end of transformer layer

* Topological sort with Priority

* Add logger to Gradient Graph Builder

* Use Logger

* Introduce Execution Order
2020-09-24 19:56:32 -07:00
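A priority-based topological sort of the kind named in the commit above can be sketched with Kahn's algorithm and a heap. This is an illustration under assumptions, not the ORT implementation: the priority values (0 for forward nodes, 1 for backward and recompute nodes) and the node names are hypothetical.

```python
import heapq


def priority_topological_sort(edges, priority):
    """Kahn's algorithm; among ready nodes, the lowest priority value
    is emitted first (ties are broken by node name)."""
    nodes = set(priority)
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = [(priority[n], n) for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (priority[nxt], nxt))
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order


# f1 -> f2 -> loss -> b2 -> b1, with recompute node r1 feeding b1.
edges = [("f1", "f2"), ("f2", "loss"), ("loss", "b2"),
         ("b2", "b1"), ("r1", "b1")]
priority = {"f1": 0, "f2": 0, "loss": 0, "r1": 1, "b2": 1, "b1": 1}
order = priority_topological_sort(edges, priority)
```

Although `r1` has no predecessors and is ready from the start, its lower priority defers it until the forward nodes have run, so the recomputed activation is not kept alive across the forward pass.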
Ashwini Khade
16220f3848
Add FusedMatMul contrib op (#5213)
* bug fix transformer

* fuse cpu kernel for transposescalematmul and matmul

* fuse transpose_scale_matmul cpu kernel with matmul

* fix test

* Add FusedMatMul Contrib Op

* fix test

* fix typo

* plus more updates per review
2020-09-23 12:17:50 -07:00
Scott McKay
c52561d044
Rework broadcasting setup to decrease binary size. (#5227)
* Rework broadcasting setup to decrease binary size. Push all the type-specific code down and separate out the broadcasting/parallelization.

Reductions:
element_wise_ops: 521.0 KB -> 268.8 KB
where: 25.8 KB -> 17.3 KB
qlinear_binary_op: 28.1 KB -> 12.8 KB
2020-09-23 14:15:40 +10:00
KeDengMS
8dceebda0e
[Training/Python] Add option to enable symbolic shape inference (#5107)
This change adds symbolic shape inference to ORT training, which helps static memory planning for models like BART.
2020-09-22 10:49:07 -07:00
Sherlock
1478643215
Place Shape's output in CPU memory (#5245)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-21 20:21:59 -07:00
Pranav Sharma
974b9bfc09
Allow sharing of initializers between sessions. (#5092)
* Allow sharing of initializers between sessions.

* Allow sharing of initializers between sessions (2).

* Add test for C#

* Add test for C#; address PR comments

* Address PR comments
Moved AddInitializer logic to internal session options
Added tests for owned buffer
Clarified documentation
Fix bug where memory info and not device was getting compared

* Fix test

* Fix training build

* Add ver 5 end marker and ver 6 starter, add scenario and usage examples.
2020-09-21 14:09:37 -07:00
edgchen1
e9671e93f0
Fix TransposeScaleMatMul and MatMulScaleFusion issues (#5230)
- Rename TransposeScaleMatMul back to TransposeMatMul for backwards compatibility
- Fix MatMulScaleFusion issues:
  - Add check for supported execution providers
  - Add check for supported MatMul input types
2020-09-21 12:34:01 -07:00
Suffian Khan
84589c7e05
Fuse softmax(a + b) in case of simple broadcast (#4937)
* bias softmax kernel

* bias softmax kernel

* remove debug comments

* remove debug comment

* windows build doesn't handle unary minus on unsigned type

* int64 => int treated as error

* only support cuda

* add bias softmax fusion tests

* PR comments

* more PR comments

* use MLTypeCallDispatcher

* break function into pieces

* add loop unroll and add to list for inference as well

* use std::min and move operator==

* revert std::min (doesn't work in CI pipeline) and fix int to size_t error

* pr comments

* fixes for windows ci

* fix for windows ci

* pr comments on consistency

* p_model_

* fix formatting and add anonymous namespace

Co-authored-by: suffian khan <sukha@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-18 14:15:55 -07:00
Tang, Cheng
e0b49844e9
Provide option to let layernorm stash mean/var as fp32 or bfloat16 (#5215)
* add option to set layernorm stash type

* bug fix

* fix merge error

* fix win build error
2020-09-18 13:42:01 -07:00
Suffian Khan
e01e0b2e40
Fix softmax_warp_backward math when is_log_softmax = True and register LogSoftmax CUDA kernel (#5160)
* register logsoftmax cuda kernel; fix logsoftmaxgrad cuda kernel; fix tests to invoke dispatch_softmax_*

* forgot to remove axis check

* add tests all axis

Co-authored-by: suffian khan <sukha@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-17 07:15:25 -07:00
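The commit above does not spell out the backward formulas, so the sketch below uses the standard identities for softmax and log-softmax gradients and checks the log-softmax case numerically. It is plain Python over a single row, not the `softmax_warp_backward` CUDA kernel; all function names here are hypothetical.

```python
import math


def log_softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]


def log_softmax_grad(dy, y):
    """dX_i = dY_i - exp(Y_i) * sum_j dY_j, with Y = log_softmax(X)."""
    total = sum(dy)
    return [g - math.exp(out) * total for g, out in zip(dy, y)]


def softmax_grad(dy, y):
    """dX_i = Y_i * (dY_i - sum_j dY_j * Y_j), with Y = softmax(X)."""
    dot = sum(g * out for g, out in zip(dy, y))
    return [out * (g - dot) for g, out in zip(dy, y)]


def numeric_grad(f, xs, eps=1e-6):
    """Central finite differences of a scalar function f."""
    grads = []
    for i in range(len(xs)):
        hi = xs[:i] + [xs[i] + eps] + xs[i + 1:]
        lo = xs[:i] + [xs[i] - eps] + xs[i + 1:]
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


x = [0.5, -1.0, 2.0]
dy = [1.0, -2.0, 0.5]


def loss(xs):
    # Scalar whose gradient w.r.t. xs equals the log-softmax backward pass.
    return sum(g * out for g, out in zip(dy, log_softmax(xs)))


analytic = log_softmax_grad(dy, log_softmax(x))
numeric = numeric_grad(loss, x)
```

The two formulas differ in where the output enters: the softmax form multiplies by `Y`, while the log-softmax form multiplies only the summed gradient by `exp(Y)`, which is why a shared kernel needs a separate branch for the log case.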
Vincent Wang
c37472a1aa
Mixed Precision Transformer and Gradient Builder Refactor (#4892)
* transform mixed precision before build gradient

* resolve comments

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-09-17 02:44:50 +08:00
edgchen1
a20f8037f6
Install ssh in builder image, fix segfault in TrainingRunnerTest.Basic. (#5186) 2020-09-16 09:53:30 -07:00
Bowen Bao
400ac85565
Improve error message for FE model export checking (#5156) 2020-09-16 09:22:37 -07:00
Rayan-Krishnan
92a8c650ad
[Debuggability] Add feature to ORTTrainer Frontend (#5124)
* add option, feature to orttrainer and test

* address comments

* minor fixes

* further address comments

* minor changes

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-11 12:16:07 -07:00
Scott McKay
59ee8ffb17
Remove SparseTensor support from minimal build. (#5114)
* Remove SparseTensor support from minimal build.

Currently the only valid usage of a SparseTensor is as an attribute of a Constant node. That would have been lifted to a dense tensor initializer when loading the onnx model, so would not exist when saving the ORT format model. Due to that there can be no SparseTensors in an ORT format model.

Co-authored-by: gwang <wanggy@outlook.com>
2020-09-11 17:56:54 +10:00
Wei-Sheng Chin
9ba56dcfed
Support Send and Recv for old NCCL versions (#5097)
If NCCL version < 2.7, MPI is used. Otherwise, we use NCCL Send and Recv.
2020-09-09 20:58:05 -07:00
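The selection rule in the commit above amounts to a version comparison. The sketch below shows that rule only; `select_p2p_backend` is a hypothetical helper, the version is assumed to arrive as a `(major, minor, patch)` tuple, and how the real build detects the NCCL version is not shown in the source.

```python
def select_p2p_backend(nccl_version):
    """Return "nccl" when NCCL is at least 2.7, otherwise "mpi"."""
    major, minor = nccl_version[:2]
    return "nccl" if (major, minor) >= (2, 7) else "mpi"


old_backend = select_p2p_backend((2, 6, 4))
new_backend = select_p2p_backend((2, 7, 8))
```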
Wei-Sheng Chin
934f30fc38
Not to call NVTX when not available (#5095)
* Not to call NVTX when not available

* fix syntax

* Fix a syntax error
2020-09-09 20:01:45 -07:00
Xueyun Zhu
a90fae8c71
unify error handling in pipeline transformer (#5039) 2020-09-09 14:52:04 -07:00
Thiago Crepaldi
6594d6672f
Move onnxruntime.experiment to onnxruntime.training namespace (#5045) 2020-09-09 09:46:06 -07:00
Wei-Sheng Chin
4ccca20def
Replace MPI Send and Recv with NCCL Send and Recv (#5054)
* Prototype NCCL P2P

* Clean code

* Fix NCCL path and some minor bugs

* Add path

* Fix path

* Try fix path

* Add missed files

* Address some comments

* Clean code

* Rename files

* Add MPI path back and fix a path

* Put MPI path under USE_NCCL flag

* not to build Send and Recv when MPI is not installed
2020-09-09 09:39:56 -07:00