597 commits

Author SHA1 Message Date
M. Zeeshan Siddiqui
82108b18e3
Partial graph execution perf improvements. (#7438)
* Partial graph execution perf improvements.

* PR feedback.

* Decrement reference count of tensors in ORTModule.

* PR feedback.

* PR feedback.

* PR feedback.
2021-04-26 17:13:55 -07:00
Thiago Crepaldi
0702a14ee7
Add pytorch version check before loading Python ONNX Runtime training module (#7377) 2021-04-26 14:53:50 -07:00
Vincent Wang
368e4a324f
SqueezeGrad Bugfix (#7412)
* squeezegrad bugfix

* fix ut

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2021-04-26 09:12:03 +08:00
Weixing Zhang
ca9b3f18e9
Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414)
* Pass cuda stream to thrust function to not use default stream.

In the commit 299ace0, ORT has been changed to not use cuda default stream.

* update amd_hipify.py

* remove unnecessary stream sync

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-25 01:18:56 -07:00
Thiago Crepaldi
410a81b21b
Add support for ORTModule to execute the graph when ONNX drops unused… (#7424) 2021-04-23 18:10:57 -07:00
M. Zeeshan Siddiqui
34ebf7d3dd
Partial graph execution made simple. (#7324)
* Python changes.

* C++ changes.

* fixes/hacks.

* more hacks.

* perf.

* changes.

* changes.

* re-architect partial graph execution and remove iobinding.

* changes.

* refactor.

* prevent copies from python to c++.

* perf.

* merge conflicts.

* misc.

* fix merge conflicts and tests.

* Ifdef partial executor.

* PR feedback.

* Delete ORT Task et al.

* Clean up.

* clean up.

* Restore SetOutputMLValue().

* PR feedback.

* Re-enable disabled ORTModule tests.

* PR feedback.

* PR feedback.
2021-04-23 15:09:18 -07:00
Tang, Cheng
1fa6d8fe1c
support loading external execution provider from python frontend (#7332)
* initial dynamic load example

* support load EP in the provider options

* support dynamic load EP in orttrainer

* split the provider interface; fix comments in pr

* remove experiment code

* add test

* remove useless file

* add test model file; fix linux break

* fix linux build and missing file

* fix python build

* fix python build

* fix python binding

* fix python test

* fix runtime path for posix env

* exclude the shared library from minimal build

* fix comments in pr;

* separate the provider shared lib loading

* excluded from minimal / macos / ios build

* skip copy the provider shared lib for minimal build and mac os

* fix macos build

* exclude the test for macos build

* exclude from android build

* exclude from web assembly build

* enable the invalid ep test

Co-authored-by: Cheng Tang <chenta@microsoft.com>
2021-04-23 09:54:09 -07:00
Ashwini Khade
75e054cd33
pick onnx release candidate (#7177)
* pick onnx release candidate

* fix typo

* filter batchnorm tests

* add implementation for reshape 14

* add identity op kernel for opset 14

* fix typo

* update onnx commit

* update commit to latest master

* add hashes for new kernel registrations and update 1

* TEST commit

* update onnx back to right commit

* Update onnx to latest in rel-1.9.0

* temp fix

* remove nonzeroshapesetter transformer

* pick rel branch latest commit

* fix build failures

* fix build failures

* fix build failures

* update the commit to latest in release branch

* add test filters for not implemented op14 ops in c# tests

* plus review comments
2021-04-22 23:57:09 -07:00
Thiago Crepaldi
771a6d235b
Fix IsContiguousTensor check on backend (#7391) 2021-04-21 17:01:17 -07:00
Sherlock
16ca7677e6
Relax ConvGrad Test tol (#7393)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-21 08:06:00 -07:00
Thiago Crepaldi
8421124344
Add support to **kwargs in ORTModule forward() method (#7360) 2021-04-20 16:21:52 -07:00
ashbhandare
76cc118dbe
Gemm transpose fusion (#7306)
* Gemm transpose fusion

* Correct rewrite rule effect

* Add to inference transforms to trigger on gradient graph
2021-04-20 09:35:05 -07:00
mindest
1a3ddf0714
Add gradient registration and tests for Min/Max (#7217)
* Add gradient registration and tests for Min/Max

* Add helper function for min/max grad test

* limit Min/Max Grad to accept at most two inputs; modify test case accordingly

* resolve merge error
2021-04-20 18:14:31 +08:00
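The two-input restriction on the Min/Max gradient in #7217 can be illustrated with a small sketch. This is not the ONNX Runtime gradient builder: the function names `max_grad` and `min_grad` are hypothetical, inputs are flat lists of equal length (no broadcasting), and sending the whole gradient to the first input on ties is an assumption made only for this example.

```python
def max_grad(a, b, dy):
    """Gradient of elementwise y = max(a, b) for two equal-length inputs.

    Assumption: on ties the upstream gradient goes entirely to `a`.
    The real kernel may treat ties differently.
    """
    da = [g if x >= y else 0.0 for x, y, g in zip(a, b, dy)]
    db = [g if x < y else 0.0 for x, y, g in zip(a, b, dy)]
    return da, db


def min_grad(a, b, dy):
    """Gradient of elementwise y = min(a, b), with the same tie assumption."""
    da = [g if x <= y else 0.0 for x, y, g in zip(a, b, dy)]
    db = [g if x > y else 0.0 for x, y, g in zip(a, b, dy)]
    return da, db


if __name__ == "__main__":
    a = [1.0, 5.0, 2.0]
    b = [3.0, 4.0, 2.0]
    dy = [10.0, 20.0, 30.0]
    print(max_grad(a, b, dy))
    print(min_grad(a, b, dy))
```

With exactly two inputs, each upstream gradient element is routed to whichever input produced the output element, so the two returned gradients always sum to `dy`.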
Sherlock
ce7ff27bac
Fix perf issue in Conv CUDA kernel (#7348)
* Fix perf issue in Conv CUDA kernel

* Read available memory from device

* assuming 10% fragmentation

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-19 23:37:05 -07:00
ashbhandare
ac346a1b90
Modify SimplifiedLayerNormFusion to allow fusion in the presence of Casts optionally (#7352)
* LN transform partial changes

* LN transform fix

* Make transform optional, remove unnecessary code

* Fix windows build

* review comment, windows CI fix

* review comments
2021-04-19 19:59:23 -07:00
ytaous
7abe1fd392
Identity elimination with graph output (#7312)
* Identity removal

* fix build

* fix build

* fix build

* fix build

* UTs

* fix UT

* fix UTs

* per comments

* fix UTs

* fix UTs

* per comments

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-19 16:36:35 -07:00
satyajandhyala
bb1e417da0
Add logging support to Cast Propagation transformation from python (#7353)
* Fixes needed to PropagateCast transformation.

* Added number of passes to the logs.

* Added logging support to OrtModuleGraphBuilder.

* Added new testcases.

* Added NodeArgToConsumerMap
2021-04-19 12:14:30 -07:00
M. Zeeshan Siddiqui
6dda1e0681
Flag for tensor memory re-use in allocation planner. (#7359) 2021-04-16 17:53:25 -07:00
satyajandhyala
0da085ed48
Propagate Cast operations to maximize lower precision (float16) computation (#7191)
* Added propagate_cast_ops option and PropagateCastOps transformation.

* Added test cases to propagate Cast operations.

* Expose GraphTransformerConfiguration to python interface and added propagate_cast_ops options.

* Added functionality to propagate Cast operations.

* Added logging.

* Apply cast propagation to the subgraphs.
2021-04-14 20:54:24 -07:00
Jesse Benson
be79575c6a
Use built-in reduce_sum() for simple reduction cases, specifically reduce all to a scalar. 2021-04-14 08:55:35 -07:00
ashbhandare
6ceee5d131
IsInf ReduceSum transform (#7188)
* IsInf ReduceSum transform

* Revert unnecessary changes, add isinf_only and isnan_only attr

* add tests, review comments

* Disable test for non-cuda

* Move IsAllFinite from training to contrib op

* review comments

* Review comment, formatting

* Enable test for ROCm EP
2021-04-13 16:05:21 -07:00
G. Ramalingam
f8a36dd6b3
Add DropoutGrad function body (#7310)
* Add DropoutGrad function body

* Add DropoutGrad function body

* Fix documentation and add test cases

* Fix template specialization

* Check expansion for float16 and bfloat16
2021-04-13 14:31:53 -07:00
harshithapv
a5d3a52d1a
Add Tile grad (#7289)
* tile grad

* fixed bugs

* added tile grad test

* bug fix

* Added tests. Addressed comments

* added optimization recommended and addressed comments

* fixed comment
2021-04-13 12:54:45 -07:00
Weixing Zhang
75c0192e4f
enable more unit tests for ROCM EP (#7307) 2021-04-09 15:15:13 -07:00
baijumeswani
b221a4fd86
Better error message when ORTModule used with torch.DataParallel (#7287)
* Better error message when ORTModule used with torch.DataParallel
2021-04-09 10:07:22 -07:00
Weixing Zhang
c22963c23d
Polish Lamb Kernel (#7299) 2021-04-09 09:55:57 -07:00
Weixing Zhang
8ad5007f8f
Polish Adam kernel (#7294)
* Polish Adam kernel
2021-04-09 01:11:09 -07:00
Thiago Crepaldi
7b4362c21a
Add support to dynamic positional/keyword input for ORTModule (#7189) 2021-04-08 12:46:21 -07:00
ytaous
e14b291ce7
Enable symbolic shape inference in ORTModule (#7282)
Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-08 09:47:09 -07:00
baijumeswani
d272c8434d
Suppress tracer warnings from onnx export in ORTModule (#7221)
* Suppress tracer warnings from onnx export in ORTModule
2021-04-08 03:41:38 -07:00
Sherlock
aa2c465143
Restrict ConvGrad to __CUDA_ARCH__>=700 (#7278)
* Restrict ConvGrad to __CUDA_ARCH__>=700

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-07 20:10:29 -07:00
Vincent Wang
beb299e17d
ConvGrad CUDA Kernel Bugfix (#7273)
* bugfix

* add ut
2021-04-08 08:22:18 +08:00
baijumeswani
844361bc67
Support eval mode and torch.no_grad context in ORTModule and restructure ortmodule.py (#7162) 2021-04-07 09:29:54 -07:00
Sherlock
4bc17ca04e
CUDA ConvGrad Kernel (#7227)
* ConvGrad CUDA impl

* Set up the test case for Deberta Conv1D

* Add fp16 test

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-06 22:09:06 -07:00
Derek Murray
25e261f196
Avoid passing zero bias to Gemm in gradients (#7244)
* Avoid passing zero bias to Gemm in gradients

The bias argument to Gemm is optional and defaults to zero. Therefore we do not need to generate zero initializers and pass them to that argument.

* Remove unused declaration.
2021-04-06 16:49:34 -07:00
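The reasoning in #7244 (an omitted Gemm bias behaves like a zero bias) can be checked with a minimal pure-Python sketch. The `gemm` function below is a hypothetical stand-in for the operator's math, not ONNX Runtime code: it omits the `transA`/`transB` attributes and broadcasting, and takes the bias `c` as a full matrix.

```python
def gemm(a, b, c=None, alpha=1.0, beta=1.0):
    """Compute Y = alpha * (A @ B) + beta * C on nested lists.

    `c` is optional; when it is None the bias term is skipped, which is
    equivalent to adding a zero matrix.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    y = [
        [alpha * sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]
    if c is not None:
        y = [[y[i][j] + beta * c[i][j] for j in range(cols)] for i in range(rows)]
    return y


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # Omitting the bias and passing an explicit zero bias give the same result.
    print(gemm(a, b) == gemm(a, b, zeros))
```

Because both calls produce the same output, a gradient graph can leave the bias input unset instead of materializing a zero initializer for it.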
ashbhandare
2aa89989c4
Not-where fusion (#7182)
* Not-where fusion

* Change to rewrite rule

* Add to inference transforms

* Support multiple where consumers

* review comments
2021-04-06 16:12:26 -07:00
raviskolli
5d759e182b
Allocate external Rocm allocator via PyBind (#7148)
* Enabled rocm support for graph transformations

* Support for external Hip allocator

* Added const_cast to reinterpret_cast to fix compiler issue

* Another crack at fixing the compile error

* More compilation fixes

* Added compilation flags to load_inline extension

* Added ROCM, ROCM_PINNED constants

* Changes to address PR comments

* Changed gpu identifier from ROCM to CUDA

* Added HIP compilation flag for torch inline functions

* Fixed a typo in header allocator string formatting

* Fix for runtime error with external_cuda_allocator

* Removed cuda/rocm specific code paths for allocators

* More name changes to generic gpu from rocm/cuda

* Removed duplicate allocator creation

* Rename cuda_external_ config options as gpu_external_

* Rename hip_mem_limit to gpu_mem_limit

* Rename cuda_mem_limit to gpu_mem_limit
2021-04-06 15:23:51 -07:00
G. Ramalingam
a9ff4c29e5
Add function body to GeluGrad schema (#7190)
* Add GeluGrad function definition

* complete gelugrad function definition

* add opset to function definition
2021-04-06 12:40:59 -07:00
ashari4
56b22c1c6b
Fix assert that the tensor's device type is 'cpu' #7248 2021-04-06 09:08:32 -07:00
Pranav Prakash
3b16afc0db
Make dW optional for convgrad (#7083) 2021-04-05 17:05:20 -07:00
Suffian Khan
9f14af9809
Add BERT-L perf regression test on MI100 and re-enable batch size test (#7240)
* restore bs test and add perf test

* update perf number and fix path to results
2021-04-05 15:51:52 -07:00
ashbhandare
2b8513539e
Div mul fusion (#7183)
* Div mul fusion

* Change to rewrite rule

* Add to inference transformers
2021-04-05 09:35:30 -07:00
Weixing Zhang
74ee24cf7f
rename cuda_mem_limit and hip_mem_limit to gpu_mem_limit for both CUDA EP and ROCm EP (#7226)
With this change, training scripts no longer need to differentiate between the CUDA EP and the ROCm EP when setting the memory limit option.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-05 09:04:04 -07:00
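For a training script that still carries the old option names, the rename in #7226 could be handled with a small shim along these lines. The helper `normalize_mem_limit` and the plain-dict shape of the options are assumptions made for illustration; only the three key names come from the commit message, and nothing here is part of the ONNX Runtime API.

```python
# Option names replaced by the rename (taken from the commit message).
LEGACY_KEYS = ("cuda_mem_limit", "hip_mem_limit")


def normalize_mem_limit(options):
    """Return a copy of `options` with legacy keys renamed to gpu_mem_limit.

    Hypothetical helper: an explicit gpu_mem_limit entry takes priority
    over any legacy key found in the same dict.
    """
    result = dict(options)
    for key in LEGACY_KEYS:
        if key in result:
            value = result.pop(key)
            result.setdefault("gpu_mem_limit", value)
    return result


if __name__ == "__main__":
    print(normalize_mem_limit({"cuda_mem_limit": 1024, "device_id": 0}))
    print(normalize_mem_limit({"hip_mem_limit": 2048}))
```

The same script can then pass one options dict regardless of which GPU execution provider is in use.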
baijumeswani
68b12a6179
Support for saving and loading pytorch compatible state dictionaries (#7220)
* Override methods on torch.nn.Module to get direct access to the methods on the original module.
2021-04-05 03:40:41 -07:00
Weixing Zhang
59b57d8322
HSA_NO_SCRATCH_RECLAIM and RCCL_ALLTOALL_KERNEL_DISABLE are not needed for ROCm 4.1 (#7224)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-02 18:19:11 -07:00
Weixing Zhang
ef88dc912c
enable more unit tests for ROCM EP (#7222) 2021-04-02 15:57:08 -07:00
Sherlock
a98c2ebb8c
Enable saving optimized models in OrtModule (#7214)
* Enable saving optimized models in OrtModule

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-02 12:37:05 -07:00
Weixing Zhang
a3f17c8b0d
update lamb and GatherGrad kernel for ROCm EP (#7184)
With ROCm 4.1, the CUDA implementation of Lamb and GatherGrad can be utilized for ROCm EP.
2021-04-02 09:02:49 -07:00
Edward Chen
0ebeaf529d
Check kernel def hashes (#7120)
Add unit test for verifying kernel def hashes.
Add way to add new types to kernel definition without changing hash.
2021-04-01 17:42:58 -07:00
ashbhandare
15c67ddbf0
Make output 1 of ConcatTraining Optional and place on CPU (#7199)
* Optional input 1 on CPU ConcatTraining

* Rename output_1
2021-04-01 16:05:17 -07:00