Commit graph

514 commits

Author SHA1 Message Date
baijumeswani
f1ade14e44
Assert that the data is on the same device as ORTModule (#6942) 2021-03-08 17:03:28 -08:00
Vincent Wang
56c5620fd2
Disable Materializing Grads (#6822)
* disable materialize grads

* gradient builder bugfix

* fix ut

* fix ut

* resolve comments and bugfix

* add more assert

* disable forward compare for now
2021-03-08 16:56:06 +08:00
Thiago Crepaldi
dfc7c18e31
Introducing TrainingAgent interface to perform training using YieldOp (#6898) 2021-03-05 17:03:46 -08:00
baijumeswani
79f832c682
Separate requirements.txt file for ORTModule pipelines (#6879)
* Move all ORTModule dependency installations to ortmodule subfolder
2021-03-05 14:12:11 -08:00
ytaous
ac4d615553
Enable priority-based execution order as default to support inputs with symbolic/dynamic shape (#6892)
* priority-based exec order

* disable 1 failing test

* fix UT

* more comments

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-03-04 22:36:25 -08:00
Sherlock
749e6a08a6
Add more asserts for ORTModule forward's correctness (#6887)
* Add more asserts on forward outputs

* Found one more failing case

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-03-03 19:57:42 -08:00
Vincent Wang
4238ce341a
Add External Outputs Flag for YieldOp (#6789)
* add external outputs flag for YieldOp

* use kPreExisting

* add ut for mem_pattern

* fix ut after merge from master
2021-03-02 11:38:18 +08:00
Sherlock
12edf22f11
Merge pull request #6838 from microsoft/mzs/ortmodule-api-sync-from-master-210226
Sync from master
2021-02-27 12:32:36 -08:00
Thiago Crepaldi
f71d93ea2b
Enable PyTorch Lightning basic test on CI (#6809) 2021-02-27 09:35:42 -08:00
M. Zeeshan Siddiqui
ca48310d6d Merge branch 'master' of https://github.com/microsoft/onnxruntime into mzs/ortmodule-api-sync-from-master-210226 2021-02-27 04:25:23 +00:00
Sergii Dymchenko
059ed1c241
Copy forward signature from PyTorch model. (#6777) 2021-02-26 13:02:13 -08:00
satyajandhyala
355057cf9c
Added RequiredGrad attribute to YieldOp (#6657)
* Added required_grad attribute to YieldOp

* Changed YieldOp attribute to hold the indices of the required gradient outputs from the count, and removed the code reordering the outputs.

* Changed backward_output_grad_names to a map from backward output gradient name to the corresponding output index.
2021-02-26 10:38:38 -08:00
Sherlock
8a450d523f
Check gradient correctness in the UTs (#6803)
* Check gradient correctness in the UTs

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-02-25 13:31:07 -08:00
baijumeswani
fa8a9015bd
Mount hf model cache and use cache for loading hf models (#6810) 2021-02-25 13:30:14 -08:00
Sergii Dymchenko
99ffffbe6a
Remove backward workaround from test. (#6811) 2021-02-25 13:23:46 -08:00
ashbhandare
b05403d877
Clear iobinding outputs (#6774) 2021-02-25 11:50:43 -08:00
Sherlock
8e200e13fe
Rewrite ORTModule background task coordination (#6700)
* Introduce OrtTasks to replace EventPool

* return run_id to frontend

* pass run_id to backward

* OrtTasks support multiple bg_events

* make message_queue a member of orttask

* Replace MessageQueue with std::promise

* Move status_promise into Task

* Move terminate flag into Task

* Reenable previously disabled UTs

* Add unit tests

* Replace condition variables with std::promise

* Move to CreateBackgroundTask in the main thread

* return status and output in forward_future

* use throw for terminating background thread

* cleanup tasks at destructor

* reenable test_mixed_nnmodule_ortmodules_training

* add mutex for ORTTasks functions

* add mutex for bg_threads

* delay tests before start

* add ut for multi-task common backbone

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-02-24 18:00:25 -08:00
baijumeswani
7ce4075bbd
Support nested sequence and mapping types in ORTModule (#6791) 2021-02-24 15:45:56 -08:00
Weixing Zhang
40fa40f3ce
Enable more unit tests for ROCM EP (#6776)
* enable more ops and unit tests for ROCM EP
2021-02-24 15:20:50 -08:00
Thiago Crepaldi
aa5cd37ac8
Refactor device handling and basic support for PyTorch Lightning (#6758) 2021-02-24 14:12:55 -08:00
jingyanwangms
c02ec38f8a
[Running CI now] Remove duplicate tests to speed up CI (#6768)
* remove tests to speed up CI

* add back _into_data_parallelism tests to see how long the CI test takes

* remove unnecessary save calls

* add back data_parallelism_full_precision_bart_path

* add data_parallelism_full_precision_path

* remove data parallelism tests

Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-02-23 23:21:06 -08:00
baijumeswani
65ba51d93e
Re-enable test and increase timeout (#6785) 2021-02-23 18:51:06 -08:00
Thiago Crepaldi
563218dcda
Update torchtext usage for pytorch transformer sample (#6767)
* Update torchtext usage for pytorch transformer sample
* Temporarily disable tests to unblock repo (failures are being worked on already)
* Update loss numbers for ORTTrainer UTs
2021-02-23 14:06:35 -08:00
Sergii Dymchenko
58f3aca95d
Support keyword arguments for ORTModule (#6539)
* Support keyword arguments for ORTModule.

* Add backward workaround to the test.

* Specify test name directly without -k.

* Handle unused inputs removed by ONNX exporter.
2021-02-19 13:40:44 -08:00
M. Zeeshan Siddiqui
1a2f1bd23a
Enable external CUDA allocator in ORTModule. (#6745)
* Enable external CUDA allocator in ORTModule.

* Fix assert after unification of allocators.

* Update no grad memory test.

* update comments.

* fix provider options array when not sharing allocator.
2021-02-18 20:01:13 -08:00
Thiago Crepaldi
fb3f1f5cc1
Enable custom ops on ORTModule (#6740) 2021-02-18 09:08:10 -08:00
M. Zeeshan Siddiqui
40dda452cf Merge branch 'master' of https://github.com/microsoft/onnxruntime into mzs/sync-from-master 2021-02-18 03:03:01 +00:00
M. Zeeshan Siddiqui
5b7e7aaa45 Move event_pool and message_queue to core. 2021-02-17 11:50:56 -08:00
M. Zeeshan Siddiqui
eecce31a8b Fix build, cleanup. 2021-02-17 11:50:41 -08:00
Thiago Crepaldi
3184c47ad1 Merge branch 'master' into thiagofc/merge-from-master 2021-02-17 11:49:52 -08:00
Wei-Sheng Chin
9e67b88c83
Use local rank as GPU ID (#6719) 2021-02-17 22:42:54 +08:00
baijumeswani
01dfa8e125
Support non tuple return values from torch.nn.module (#6660)
* Support dictionary, namedtuples and Hugging Face ModelOutput type for model return values
2021-02-16 20:48:32 -08:00
Thiago Crepaldi
7f33671ade
Handle multiple devices scenarios (#6672)
* Handle multiple devices scenarios
2021-02-16 18:22:30 -08:00
Thiago Crepaldi
7ee5baa60d
Remove monkey patch for PyTorch Nightly + ORTTrainer (#6659) 2021-02-16 17:24:50 -08:00
Edward Chen
b2cddc5337
Consolidate MLTypeCallDispatcher classes (#6651) 2021-02-12 13:26:56 -08:00
Suffian Khan
e6de0eb813
Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611)
* Partial updating of ROCM reduction code.

* Update reduction_all.cu

* Add reduce template parameters.

* miopen common

* Reuse CUDA's reduction_functions.cc

* Reduction ops.

* Update remaining reduction ops to use MIOpen. The double datatype is not supported, so disable those typed kernels.

* Disable a couple more unsupported tests.

* Code formatting.

* Delete ROCM-specific reduction code that is identical to CUDA reduction code.

* Fix scratch buffer early free.

* Fix merge conflict.

* first attempt nightly amd ci pipeline

* try fix bad yaml file

* try again with corrected model directory

* add convergence test as well

* update reference loss for amd mi100

* include mi100 test results csv

* update the mi100 convergence test reference values

* update batch sizes for mi100 32g

* fix gpu sku for run_convergence_test.py

* undo unrelated changes to master

* pr comments

* pr comment

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2021-02-12 13:22:06 -08:00
Yufeng Li
1c3168c0f6
Skip constant folding dequantizelinear for quant qdq format (#6643)
* skip constant folding dequantizelinear for quant qdq format
2021-02-11 14:06:13 -08:00
Thiago Crepaldi
0732d72706
Add support for dynamic axes for outputs + check model output type before export (#6648) 2021-02-11 10:18:02 -08:00
Sherlock
9294dde143
Rename ONNX graphs variables in ORTModule (#6645)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-02-10 20:10:23 -08:00
Vincent Wang
eec602e48a
OrtModule v0.21 (#6395)
* ortmodule v0.2

* use pt module for eval

* get user outputs in yield op

* pass output grads to yield output without copy

* Disable mem_pattern for ORTModule

* Avoid allocating output buffer for Yield op

* Change to WaitAndReset to avoid overriding signal

* remove unnecessary signal/wait at the end of bg thread

* Return Session.Run result as a std::future

* export model with torch.no_grad()

* Handle bg thread's early return in Forward call

* Removed duplicated Yield kernel

* Silence "CUDA kernel missing log"

* Add missing transforms, clear iobinding (#6532)

* revert ortmodule.py to a working state first

* Apply ortmodule.py change from dev branch

* Rename to YieldOp

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: ashbhandare <ash.bhandare@gmail.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
2021-02-10 13:27:15 -08:00
Derek Murray
88d48063fa
Log warning when GetGradientForOp() silently fails. (#6586)
* Add warning when GetGradientForOp() silently fails.

In some cases, `GetGradientForOp()` can return without creating any nodes, which may lead to an invalid graph being created.
2021-02-10 10:01:16 -08:00
Wei-Sheng Chin
8972621138
Generate shape-independent graph if any input dimension < 2 (#6581)
* Throw for non-supported case

* Not to go to shape-dependent branch when seeing unsupported shapes
2021-02-10 15:44:25 +08:00
Cian Hayes
16eed68a1e
Fix layer_norm.cc on x86 (#6556)
* Fix LayerNormGrad on x86

* PR feedback
2021-02-08 17:36:14 -08:00
Jesse Benson
d18aa45b46 Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models. 2021-02-06 14:40:34 -08:00
George Nash
b50b0a89aa Fix build failure when building with --build_wheel on Windows
This resolves issue #6536

Signed-off-by: George Nash <george.nash@intel.com>
2021-02-05 18:59:01 -08:00
Scott McKay
ccfd90291b
Remove condition from ORT_RETURN_IF[_NOT] macro output. (#6563)
Remove condition from ORT_RETURN_IF[_NOT] macro output as repeating the condition doesn't add much value compared to the explicit error message, and the error message includes the file and line anyway so it's easy enough to find the condition if needed.
Update the few places where the macros were used without an explicit error message to provide an explicit error message.

Saves 12.5KB in a minimal MinSizeRel build with all DNN ops, 16KB in full release build.
2021-02-05 17:33:29 -08:00
Weixing Zhang
299ace0759
Support to allow user to specify compute stream per session (#3723)
* Support to allow user to specify compute stream per session

Create computation cuda stream explicitly rather than use default legacy stream or per-thread default stream.

remove some redundant cudaStreamSynchronize

fix gpt2 model test failures

don't use default stream in nccl either.

add stream synchronization in OnRunEnd()

using cub::DeviceScan::InclusiveSum which can be called with stream specified.

fix topK failure due to latest rebase

fix tensorrt

support user specified stream

add user_stream support in tensorrt EP

use same stream for both tensorrt and CUDA EP.

fix ScatterND

specify stream for adasum and p2p kernels.

fix loop

fix CApiTest.custom_op_handler

fix CApiTest.varied_input_custom_op_handler

change for cudaMemcpyFromSymbol

improve provider options for user specified compute stream

* add changes for ROCM EP

* fix GatherGrad UT for ROCM EP

* clean code and fix NonMaxSuppression

* use default stream for ROCM now

* fix CApiTest.custom_op_handler:OrtFormatCustomOpTests.ConvertOnnxModelToOrt

* fix tensorrt ut: CApiTest.io_binding_cuda

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-02-05 15:48:18 -08:00
Jesse Benson
a9e4d70b50 Fix merge conflict. 2021-02-04 15:00:05 -08:00
Jesse Benson
86ac11af1a Delete ROCM-specific reduction code that is identical to CUDA reduction code. 2021-02-04 15:00:05 -08:00
Jesse Benson
5d8792705b Code formatting. 2021-02-04 15:00:05 -08:00