Commit graph

4691 commits

Author SHA1 Message Date
Tracy Sharpe
d13e5b2fd9
NCHWc: ReorderInput improvements (#7442)
Implement various improvements related to reordering a tensor for use by NCHWc operations:

* Relax the requirement that the input channel count must be a multiple of the NCHWc block size (either 8 or 16 depending on ISA). The requirement now is that the channel count must be a multiple of 4. The implementation of MlasReorderInputNchw would need further work to support relaxing this further, but I don't have any models where I've observed this to be necessary yet.

* Support fusing a Transpose(NHWC->NCHW) into a following ReorderInput. ReorderInput now has a channels_last attribute as was done in the past for ReorderOutput. This helps with models converted from TF where the converter is unable to remove all Transpose operations.

* Add threading support to ReorderInput to accelerate performance (ReorderOutput will come later).
2021-04-26 19:16:39 -07:00
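
The reordering described in d13e5b2fd9 can be illustrated with a small NumPy sketch. This is not the MLAS implementation: the function name `reorder_input_nchwc`, the zero padding of the final partial block, and the output shape `(N, C/block, H, W, block)` are assumptions made for illustration. Only the multiple-of-4 channel rule, the block sizes of 8 or 16, and the `channels_last` attribute come from the commit message.

```python
import numpy as np


def reorder_input_nchwc(x, block_size=8, channels_last=False):
    """Reorder a 4-D tensor into a blocked NCHWc layout (illustrative only)."""
    if channels_last:
        # Fold the NHWC -> NCHW transpose into the reorder step.
        x = np.transpose(x, (0, 3, 1, 2))
    n, c, h, w = x.shape
    if c % 4 != 0:
        raise ValueError("channel count must be a multiple of 4")
    # Assumption: pad the channel axis with zeros up to a whole number of blocks.
    padded_c = -(-c // block_size) * block_size
    padded = np.zeros((n, padded_c, h, w), dtype=x.dtype)
    padded[:, :c] = x
    # Split channels into (blocks, block_size) and move block_size innermost.
    blocked = padded.reshape(n, padded_c // block_size, block_size, h, w)
    return np.ascontiguousarray(blocked.transpose(0, 1, 3, 4, 2))


if __name__ == "__main__":
    nchw = np.arange(48, dtype=np.float32).reshape(1, 12, 2, 2)
    print(reorder_input_nchwc(nchw, block_size=8).shape)
```

With 12 input channels and a block size of 8, the sketch produces two channel blocks, the second of which is half padding. Passing an NHWC tensor with `channels_last=True` yields the same result as transposing to NCHW first.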
M. Zeeshan Siddiqui
82108b18e3
Partial graph execution perf improvements. (#7438)
* Partial graph execution perf improvements.

* PR feedback.

* Decrement reference count of tensors in ORTModule.

* PR feedback.

* PR feedback.

* PR feedback.
2021-04-26 17:13:55 -07:00
Thiago Crepaldi
0702a14ee7
Add pytorch version check before loading Python ONNX Runtime training module (#7377) 2021-04-26 14:53:50 -07:00
Edward Chen
4804ede501
Update build docker image cache cleanup build definition (#7452)
Decrease default cache history length to 4 days.
Other minor updates to build definition.
2021-04-26 14:39:46 -07:00
RandySheriffH
40568d8821
Wait for dispatch done in RunParallelSection to fix random TP UT crash (#7443)
* wait for dispatch done in RunParallelSection

* pass worker_fn by value

* cancel move

* only move work_fn at its last reference

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2021-04-26 14:12:10 -07:00
Zhang Lei
ada0fbbd2d
Implement qlinear concat and unit test. (#7341)
* Implement qlinear concat and unit test.
Add quantization tools for QLinearConcat and its quantization tests.

* Add kernel def hash for QLinearConcat.

* Change according to PR. Add qdq transformer support for QLinearConcat.

* Add QDQ Transformer unittest. Fix typo on domain.

* remove unused duplicate logic.

* fix x86 build error.

* Update operator docs.
2021-04-26 13:38:40 -07:00
Changming Sun
b5592856a7
Remove thread pool's cancel method and suppress some warnings (#7411) 2021-04-26 09:33:48 -07:00
Vincent Wang
368e4a324f
SqueezeGrad Bugfix (#7412)
* squeezegrad bugfix

* fix ut

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2021-04-26 09:12:03 +08:00
Weixing Zhang
ca9b3f18e9
Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414)
* Pass cuda stream to thrust function to not use default stream.

Commit 299ace0 changed ORT to no longer use the cuda default stream.

* update amd_hipify.py

* remove unnecessary stream sync

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-25 01:18:56 -07:00
jeyblu
b9cbbc41ff
dnnl matmul tensor dimension check (#7383) 2021-04-23 23:17:22 -07:00
RandySheriffH
afe912d47c
Reduce perf gap between thread pool and omp (#7333)
* add async dispatch

* minor renamings

* build py38

* restore yml

* fix sync up issue between dispatch thread and main

* fix comments

* refactor SummonWorker and rename to RunInParallelInternal
2021-04-23 18:36:36 -07:00
Thiago Crepaldi
410a81b21b
Add support for ORTModule to execute the graph when ONNX drops unused… (#7424) 2021-04-23 18:10:57 -07:00
Chen Fu
f4f2cc1a00
Add batch interface to floating point GEMM (#7323)
Currently, for high-dimensional MatMul, we call multiple GEMMs sequentially. This change executes these GEMMs in parallel, removing the barriers between adjacent GEMM operations.

Performance was tested with the BERT and T5 models. The BERT model shows no noticeable performance difference, as the heavy lifting is done by the attention operator, which this PR does not change. The T5 model shows no regression at a low thread count (4), and the improvement is more pronounced at higher thread counts (8-16): T5 shows a 10% speedup with 16 threads. Profiling shows that the most expensive MatMul operators in T5 achieve around a 20% speedup with 16 threads.

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2021-04-23 17:34:22 -07:00
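
The idea behind f4f2cc1a00 (running the per-batch GEMMs of a high-dimensional MatMul concurrently instead of one after another) can be sketched in Python. This is only a conceptual sketch: the real change lives in ORT's C++ GEMM code, and the function names, the use of `ThreadPoolExecutor`, and the `num_threads` parameter here are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def batched_matmul_sequential(a, b):
    """One GEMM per batch entry, each finishing before the next starts."""
    return np.stack([a[i] @ b[i] for i in range(a.shape[0])])


def batched_matmul_parallel(a, b, num_threads=4):
    """Submit every per-batch GEMM to a thread pool with no barrier between them."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so results line up with batch indices.
        results = list(pool.map(lambda i: a[i] @ b[i], range(a.shape[0])))
    return np.stack(results)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 64, 32))
    b = rng.standard_normal((8, 32, 16))
    same = np.allclose(batched_matmul_sequential(a, b), batched_matmul_parallel(a, b))
    print("results match:", same)
```

Both functions compute the same product; the parallel version only changes how the independent GEMMs are scheduled. Any speedup in this sketch depends on the NumPy build and BLAS backend, so it should not be read as reproducing the numbers quoted in the commit.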
Suffian Khan
7a3c1787af
Add CI pipeline to publish Python training package targeting Rocm (#7417)
* first attempt rocm training wheel

* modifications needed to python packaging pipeline for Rocm 4.1

* changes to not conflict with cuda

missed stage1 changes

remove package push

add option r to getopt

try again without python install

try again without python install

try again without python install

split pipelines and add back push to remote storage

try on cuda gpu pool

try again

try again

try running without az subscription set

try again on original pipeline

change pool

passing AMD Rocm whl on AMD-GPU pool

split rocm pipeline from cuda pipeline

remove comments

* try adding Rocm tests as well

* try with tests in place

* fix trailing ws

* add training data

* try again as root for tests

* use python3

* typo

* try to map video, render group into container

* try again

* try again

* try to avoid yum error code

* make UID 1001

* try without yum downgrade

* define rocm_version=None

* remove CUDA related comments for Rocm Dockerfile

* Don't pin nightly torch torchvision torchtext versions as they expire (for now nightly is required for Rocm 4.1)

* missed requirements-rocm.txt from last commit

* fix whitespace
2021-04-23 17:22:31 -07:00
M. Zeeshan Siddiqui
34ebf7d3dd
Partial graph execution made simple. (#7324)
* Python changes.

* C++ changes.

* fixes/hacks.

* more hacks.

* perf.

* changes.

* changes.

* re-architect partial graph execution and remove iobinding.

* changes.

* refactor.

* prevent copies from python to c++.

* perf.

* merge conflicts.

* misc.

* fix merge conflicts and tests.

* Ifdef partial executor.

* PR feedback.

* Delete ORT Task et al.

* Clean up.

* clean up.

* Restore SetOutputMLValue().

* PR feedback.

* Re-enable disabled ORTModule tests.

* PR feedback.

* PR feedback.
2021-04-23 15:09:18 -07:00
Changming Sun
5208231126
Fix some warnings in our CUDA code (#7436) 2021-04-23 14:56:20 -07:00
Suffian Khan
8889e717eb
add gather elements (#7435) 2021-04-23 14:05:17 -07:00
Weixing Zhang
ef72764960
Build would fail when nccl is not under standard path (--nccl_home) (#7402)
* Build would fail when nccl is not under standard path (--nccl_home)

* fix build for ROCm EP
2021-04-23 14:04:22 -07:00
Changming Sun
9f683bae78
Revert the TRT change and move the build to a new pool (#7434) 2021-04-23 14:00:26 -07:00
satyajandhyala
979d63159b
Add level two optimizations for constant propagation transformation. (#7410)
* Made the python script generating the testcases modular.

* Modified RemoveBackToBackCasts function to remove cast even if the parent node has other consumers.

* Modified InsertCastNodes to update the graph consistently for other functions to work.

* Moved ConcatNames function to the top.

* PropagateBackward/SearchUpstream and PropagateFP16CastsFromOutputsToInputs insert FP32 casts if the level >1 in order to propagate FP16 casts backwards.

* Added new testcases for level two setting.
2021-04-23 13:25:54 -07:00
Chi Lo
f1c3f3fcc1
TRT EP memory leak fix (#7415)
* fix memory leak

* small refactor

* code refactor
2021-04-23 12:04:23 -07:00
Guoyu Wang
043883b52d
[CoreML EP] Add Gemm/MatMul support (#7403)
* [CoreML EP] Add gemm/matmul support

* remove changes in get_execution_providers

* Address CR comments

* Switch to list initialization

* Minor update
2021-04-23 11:54:59 -07:00
Yufeng Li
e7912736b9
Add qdq propagation support (#7404)
* Add qdq propagation support

* add more unit tests
2021-04-23 11:17:44 -07:00
Tang, Cheng
1fa6d8fe1c
support loading external execution provider from python frontend (#7332)
* initial dynamic load example

* support load EP in the provider options

* support dynamic load EP in orttrainer

* split the provider interface; fix comments in pr

* remove experiment code

* add test

* remove useless file

* add test model file; fix linux break

* fix linux build and missing file

* fix python build

* fix python build

* fix python binding

* fix python test

* fix runtime path for posix env

* exclude the shared library from minimal build

* fix comments in pr;

* separate the provider shared lib loading

* excluded from minimal / macos / ios build

* skip copy the provider shared lib for minimal build and mac os

* fix macos build

* exclude the test for macos build

* exclude from android build

* exclude from web assembly build

* enable the invalid ep test

Co-authored-by: Cheng Tang <chenta@microsoft.com>
2021-04-23 09:54:09 -07:00
Ashwini Khade
75e054cd33
pick onnx release candidate (#7177)
* pick onnx release candidate

* fix typo

* filter batchnorm tests

* add implementation for reshape 14

* add identity op kernel for opset 14

* fix typo

* update onnx commit

* update commit to latest master

* add hashes for new kernel registrations and update 1

* TEST commit

* update onnx back to right commit

* Update onnx to latest in rel-1.9.0

* temp fix

* remove nonzeroshapesetter transformer

* pick rel branch latest commit

* fix build failures

* fix build failures

* fix build failures

* update the commit to latest in release branch

* add test filters for not implemented op14 ops in c# tests

* plus review comments
2021-04-22 23:57:09 -07:00
Guoyu Wang
d414039189
Add ios coreml ci, and speedup ios ci run (#7420) 2021-04-22 23:41:58 -07:00
sumitsays
d67c86265b
Enabled fp16-inception-v1 test (#7406)
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
2021-04-22 23:05:03 -07:00
Yulong Wang
b56dd037d3
increase timeout for nodejs binding test (#7422)
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-22 21:40:40 -07:00
raviskolli
4c8513a627
SimplifiedLayerNormalization kernel for ROCM EP (#7409)
* Add SimplifiedLayerNormalization kernel to ROCM ep.
2021-04-22 21:25:09 -07:00
Changming Sun
6822ae95ec
Reduce the number of TensorRT tests needed to run (#7419) 2021-04-22 19:14:39 -07:00
Thiago Crepaldi
771a6d235b
Fix IsContiguousTensor check on backend (#7391) 2021-04-21 17:01:17 -07:00
Changming Sun
afa7b23609
Update docs/ContribOperators.md and the script that generates it. (#7399) 2021-04-21 16:20:56 -07:00
Brian Popow
1bbe538379
Update references 2021-04-21 13:36:10 -07:00
Brian Popow
aa1ce726aa
Remove unnecessary encoding step 2021-04-21 13:36:10 -07:00
Changming Sun
65b2b87f83
Update CI build docker images (#7386)
Update CI build docker images: delete ubuntu 16.04 support.
2021-04-21 13:18:34 -07:00
raviskolli
09313d9e1f
Added GreaterOrEqual and LessOrEqual Ops to RocmEP (#7398)
* Added GreaterOrEqual and LessOrEqual Ops to Rocm EP
2021-04-21 11:44:24 -07:00
Changming Sun
b4cfa88bf7
Update protobuf to the latest version (#7396) 2021-04-21 10:30:06 -07:00
Changming Sun
243713c464
Upload detailed code coverage result to azure blob storage (#7392) 2021-04-21 08:24:44 -07:00
Sherlock
16ca7677e6
Relax ConvGrad Test tol (#7393)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-21 08:06:00 -07:00
Changming Sun
b5493d724c
Update rnn_helpers.cc: add #ifdef to DumpMatrixImpl (#7389) 2021-04-20 22:11:38 -07:00
Hariharan Seshadri
7b11283af0
Add ability to allocate initialized tensor memory from non-arena memory (#7267) 2021-04-20 20:27:48 -07:00
Thiago Crepaldi
8421124344
Add support to **kwargs in ORTModule forward() method (#7360) 2021-04-20 16:21:52 -07:00
ashbhandare
76cc118dbe
Gemm transpose fusion (#7306)
* Gemm transpose fusion

* Correct rewrite rule effect

* Add to inference transforms to trigger on gradient graph
2021-04-20 09:35:05 -07:00
Xiaoyu Liu
913ea8264b
GPT2 with one step beam search (#7163)
* beam search refactoring checkin
* add factory class and deduplicate code
* one step beam search works on gpu

Co-authored-by: Xiaoyu Liu <xiaoyu@xiaoyu-VM.z4vh1dzj5eoevgybsksdpz2izh.jx.internal.cloudapp.net>
2021-04-20 06:23:52 -07:00
mindest
1a3ddf0714
Add gradient registration and tests for Min/Max (#7217)
* Add gradient registration and tests for Min/Max

* Add helper function for min/max grad test

* limit Min/Max Grad to accept at most two inputs; modify test case accordingly

* resolve merge error
2021-04-20 18:14:31 +08:00
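
A two-input Min/Max gradient of the kind 1a3ddf0714 registers can be sketched in NumPy. This is not ORT's gradient builder: the function names, the requirement that both inputs share one shape, and the choice to send the gradient to the first input on ties are all assumptions made for illustration.

```python
import numpy as np


def max_grad(a, b, grad_out):
    """Route grad_out to whichever input supplied the elementwise maximum."""
    a_wins = a >= b  # Assumption: ties send the gradient to the first input.
    return np.where(a_wins, grad_out, 0.0), np.where(a_wins, 0.0, grad_out)


def min_grad(a, b, grad_out):
    """Route grad_out to whichever input supplied the elementwise minimum."""
    a_wins = a <= b  # Assumption: ties send the gradient to the first input.
    return np.where(a_wins, grad_out, 0.0), np.where(a_wins, 0.0, grad_out)


if __name__ == "__main__":
    a = np.array([1.0, 5.0, 3.0])
    b = np.array([2.0, 4.0, 3.0])
    g = np.array([10.0, 20.0, 30.0])
    print(max_grad(a, b, g))
    print(min_grad(a, b, g))
```

Each output element depends on exactly one input element, so the two returned gradients always sum to `grad_out`. Keeping the sketch to two inputs mirrors the limit stated in the commit.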
Sherlock
ce7ff27bac
Fix perf issue in Conv CUDA kernel (#7348)
* Fix perf issue in Conv CUDA kernel

* Read available memory from device

* assuming 10% fragmentation

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-19 23:37:05 -07:00
ashbhandare
ac346a1b90
Modify SimplifiedLayerNormFusion to allow fusion in the presence of Casts optionally (#7352)
* LN transform partial changes

* LN transform fix

* Make transform optional, remove unnecessary code

* Fix windows build

* review comment, windows CI fix

* review comments
2021-04-19 19:59:23 -07:00
ytaous
7abe1fd392
Identity elimination with graph output (#7312)
* Identity removal

* fix build

* fix build

* fix build

* fix build

* UTs

* fix UT

* fix UTs

* per comments

* fix UTs

* fix UTs

* per comments

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-19 16:36:35 -07:00
Sheil Kumar
265db2ad96
Fix Microsoft.AI.MachineLearning .NET5 publishing and C# Store Release build (#7373)
* fix .net publishing

* make experimental api build with microsoft.ai.machinelearning.idl import

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2021-04-19 15:36:43 -07:00
satyajandhyala
bb1e417da0
Add logging support to Cast Propagation transformation from python (#7353)
* Fixes needed to PropagateCast transformation.

* Added number of passes to the logs.

* Added logging support to OrtModuleGraphBuilder.

* Added new testcases.

* Added NodeArgToConsumerMap
2021-04-19 12:14:30 -07:00