Commit graph

2248 commits

Author SHA1 Message Date
Sherlock
b4d4ea2e5f
GatherND-12 Implementation (#3645)
* Renamed, UT passing

* Move GatherND CUDA Kernel into onnxruntime

* Merge GatherNDOpTest

* Refactor Test code

* Merge CPU Kernel Impl

* Handle Negative Indices, Fix UT

* Improve CUDA kernel to handle negative index

* Minor Fixes

* Preserve GatherND-1 Cuda kernel

* Fix Mac build

* fix UT

* Fix Build

* fix GatherNDOpTest.double > CUDA error cudaErrorInvalidDeviceFunction:invalid device function

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Peng Wang (pengwa) <pengwa@microsoft.com>
2020-04-24 20:55:30 +08:00
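The negative-index handling mentioned in the commit above can be illustrated with a minimal pure-Python sketch. This is not the CPU or CUDA kernel from the commit: the function name `gather_nd`, the nested-list representation, and the flat list of index tuples (no batch dimensions) are assumptions made for illustration only.

```python
def gather_nd(data, indices):
    """Gather elements or slices from nested lists.

    Hypothetical helper: `data` is a rectangular nested list and
    `indices` is a list of index tuples. A negative index i on an
    axis of size d is interpreted as i + d.
    """
    # Record the size of each axis of `data`.
    dims = []
    probe = data
    while isinstance(probe, list):
        dims.append(len(probe))
        probe = probe[0]

    out = []
    for index in indices:
        item = data
        for axis, i in enumerate(index):
            if not -dims[axis] <= i < dims[axis]:
                raise IndexError(f"index {i} out of range for axis {axis}")
            if i < 0:
                i += dims[axis]  # wrap negative index
            item = item[i]
        out.append(item)
    return out


data = [[0, 1], [2, 3]]
print(gather_nd(data, [[0, 0], [1, 1]]))    # [0, 3]
print(gather_nd(data, [[-1, 0], [0, -1]]))  # [2, 1]
print(gather_nd(data, [[-1]]))              # [[2, 3]]
```

An index tuple shorter than the rank of `data` selects a whole slice, as the last call shows.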
Weixing Zhang
2f8a17dcde
thrustallocator is not needed since cub is used directly for gather now. (#3683)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 01:51:54 -07:00
Weixing Zhang
c929963d74
type cast for ratio is not necessary for dropout (#3682)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 00:49:37 -07:00
Weixing Zhang
f4a04c04e1
move cpu/cuda related files to corresponding cpu/cuda folder (#3668)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-24 00:12:02 -07:00
Weixing Zhang
336624806e
Simplify and clean code (#3655)
1. It is not necessary to include cudnn_common.h for kernels which are not implemented with CUDNN.
2. Minor change in layer norm kernel to simplify the code and resolve a build warning.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-23 10:12:55 -07:00
XiaocenDong
125f68f305
fixed mnist bug (#3569)
* fixed mnist bug

* fixed train_step param
2020-04-23 23:22:38 +08:00
Xueyun Zhu
f1ba9aaf34
Add pipeline transformer for wait/record node (#3513)
* pipeline transformer

* clean up

* address feedback

* add record/wait for first stage and updated split script

* address feedback

* make recv/send signal as initializer

* merge

* address feedback

* unify input and initializer

* address feedback and bug fix

* minor fix

* windows build

* fix
2020-04-22 23:28:01 -07:00
pengwa
6136fd0789
GatherElementsGrad Kernels (#3627)
* GatherElementsGrad cuda kernel & tests

* Fix comments

* Fix include path
2020-04-23 14:02:34 +08:00
Wei-Sheng Chin
d9641f292d
Try not to modify base name (#3638) 2020-04-22 22:24:43 -07:00
Vincent Wang
ffe19ae49b
Expand elimination and Expand gradient. (#3610)
* Expand elimination and Expand gradient.

* Resolve comments.

* Fix test break.

* Check if graph can remove the node.

* Resolve comment.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-04-23 13:17:15 +08:00
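One condition under which an Expand node could be eliminated is when broadcasting the input shape against the target shape leaves the input shape unchanged. The sketch below illustrates that check only; it is an assumption about what such an elimination tests, not the rule implemented in the commit above, and the names `broadcast_shape` and `is_expand_noop` are hypothetical.

```python
def broadcast_shape(a, b):
    """Right-aligned, NumPy-style broadcast of two shapes."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # A dimension of 1 stretches to match the other dimension.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


def is_expand_noop(input_shape, target_shape):
    """True when Expand would return a tensor of the input's own shape."""
    return broadcast_shape(input_shape, target_shape) == tuple(input_shape)


print(is_expand_noop((2, 3), (1, 3)))  # True
print(is_expand_noop((2, 3), (2, 3)))  # True
print(is_expand_noop((1, 3), (2, 3)))  # False
print(is_expand_noop((3,), (1, 3)))    # False
```

The last case returns `False` because the output gains a leading dimension even though no element is duplicated.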
Tang, Cheng
37f4f74308
expose training session so the training app can register custom kernels and transformers (#3642)
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2020-04-22 21:35:41 -07:00
suffiank
0e12d05cd2
fixes for ort_trainer.py to resume from checkpoint (#3510)
* fixes for ort_trainer.py to resume from checkpoint

* define self.state_dict_ during init

* add comment of explanation

* add unit test for restore from checkpoint

* fix file not found

Co-authored-by: suffian khan <sukha@microsoft.com>
2020-04-22 16:33:58 -07:00
Weixing Zhang
e4fc83252d
Refactoring code related to WARP_SIZE. (#3623)
1. Centralize its definition in common.cuh.
2. Rename it to GPU_WARP_SIZE which can be extended to AMD GPU later.
3. Centralize warp shuffle functions.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-04-22 15:19:06 -07:00
edgchen1
bb9b0ba5b3
Merge pull request #3607 from microsoft/edgchen1/merge_from_master
Merge from master to ort_training
2020-04-22 13:22:32 -07:00
Wei-Sheng Chin
ab70625b29
Add Lamb shape inference (#3634) 2020-04-22 11:32:28 -07:00
Edward Chen
8df5076d96 Merge remote-tracking branch 'origin/master' into edgchen1/merge_from_master 2020-04-22 17:16:00 +00:00
Edward Chen
8d09cefafc Merge remote-tracking branch 'origin/ort_training' into edgchen1/merge_from_master 2020-04-22 16:56:15 +00:00
edgchen1
b518cb2a7a
Clean up OPTIONAL name conflict workarounds in ort_training. (#3622)
* Clean up OPTIONAL name conflict workarounds.

* Clean up unnecessary header file onnx_protobuf.h

Co-authored-by: Sherlock Huang
2020-04-22 09:07:55 -07:00
Vincent Wang
d3a2ac5c5c
Eliminate Useless Cast during Transformer. (#3606)
* Remove Useless Cast during Transformer.

* Resolve comments.

* Check if graph can remove the node.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-22 16:36:46 +08:00
Tianlei Wu
d69bc31309
Refine BERT optimization script options (#3618)
* Remove parameters like --gpu_only and --sequence_length. Update bert GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it. 
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
2020-04-21 21:28:06 -07:00
Scott McKay
b4508dbdc6
Improve TopK performance. (#3612)
* Update TopK implementation.
  - add faster heap
  - special case k=1
  - update selector for when to use heap and when to use nth_element based on performance testing
  - parallelize if there is enough work to do
  - reduce templatized code
  - add some extra unit tests.

Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:

```
    batches = [10, 25, 50]
    batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
    k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
```

For larger batches (e.g. 50x2048) the speedup is over 20x.
2020-04-22 10:05:13 +10:00
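The selection strategy described in the TopK commit above (special-case k=1, otherwise choose between a heap and a selection algorithm) can be sketched in Python. This is an illustration, not the ONNX Runtime code: the `heap_ratio` threshold is a made-up parameter, a full sort stands in for `nth_element` because the Python standard library has no equivalent, and breaking ties toward the lower index is an assumption.

```python
import heapq


def top_k(values, k, heap_ratio=4):
    """Return (top values, their indices), largest first.

    `heap_ratio` is a hypothetical tuning knob: the heap path is used
    when k is small relative to the input length.
    """
    n = len(values)
    if k <= 0 or n == 0:
        return [], []

    def key(i):
        # Larger value first; lower index wins ties.
        return (-values[i], i)

    if k == 1:
        # Special case: a single linear scan.
        idx = [min(range(n), key=key)]
    elif k * heap_ratio < n:
        # Small k relative to n: bounded heap.
        idx = heapq.nsmallest(k, range(n), key=key)
    else:
        # Large k: full sort, standing in for nth_element.
        idx = sorted(range(n), key=key)[:k]
    return [values[i] for i in idx], idx


values = [3, 1, 4, 1, 5, 9, 2, 6]
print(top_k(values, 1))  # ([9], [5])
print(top_k(values, 3))  # ([9, 6, 5], [5, 7, 4])
```

For scale, the benchmark grid listed in the commit message covers 3 × 9 × 11 = 297 input combinations.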
edgchen1
5492d02c4e
Remove Windows CUDA 9 build definition and helper scripts. (#3615) 2020-04-21 15:22:27 -07:00
Sherlock
d66d5bb86a
Update Optimizer Domain and Opset (#3602)
* Update Domain and Opset for SGD

* Update Adam Domain and Opset

* Update Lamb Domain and Opset
2020-04-21 15:06:02 -07:00
Edward Chen
47f1758fdc Add --skip_onnx_tests to orttraining Windows builds. 2020-04-21 21:50:35 +00:00
Edward Chen
297ab43b0c Add --enable_onnx_tests to Windows builds to allow set up of test data directory. 2020-04-21 20:34:55 +00:00
Edward Chen
2e4b9b1d0e Disable CudaKernelTest.SoftmaxCrossEntropyLoss_LargeSizeTensor because it's flaky. 2020-04-21 20:30:45 +00:00
Edward Chen
28a0c863b1 Revert "Convert Gelu to use TryParallelFor (#3599)"
This reverts commit 2579a72a88.
2020-04-21 18:45:20 +00:00
Edward Chen
d50c3e7a71 Fix GraphTransformationTests tests. 2020-04-21 18:43:49 +00:00
Pranav Sharma
9636da3951
Threadpool related changes. (#3564)
Threadpool related changes.

Don't create ORT threadpool if openmp is enabled (except for inter op threadpool).
Created a new static function ThreadPool::NumThreads to account for openmp settings and null threadpool ptr.
Log a warning when using SetIntraOpNumThreads when openmp is enabled.
Added a document for ORT devs.
Fix LSTM to use the new threadpool abstractions.
Rename GetNumCpuCores to GetThreadAffinityMasks and move it to the Env class.

Co-authored-by: Tracy Sharpe <tracysh@microsoft.com>
2020-04-21 09:57:39 -07:00
Adam Pocock
3dd3f84116
[Java] Adding model metadata support (#3573)
* java - adding deployment information to build.gradle.

* java - adding support for model metadata.
2020-04-21 02:28:15 -07:00
George Wu
1c37d5e6ec
debug option for dumping tensorrt subgraphs. (#3604) 2020-04-21 11:55:30 +08:00
Edward Chen
87fad09c7b Fix merge issue. 2020-04-21 03:44:44 +00:00
Edward Chen
daa14b64e3 Merge remote-tracking branch 'origin/master' into edgchen1/merge_from_master 2020-04-21 03:31:32 +00:00
edgchen1
ead00f97f3
Sync onnx_backend_test_series.py disabled tests (#3603)
Make the set of disabled tests consistent between ort_training and master. Fix some regex patterns.
2020-04-20 18:00:53 -07:00
pengwa
e233e6ba45
Refactor - ScatterElements (#3559)
Refactor ScatterElements to use MLTypeCallDispatcherRet
2020-04-21 08:58:42 +08:00
Changming Sun
2579a72a88
Convert Gelu to use TryParallelFor (#3599) 2020-04-20 17:32:39 -07:00
Changming Sun
911d125323 Remove openmp from gpu build 2020-04-20 17:13:54 -07:00
liqunfu
781e1c36be
Add front-end MNIST test (#3231)
* add frontend MNIST test

* to use torch nightly with torchvision

* remove incorrect comment per reviewer's comment

* experiment torchvision import failure

* experiment install_deps.sh

* more experiment install_deps.sh

* experiment install_deps.sh with --upgrade

* Experiment with install_deps.sh.

* Experiment with install_ubuntu.sh.

* Use Ubuntu 18.04 and Python 3.6 for CI.

* Update cmake version for CI.

* Install MPI on Ubuntu 18.04 for CI.

* Increase tolerance for MNIST test.

* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.

* Clean-up.

* Update ort_trainer.py from ort_training.

* Get default Ubuntu Python ver back to 3.5.

* Add underscore to opset_version parameter name in ORTTrainer constructor.

* Move loss/model wrap before the call for sample output.

* Update expected values for MNIST test.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
2020-04-20 11:19:31 -07:00
edgchen1
f180b71f27
Support ONNX test version parsing from path on Windows in onnx_test_runner. (#3588) 2020-04-20 10:02:51 -07:00
Sheil Kumar
31b6629e99
Fork WinML IDL Guids (#3591)
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2020-04-20 09:17:07 -07:00
Prabhat
381fee47ab
Added support to build onnxruntime with ACL (#3586)
* Added support to build onnxruntime with ACL

* Added ACL build instructions
2020-04-20 13:35:28 +05:30
Changming Sun
75426a3091 Fix build break 2020-04-19 18:32:46 -07:00
Zhang Lei
422266c445
Support conv transpose 1D in cuda provider. (#3300)
* Support conv transpose 1D in cuda provider.

* Clear some old comment. Enable conv_transpose_1d onnx test for cuda.
2020-04-19 22:07:34 +08:00
Scott McKay
7d5348f87e
Add ability to batch device copy for graph inputs and outputs. (#3580)
* Add ability to batch device copy for graph inputs and outputs.
2020-04-19 17:51:07 +10:00
Prabhat
ea62b3435a
Clean up build.py code (#3466) 2020-04-18 20:48:30 -07:00
Maxim Kalinin
fcf0f6ee9f
Generalize reshape fusion (#3554)
* Generalize reshape fusion

* Allow arbitrary number of Concat arguments
* Apply fusion even when an output of an internal node is used elsewhere
* Fix a bug when an internal node's output is the subgraph output
* Simplify code
2020-04-18 20:47:23 -07:00
Tiago Koji Castro Shibata
14e387aa1a
Fix WinML namespace build break (#3583)
* Add missing winrt namespace

* Conditional compilation of dxcore code

* Fix TAEF macros
2020-04-18 20:46:01 -07:00
Sherlock
56b223bc60
Implement OneHot CUDA Kernels (#3390)
* Implement OneHot CUDA Kernels

* Support fp16

* Use HandleNegativeAxis

* Make MLFloat16 test GPU only
2020-04-18 17:41:39 -07:00
Hariharan Seshadri
1599562016 Fix BatchNorm CUDA kernel definition 2020-04-18 17:21:29 -07:00
Zhang Lei
c365822808
Refactor parts of calibrate.py. Add QLinearAdd and QLinearMul support. Fix bugs loading JPGs that are not strictly RGB, and typos in load_batch call. (#3542) 2020-04-18 17:10:55 -07:00