Commit graph

4725 commits

Author SHA1 Message Date
Yulong Wang
00aaa6dabb
update CI for onnxruntime-web (#7497) 2021-04-29 22:22:52 -07:00
Changming Sun
0d107bbb73
Fix CUDA 10.2 pipeline (#7508) 2021-04-29 22:22:35 -07:00
Scott McKay
d6df5764d7
Android package infrastructure (#7430)
* Include ORT format model conversion scripts and infrastructure in ORT python package.
  - tweak existing script setup so it can be easily run directly and from the ORT python package
Add config file and readme for Android minimal build package
Update ORT Mobile doco
Disable warning if 'all' optimizations are enabled but NCHWc transformer is excluded (device specific optimizations don't apply in this scenario so the warning is moot).

* Address PR comments
2021-04-30 14:23:54 +10:00
Tim Harris
3d92723d1c
"Sticky" allocation of worker threads (#7372)
* Sticky thread allocation

* Test sticky thread assignment

* Test sticky thread assignment

* Test sticky thread assignment

* Expose control over additional worker assignment stats

* Sticky thread allocation

* Test sticky thread assignment

* Test sticky thread assignment

* Test sticky thread assignment

* Expose control over additional worker assignment stats

* Merge

* Merge

* Merge

* Fix Windows build

* Fix windows build 2

* Build Python 3.8 Windows CPU only

* Add env var to override binding

* Build Python 3.8 Windows CPU only

* Fix windows build

* Remove thread affinity override

* Remove goodworker

* Remove Python build settings

* Remove unneeded changes

* Remove unneeded changes

* Remove unneeded changes

* Remove unneeded changes

* Remove unneeded changes

* Remove unneeded changes

* Tidy

* Tidy

* Avoid race on preferred_worker vector

* Improve assertions

* Improve assertions

* Enum for PushBackWithTag result

* Remove unused field

* Update comments

* Extra debugging

* Extra debugging

* Extra debugging

* Support varying thread pool sizes

* Improve comments

* Remove requirement for thread local to be trivially destructible

* Use unsigned consistently for thread counts, removing casting

* Remove debug code

* Fix webassembly build

* Merge

* Merge

* Merge

* Remove unused code

* Fix build

* Extra test case for varying loop sizes

* Clean variable names

* Clean variable names

* Clean variable names

* Remove unneeded include, fix build

* Fix profiling

* Update from review comments
2021-04-29 20:42:14 -07:00
Edward Chen
ec04b6203b
Remove conditional compilation of std::is_trivially_copyable since we are no longer supporting GCC 4. (#7504) 2021-04-29 19:13:09 -07:00
Changming Sun
1012535dab
Change onnxruntime::make_unique to std::make_unique (#7502)
1. Change onnxruntime::make_unique to std::make_unique
2. Add "-std=c++14" to ROCM EP's build flags.
2021-04-29 17:04:53 -07:00
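The switch to std::make_unique in #7502 relies on C++14, which is why the commit also adds "-std=c++14" to the ROCM EP build flags. A minimal sketch of the standard call follows; the Session type and MakeSession helper are hypothetical and are not taken from the ORT code base.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical type used only for illustration.
struct Session {
  std::string name;
  int threads;
  Session(std::string n, int t) : name(std::move(n)), threads(t) {}
};

// std::make_unique (C++14) forwards its arguments to the constructor and
// returns a std::unique_ptr that owns the new object.
std::unique_ptr<Session> MakeSession(const std::string& name, int threads) {
  return std::make_unique<Session>(name, threads);
}
```

Compiling this requires -std=c++14 or later, since std::make_unique is not part of C++11.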
Yufeng Li
d337fa90e7
Propagate QDQ only when scale and zp are scalar (#7492)
fix crash when DeQuantizeLinear's output is graph output
propagate only when scale and zp are scalar.
fix bug in is_modified = is_modified || TryCancelOutDQQPair(graph, dq_node, q_node); where short-circuit evaluation meant TryCancelOutDQQPair was not invoked once is_modified was true
2021-04-29 14:40:41 -07:00
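The short-circuit issue described in #7492 can be sketched as below. This is an illustration, not the actual transformer code: TryCancel is a hypothetical stand-in for TryCancelOutDQQPair, and the call counter exists only to make the difference observable.

```cpp
#include <vector>

// Hypothetical stand-in for TryCancelOutDQQPair: counts its invocations
// and reports whether it "modified" anything.
static int g_calls = 0;

bool TryCancel(bool would_modify) {
  ++g_calls;
  return would_modify;
}

// Buggy pattern: once is_modified is true, || short-circuits and
// TryCancel is never evaluated for the remaining pairs.
bool RunBuggy(const std::vector<bool>& pairs) {
  bool is_modified = false;
  for (bool p : pairs) {
    is_modified = is_modified || TryCancel(p);
  }
  return is_modified;
}

// One possible fix: put the call on the left so it always runs,
// then combine its result with the flag.
bool RunFixed(const std::vector<bool>& pairs) {
  bool is_modified = false;
  for (bool p : pairs) {
    is_modified = TryCancel(p) || is_modified;
  }
  return is_modified;
}
```

With three pairs that would all be modified, RunBuggy calls TryCancel once and RunFixed calls it three times. Whether the actual commit reordered the operands or restructured the statement differently is not stated in the message.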
Scott McKay
e255506bcd
Add another input validation to ReverseSequence (#7445)
* Add another input validation to ReverseSequence

* Limit the bad length test to the CPU EP
2021-04-30 07:24:32 +10:00
Xiaoyu Liu
994c2ed420
GPT2 one step beam search update with configuration support (#7425)
* check in early stop search as separate type
* rename to beam search configurations
* update do sample configuration flag help
* rename to configurable search step
* add option groups
* add more unit tests

Co-authored-by: Xiaoyu Liu <xiaoyu@xiaoyu-VM.z4vh1dzj5eoevgybsksdpz2izh.jx.internal.cloudapp.net>
2021-04-29 13:19:56 -07:00
Ilya Lavrenov
6358e96b63
Added OpenVINO 2021.4 support (#7470)
* Added OpenVINO 2021.4 support

* Added OPENVINO_2021_4 handling
2021-04-29 12:25:04 -07:00
Changming Sun
7b003967b1
Add static code analyzer to Windows CPU/GPU CI builds and fix the warnings (#7489) 2021-04-29 11:54:57 -07:00
Tracy Sharpe
2b0bbfd1a8
MLAS: add SSE 4.1 u8s8 kernel (#7490) 2021-04-29 11:12:32 -07:00
Tang, Cheng
e73c3e0651
rollback the GetRuntimePath impl for linux (#7488)
* rollback the GetRuntimePath impl for linux; limit the dynamic ep load ut for win

* remove the override
2021-04-29 09:11:23 -07:00
Chi Lo
0dbe51b002
Enable TRT EP for C# (#7482)
* enabled TRT EP for C#

* Fix potential leak
2021-04-29 04:56:40 -07:00
RajalakshmiSR
3c7c728989
cmake: Add regex pattern for POWER architecture (#7494)
This patch helps to set architecture as power, when processor
check output matches ppc64le*.

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2021-04-28 22:23:14 -07:00
Adrian Tsai
f13b378995
Re-disable tests (#7495) 2021-04-28 21:50:22 -07:00
sabreshao
e6a3308db7
Optimize cuComputeGradInput performance. (#7479)
Move the checking of gamma to host and specialize both cases through a template.
2021-04-28 17:08:31 -07:00
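The idea in #7479 of checking gamma once on the host and selecting a template specialization can be sketched in plain C++ as follows. This is a CPU-only illustration under assumed names (ScaleGrad, ScaleGradDispatch); it is not the cuComputeGradInput kernel, and the arithmetic is a placeholder.

```cpp
#include <cstddef>

// kHasGamma is a compile-time flag, so each instantiation contains only
// one side of the conditional; gamma is never read when the flag is false.
template <bool kHasGamma>
void ScaleGrad(const float* grad, const float* gamma, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = kHasGamma ? grad[i] * gamma[i] : grad[i];
  }
}

// Host-side dispatch: test gamma once, then pick the specialization,
// instead of testing the pointer for every element inside the loop.
void ScaleGradDispatch(const float* grad, const float* gamma, float* out, std::size_t n) {
  if (gamma != nullptr) {
    ScaleGrad<true>(grad, gamma, out, n);
  } else {
    ScaleGrad<false>(grad, gamma, out, n);
  }
}
```

In a CUDA setting the same pattern would apply to a kernel launch, with the host choosing which instantiation to launch.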
Chandru Ramakrishnan
6773b4f5dd
Fix implicit-exception-spec-mismatch warning. (#7481) (#7483)
* Fix implicit-exception-spec-mismatch warning. (#7481)

* Suppress implicit-exception-spec-mismatch warning.

* Updated to noexcept.

* Unconditionally use noexcept.
2021-04-28 19:17:39 -04:00
Thiago Crepaldi
3ee63beafa
Fix user input order before ORTModule feed it to backend (#7456) 2021-04-28 14:33:40 -07:00
Changming Sun
d68cedfa85
Fix some C/C++ warnings in the jni part (#7385) 2021-04-28 14:25:58 -07:00
Lifu Huang
ab373d6f03
Lifhuan/force trt sequential (#7440)
* Support sequential TensorRT engine build.

* Add documentation.

* Add tests and fix typos.

* Fix missing field in pybind_state.
2021-04-28 13:59:37 -07:00
Bowen Bao
c584d48283
Add sequence identity for opset 14 & fix sequence insert (#7335)
**Description**: 
- Fix SequenceInsert with last position, which is equal to the current sequence length.
- Implement Identity to support sequence input for opset 14.

**Motivation and Context**
- Required to export Huggingface/transformers T5 with beam search.
2021-04-28 13:26:57 -07:00
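The boundary case fixed in #7335, inserting at a position equal to the current sequence length, can be sketched with a plain std::vector standing in for a tensor sequence. InsertAt is a hypothetical helper, not the ORT kernel, and it only handles non-negative positions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Inserts value into seq at position. Valid positions are 0..seq.size()
// inclusive; position == seq.size() appends at the end.
template <typename T>
void InsertAt(std::vector<T>& seq, const T& value, std::size_t position) {
  // The bound check uses '>' rather than '>=' so that the last position
  // (equal to the current length) is accepted.
  if (position > seq.size()) {
    throw std::out_of_range("insert position past end of sequence");
  }
  seq.insert(seq.begin() + static_cast<std::ptrdiff_t>(position), value);
}
```

An off-by-one in the bound check (rejecting position == size) is one plausible way such a bug arises; the commit message does not say what the original defect was.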
thilow
22d7cde725
Fix a 'Squeeze' related issue in symbolic_shape_infer.py (#7380)
* Update symbolic_shape_infer.py

don't rely on static code infer in _infer_Squeeze_

* checking if dropped axes might be != 1

* Checking opset. Logging assumption that symbolic dimensions are unequal to 1.

* more checks
2021-04-28 13:13:04 -07:00
Maajid khan
674915208a
Fixes RelWithDebInfo build issue on windows for OV-EP (#7471)
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
2021-04-28 10:44:05 -07:00
G. Ramalingam
044c78f089
Add function body to LayerNorm (#7378)
* LayerNorm function body v1

* LayerNorm function body

* layernorm function test

* Minor fixes

* Fix signed unsigned comparison

* Move contrib ops test

* Handle optional output parameters

* Add test case for optional outputs

* Handle float16 random generation

* Address PR feedback
2021-04-28 09:31:53 -07:00
Pranav Sharma
da5c9263e9
Add log to allow serving platforms to quantify ORT usage. (#7476) 2021-04-28 08:20:02 -07:00
KeDengMS
8e21329206
Update nuphar notebook model download url (#7475) 2021-04-27 21:18:06 -07:00
liqunfu
196e6702ad
to support multiple cuda versions in published onnxruntime-training package (#7468)
to support multiple CUDA versions in published onnxruntime-training package
2021-04-27 17:15:33 -07:00
Zhang Lei
e64e30ee0d
Improve ConvTranspose by transposing const filter during prepacking. (#7388)
* Improve ConvTranspose by transposing const filter during prepacking.

* Fix CI build break for openvino, which cannot load such onnx models now.
2021-04-27 16:49:03 -07:00
Edward Chen
d21304ceb0
Initial Objective-C API (#7366)
Initial implementation of an Objective-C API.
2021-04-27 10:06:30 -07:00
Changming Sun
78e583d08c
Add CMAKE_CUDA_ARCHITECTURES=52 to TensorRT CI pipelines (#7455) 2021-04-27 09:55:23 -07:00
Yulong Wang
c2418a1f42
[wasm] fix memory info creation (#7461) 2021-04-27 09:29:21 -07:00
liqunfu
4cbd2cce9b
. (#7466)
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-27 09:20:21 -07:00
Yulong Wang
4ebc9c3b5e
[JS] onnxruntime-web (#7394)
* add web

* add script and test

* fix lint

* add test/data/ops

* add test/data/node/ to gitignore

* modify scripts

* add onnxjs

* fix tests

* fix test-runner

* fix sourcemap

* fix onnxjs profiling

* update test list

* update README

* resolve comments

* set wasm as default backend

* rename package

* update copyright header

* do not use class "Buffer" in browser context

* revise readme
2021-04-27 00:04:25 -07:00
Tracy Sharpe
d13e5b2fd9
NCHWc: ReorderInput improvements (#7442)
Implement various improvements related to reordering a tensor for use by NCHWc operations:

Relax the requirement that the input channel count must be a multiple of the NCHWc block size (either 8 or 16 depending on ISA). The requirement now is that the channel count must be a multiple of 4. The implementation of MlasReorderInputNchw would need further work to support relaxing this further, but I don't have any models where I've observed this to be necessary yet.
Support fusing a Transpose(NHWC->NCHW) into a following ReorderInput. ReorderInput now has a channels_last attribute as was done in the past for ReorderOutput. This helps with models converted from TF where the converter is unable to remove all Transpose operations.
Add threading support to ReorderInput to accelerate performance (ReorderOutput will come later).
2021-04-26 19:16:39 -07:00
M. Zeeshan Siddiqui
82108b18e3
Partial graph execution perf improvements. (#7438)
* Partial graph execution perf improvements.

* PR feedback.

* Decrement reference count of tensors in ORTModule.

* PR feedback.

* PR feedback.

* PR feedback.
2021-04-26 17:13:55 -07:00
Thiago Crepaldi
0702a14ee7
Add pytorch version check before loading Python ONNX Runtime training module (#7377) 2021-04-26 14:53:50 -07:00
Edward Chen
4804ede501
Update build docker image cache cleanup build definition (#7452)
Decrease default cache history length to 4 days.
Other minor updates to build definition.
2021-04-26 14:39:46 -07:00
RandySheriffH
40568d8821
Wait for dispatch done in RunParallelSection to fix random TP UT crash (#7443)
* wait for dispatch done in RunParallelSection

* pass worker_fn by value

* cancel move

* only move work_fn when it is last referenced

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2021-04-26 14:12:10 -07:00
Zhang Lei
ada0fbbd2d
Implement qlinear concat and unit test. (#7341)
* Implement qlinear concat and unit test.
Add quantization tools for QLinearConcat and its quantization tests.

* Add kernel def hash for QLinearConcat.

* Change according to PR. Add qdq transformer support for QLinearConcat.

* Add QDQ Transformer unittest. Fix typo on domain.

* remove dup logic of no use.

* fix x86 build error.

* Update operator docs.
2021-04-26 13:38:40 -07:00
Changming Sun
b5592856a7
Remove thread pool's cancel method and suppress some warnings (#7411) 2021-04-26 09:33:48 -07:00
Vincent Wang
368e4a324f
SqueezeGrad Bugfix (#7412)
* squeezegrad bugfix

* fix ut

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2021-04-26 09:12:03 +08:00
Weixing Zhang
ca9b3f18e9
Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414)
* Pass cuda stream to thrust function to not use default stream.

In the commit 299ace0, ORT has been changed to not use cuda default stream.

* update amd_hipify.py

* remove un-necessary stream sync

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-25 01:18:56 -07:00
jeyblu
b9cbbc41ff
dnnl matmul tensor dimension check (#7383) 2021-04-23 23:17:22 -07:00
RandySheriffH
afe912d47c
Reduce perf gap between thread pool and omp (#7333)
* add async dispatch

* minor renamings

* build py38

* restore yml

* fix sync up issue between dispatch thread and main

* fix comments

* refactor SummonWorker and rename to RunInParallelInternal
2021-04-23 18:36:36 -07:00
Thiago Crepaldi
410a81b21b
Add support for ORTModule to execute the graph when ONNX drops unused… (#7424) 2021-04-23 18:10:57 -07:00
Chen Fu
f4f2cc1a00
Add batch interface to floating point GEMM (#7323)
Currently, for high-dimension matmul, we call multiple GEMMs sequentially. In this change we execute these GEMMs in parallel, removing barriers between two adjacent GEMM operations.

Performance was tested with the Bert and T5 models. The Bert model shows no noticeable perf difference, as the heavy lifting is done by the attention operator, which is not changed in this PR. In the T5 model, we see no regression with a low number of parallel threads (x4), and the performance improvement is more pronounced with a high number of threads (8-16). T5 shows a 10% speedup with 16 threads. With profiling, we can see that the most expensive MatMul operators in T5 achieve around a 20% speedup with 16 threads.

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2021-04-23 17:34:22 -07:00
Suffian Khan
7a3c1787af
Add CI pipeline to publish Python training package targeting Rocm (#7417)
* first attempt rocm training wheel

* modifications needed to python packaging pipeline for Rocm 4.1

* changes to not conflict with cuda

missed stage1 changes

remove package push

add option r to getopt

try again without python install

try again without python install

try again without python install

split pipelines and add back push to remote storage

try on cuda gpu pool

try again

try again

try running without az subscription set

try again on original pipeline

change pool

passing AMD Rocm whl on AMD-GPU pool

split rocm pipeline from cuda pipeline

remove comments

* try adding Rocm tests as well

* try with tests in place

* fix trailing ws

* add training data

* try again as root for tests

* use python3

* typo

* try to map video, render group into container

* try again

* try again

* try to avoid yum error code

* make UID 1001

* try without yum downgrade

* define rocm_version=None

* remove CUDA related comments for Rocm Dockerfile

* Don't pin nightly torch torchvision torchtext versions as they expire (for now nightly is required for Rocm 4.1)

* missed requirements-rocm.txt from last commit

* fix whitespace
2021-04-23 17:22:31 -07:00
M. Zeeshan Siddiqui
34ebf7d3dd
Partial graph execution made simple. (#7324)
* Python changes.

* C++ changes.

* fixes/hacks.

* more hacks.

* perf.

* changes.

* changes.

* re-architect partial graph execution and  remove iobinding.

* changes.

* refactor.

* prevent copies from python to c++.

* perf.

* merge conflicts.

* misc.

* fix merge conflicts and tests.

* Ifdef partial executor.

* PR feedback.

* Delete ORT Task et al.

* Clean up.

* clean up.

* Restore SetOutputMLValue().

* PR feedback.

* Re-enable disabled ORTModule tests.

* PR feedback.

* PR feedback.
2021-04-23 15:09:18 -07:00
Changming Sun
5208231126
Fix some warnings in our CUDA code (#7436) 2021-04-23 14:56:20 -07:00