Commit graph

1566 commits

Author SHA1 Message Date
Zhang Lei
aa37e2de8f
Direct use python numpy array's memory if already contiguous. (#2355)
* Directly use the Python numpy array's memory if it is already contiguous. This
can greatly improve performance for sessions with large inputs, such as a
1920x1080 image fed to fastrcnn, where a 30~40% speedup could be achieved.

* Add test case to enforce contiguous/non-contiguous numpy arrays as inputs.
2019-11-11 13:46:55 -08:00
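
The contiguity check described in the commit above (#2355) can be sketched in Python with NumPy. This is only an illustration of the idea, not the binding code from the PR; the helper name `as_input_buffer` is hypothetical and the small array stands in for a large image.

```python
import numpy as np


def as_input_buffer(arr):
    """Return (array, copied): reuse arr when it is C-contiguous, else copy.

    Hypothetical helper for illustration; not part of onnxruntime.
    """
    if arr.flags["C_CONTIGUOUS"]:
        # Memory is already laid out row-major, so it can be used directly.
        return arr, False
    # A strided view has to be packed into a fresh contiguous buffer first.
    return np.ascontiguousarray(arr), True


image = np.zeros((4, 6, 3), dtype=np.float32)  # stand-in for a big HWC image
same, copied = as_input_buffer(image)

transposed = image.transpose(2, 0, 1)  # a non-contiguous view of the same data
packed, copied_t = as_input_buffer(transposed)
```

The saving comes from the first branch: a contiguous input skips the copy entirely, which matters most when the input is large.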
Zhang Lei
ed6da0d191
Implement cuda nonzero op. (#2056)
Implement cuda nonzero op.
2019-11-11 13:45:52 -08:00
avidiyal
3d3cf0e159 Openvino EP R3.1 onnxrt server (#2357)
* onnxrt server with OVEP

* onnxrt server with OVEP

* Update Dockerfile.server.openvino

* onnxrt server OVEP fix reviews

* onnxrt server OVEP fix reviews
2019-11-11 12:22:19 -08:00
Scott McKay
599d72a94f
Fix/test dim value of 0 handling in a couple of places (#2337)
* Update the CUDA Where implementation broadcasting logic to handle a dim with value of 0.
Add unit test
Also add unit test for unary op with dim value of 0

* Exclude ngraph from Where test with 0 dim.
2019-11-11 07:57:19 +10:00
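
As a reference for the behavior the commit above (#2337) tests, NumPy shows what broadcasting with a dim value of 0 should produce. This sketch assumes only that ONNX `Where` follows NumPy-style broadcasting; it does not exercise the CUDA kernel itself.

```python
import numpy as np

cond = np.zeros((0, 3), dtype=bool)    # first dim has value 0
x = np.ones((1, 3), dtype=np.float32)  # broadcasts along the first dim
y = np.zeros((3,), dtype=np.float32)   # broadcasts to both dims

out = np.where(cond, x, y)
out_shape = out.shape  # the 0 dim wins over the 1 dim when broadcasting

# A unary op on a tensor with a 0 dim keeps the shape and yields no elements.
unary_shape = np.abs(np.empty((2, 0), dtype=np.float32)).shape
```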
Dmitri Smirnov
25b3c51661
Introduce PrimitiveType into a Type System along with an integer constant (#2307)
Improve perf by avoiding GetType<T>() calls. Introduce MLTypeCallDispatcher to switch on Input Type. Add Tensor IsType<T>() fast method.
2019-11-08 17:47:06 -08:00
jignparm
fa30b1e758
Set ElementType to String type of node metadata, instead of byte[] (#2348)
* Set ElementType to String type of node metadata, instead of byte[]

* Fix spacing
2019-11-08 14:52:56 -08:00
Zhang Lei
7fcd752393
Cuda Reverse Sequence Op, mapping types of same size using same template function. (#2281) 2019-11-08 13:52:26 -08:00
Changming Sun
080a0a3186
Nuget pipeline changes (#2305)
1. refactor the pipeline, remove some duplicated code
2. Move Windows_py_GPU_Wheels job to Win-GPU-CUDA10. We'll deprecate the "Win-GPU" pool
3. Delete cpu-nocontribops-esrp-pipeline.yml and cpu-nocontribops-pipeline.yml
4. In Linux nuget jobs, run "make install" before creating the package, so that extra RPATH info will be removed
2019-11-08 09:45:52 -08:00
Scott McKay
5a3ea7469a
Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanUnusedInitializers. (#2320)
* Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanupUnusedInitializers.

This means initializers that have been replaced during graph optimizations are not left in the GraphProto when we save an optimized model.

* Handle edge case where a model has an unused initializer with matching graph input by also removing the graph input.

* Use non-const iterators in std::find_if calls to make centos build happy.
2019-11-08 16:29:50 +10:00
Yulong Wang
da3c0ba14b
implement CPU contrib OP Attention (#2333) 2019-11-07 17:14:59 -08:00
Tianlei Wu
b539cc74c7
Add FastGelu Cuda Op for Gelu and Add bias fusion (#2293)
* Add FastGelu cuda op

* Add AddBiasGelu for experiment

* Revert "Add AddBiasGelu for experiment"

This reverts commit 5c1ee019858c657e6bb75887265cb85675626e5b.

* Add bias

* Add unit tests

* update comment

* update script

* fix build error

* update coding style

* update for CR feedback
Enable half2 optimization only when cuda arch >= 7.0

* move _Tanh to common.cuh
2019-11-07 17:05:55 -08:00
liuziyue
259bff8cf1
Layer Normalization Fusion (#2319)
basic layer normalization transform
2019-11-07 12:00:08 -08:00
Hariharan Seshadri
553537ed52
Add CUDA GatherElements kernel (#2310)
* Updates

* Update test

* Update

* Updates

* nits

* PR feedback

* Update

* Update

* PR feedback

* PR comments

* Update

* Fix build

* Fix build

* Nits

* Fix
2019-11-07 10:54:20 -08:00
Yufeng Li
6651d2f662
Make elementwise op run 4 items per thread (#2335)
Description:
Make elementwise op run 4 items per thread
unroll for loop to leverage ILP
remove unnecessary N==0 check inside elementwise GPU kernel
Motivation and Context:
It can improve the performance of GPU elementwise ops. ~2% performance gain on popular NLP bert model.
2019-11-06 17:15:25 -08:00
George Wu
ba0e7daf20
update dockerfiles/README (#2336) 2019-11-06 16:54:10 -08:00
baowenlei
0f1e24f4a9 [NupharEP] tensorize int8 GEMM for avx (#2142)
* finish avx tensorization and save state

* split tests for better debug

* add missing avx option

* update configure for AVX

* update tensorize avx support

* Merged PR 5327: Fix llvm cross compilation

Fix llvm cross compilation

Related work items: #4080
2019-11-06 14:35:13 -08:00
KeDengMS
58e6aaa414
Fix crash in releasing TLS from CUDA EP dtor (#2329)
thread_local/global/static destruction order depends on implementation details of compilers and OS. The bug happens when a thread_local is already out of scope while the static EP is being destructed, causing an access violation in the EP's destructor when it accesses the thread_local.

The fix is to maintain ownership inside EP with a mapping from tid to ThreadLocalContext, to avoid accessing thread_local in EP's destructor. This way, no matter what the destruction order is, no access violation would be triggered.
2019-11-06 13:00:17 -08:00
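
The ownership pattern described in the commit above (#2329) can be sketched as follows. The real fix is in the C++ CUDA EP; this is a Python analogy only, and the names `Provider`, `ThreadLocalContext`, `context` and `close` are hypothetical. The point is that the owner keeps a tid-to-context map and tears it down itself, instead of reaching into thread-local storage at destruction time.

```python
import threading


class ThreadLocalContext:
    """Stand-in for per-thread state (hypothetical)."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


class Provider:
    """Owns every per-thread context through an explicit tid -> context map."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}

    def context(self):
        tid = threading.get_ident()
        with self._lock:
            ctx = self._contexts.get(tid)
            if ctx is None:
                ctx = self._contexts[tid] = ThreadLocalContext()
            return ctx

    def close(self):
        # Teardown walks the owned map and never touches thread-local
        # storage, so the order in which threads exit does not matter.
        with self._lock:
            contexts = list(self._contexts.values())
            self._contexts.clear()
        for ctx in contexts:
            ctx.release()
        return len(contexts)


provider = Provider()
barrier = threading.Barrier(3)  # keep all workers alive at the same time
seen = []


def worker():
    ctx = provider.context()
    barrier.wait()
    seen.append(ctx)


workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

released = provider.close()  # all worker threads have already exited
```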
Yulong Wang
c0b8926863
implement CPU contrib OP EmbedLayerNormalization (#2332) 2019-11-06 12:27:08 -08:00
George Wu
06a6d74a67
update ngraph dockerfile. add python lib location to LD_LIBRARY_PATH for cuda/tensorrt Dockerfiles. (#2330) 2019-11-06 11:29:55 -08:00
Vinitra Swamy
ace19129b9 MCR Docker Images v1.0.0 refresh (#2302)
* update dockerfile table with new MCR tags

* add new openvino dockerfiles to table
2019-11-05 22:06:47 -08:00
Patrick Foley
151075790d [OpenVINO-EP] Update to latest version: OpenVINO 2019 R3.1 (#2308)
* Updates OpenVINO EP to latest version: 2019 R3.1

* Reviews fixed

* Update Dockerfile.openvino

* Addressed PR comments and disabled model tests temporarily

* Update Dockerfile.ubuntu_openvino
2019-11-05 19:55:46 -08:00
Dwayne Robinson
db454beacf
TensorDesc::Placement test failure - cherry pick Vibranium fix. (#2328) 2019-11-05 18:18:31 -08:00
Scott McKay
67ec626d88
Copy blocks in Slice when possible (#2312)
* Add logic to try and flatten inner dimensions being copied by Slice and do a block copy if they can be.
Do a block copy for just the inner most dimension where possible (applies even if we don't flatten inner dimensions).
2019-11-06 10:53:30 +10:00
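
The flattening idea in the commit above (#2312) can be sketched in pure Python for a step-1 slice of a row-major tensor. This is an illustration under that assumption, not the kernel from the PR; `contiguous_block` and `slice_with_block_copy` are hypothetical names.

```python
from math import prod


def contiguous_block(shape, starts, ends):
    """Return (axis, block_len) for a step-1 slice of a row-major tensor.

    Trailing dimensions that are copied in full are flattened together
    with the sliced range on `axis`, giving one contiguous run of
    `block_len` elements per copy.
    """
    axis = len(shape) - 1
    # Walk from the innermost dim outward while the dim is taken whole.
    while axis > 0 and starts[axis] == 0 and ends[axis] == shape[axis]:
        axis -= 1
    block_len = (ends[axis] - starts[axis]) * prod(shape[axis + 1:])
    return axis, block_len


def slice_with_block_copy(data, shape, starts, ends):
    """Slice a flat row-major list by copying whole blocks."""
    axis, block_len = contiguous_block(shape, starts, ends)
    strides = [prod(shape[i + 1:]) for i in range(len(shape))]
    out = []

    def walk(dim, offset):
        if dim == axis:
            begin = offset + starts[axis] * strides[axis]
            out.extend(data[begin:begin + block_len])  # one block copy
            return
        for i in range(starts[dim], ends[dim]):
            walk(dim + 1, offset + i * strides[dim])

    walk(0, 0)
    return out


shape = (2, 3, 4)
data = list(range(24))
# Equivalent to tensor[:, 1:3, :]; the innermost dim is copied in full.
axis, block_len = contiguous_block(shape, (0, 1, 0), (2, 3, 4))
result = slice_with_block_copy(data, shape, (0, 1, 0), (2, 3, 4))
```

When the innermost dimension is itself sliced, `axis` stays on it and the block is just that dimension's range, which matches the second case the commit message mentions.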
Changming Sun
104f3b2a59 Exclude candy from CUDA tests 2019-11-05 15:22:09 -08:00
Changming Sun
143ae98a37
Fix a bug in onnxruntime_pybind_state.cc when TENSORRT is enabled (#2326) 2019-11-05 15:04:50 -08:00
George
8a102c6e99 apply eigen patch only for ACL. 2019-11-05 13:53:53 -08:00
Changming Sun
5ce4d4fc49 Fix a test failure when it runs on FreeBSD 2019-11-04 23:47:37 -08:00
Yufeng Li
035913d42f
Support int32_t for Reduction (#2317) 2019-11-04 20:52:01 -08:00
manashgoswami
d5c36bfff2 Updated links in docs (#2303)
* Update README.md

* Update README.md

* Update README.md
2019-11-03 09:10:56 -08:00
Faith Xu
556bae17a5 Fix versions table (#2309)
* Update table values

* Fix onnxml opset version
2019-11-03 08:58:21 -08:00
Yulong Wang
cba93f7c8d fix Gelu CPU: remove MayInplace() declaration (#2306) 2019-11-01 18:10:05 -07:00
Yulong Wang
204a6872d3
remove unused param 'input_count' in ConcatImpl (#2304) 2019-11-01 15:50:11 -07:00
Tianlei Wu
a6b2c9fc09
Fix mask in EmbedLayerNormalization (#2300) 2019-11-01 13:49:55 -07:00
KeDengMS
6e65dcf588
[NupharEP] symbolic_shape_infer improvements (#2299)
- Improves symbolic shape inference in following ways:
1. Extend suggested merge to map to literals with --auto_merge. For example, MatMul of ['ax1', 'ax2'] x [128, 256] would now map 'ax2' to 128
2. Add --int_max option to simplify computations like Min(100000, 'dim') to be 'dim'. This helps ops like Slice to generate correct shape, i.e. start=0, end=Min(100000, dim - 2) on dim. It was previously treated as equal, since sympy cannot determine Min(100000, dim - 2) < dim.
- Fix a bug in the create_shared script on Windows where the AOT dll was not generated because linking failed when there were too many obj files
- Fix a bug for Split since TOPI does not support split on symbolic dimension.
- Some build warning fixes for NupharEP.
2019-11-01 11:34:52 -07:00
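
The `--auto_merge` example from the commit above (#2299) can be sketched as a toy function. This is not the code in symbolic_shape_infer; `merge_matmul_dims` is a hypothetical helper that only handles 2-D MatMul, with symbolic dims represented as strings and literals as ints.

```python
def merge_matmul_dims(lhs, rhs, auto_merge=True):
    """Infer a 2-D MatMul output shape, merging the contraction dims.

    Returns (output_shape, merged), where `merged` maps a symbolic name
    to the dim it was unified with. Hypothetical, for illustration only.
    """
    k_lhs, k_rhs = lhs[-1], rhs[0]
    merged = {}
    if k_lhs != k_rhs:
        if not auto_merge:
            raise ValueError(f"cannot unify {k_lhs!r} with {k_rhs!r}")
        if isinstance(k_lhs, str):
            merged[k_lhs] = k_rhs  # symbolic dim mapped to the other side
        elif isinstance(k_rhs, str):
            merged[k_rhs] = k_lhs
        else:
            raise ValueError(f"literal dims differ: {k_lhs} vs {k_rhs}")
    return [lhs[0], rhs[1]], merged


# MatMul of ['ax1', 'ax2'] x [128, 256] maps 'ax2' to the literal 128.
shape, merged = merge_matmul_dims(["ax1", "ax2"], [128, 256])
```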
Tianlei Wu
bc85d43809
Dump cuda tensor data (#2243)
* dump cuda tensor

* move data_type definition

* Dump cuda tensors for cuda build only.
Output tensor location (if it is not in CPU or pinned)

* update for cuda build

* Update for code review feedback

* update for CR feedback

* use data transfer manager for tensor copy
2019-10-31 21:09:10 -07:00
Scott McKay
7a5de9c958
Add a python script with a number of helper actions for creating/editing/dumping onnx test runner format pb files (#2294)
* Add a python script with a number of helper actions for creating/editing/dumping onnx test runner format pb files.
2019-11-01 06:39:14 +10:00
mikecaraman
358b517d49 [v2] Add ACL (Arm Compute Library) execution provider (#2258)
* Guard unused parameter

Guard unused parameter for Linux Arm and other cases.

* Add ACL (Arm Compute Library) execution provider

Add a new execution provider targeting Arm architecture based on Arm Compute Library.
Validated on NXP i.MX8QM CPU with ResNet50, MobileNetv2 and VGG models.
All unit tests are passing.

Comparative performance improvements for ResNet50v1 model obtained with
onnxruntime_perf_test:
		A72	2xA72	A53	4xA53
ACL vs CPU  	16%	9%	21%	13%

Usage documentation available in ACL-ExecutionProvider.

* Fix eigen unused parameter

Fix eigen unused parameter error for Arm cross-compilation.
2019-10-31 12:25:36 -07:00
Yulong Wang
bf7fa091cc
NonMaxSuppression cuda implementation (#2082) 2019-10-31 11:53:22 -07:00
Changming Sun
67755adfd8 Bug Fix: NodeArg class has a move constructor but doesn't have a move assignment operator 2019-10-31 10:29:54 -07:00
RandySheriffH
d6849bd26c
Rashuai/cuda top k (#1919)
* implement cuda topk

* implement heap

* add type support

* refactor interface

* add support for sorting by index

* add test case

* use cub device radix sort

* register for opset 9 and 10

* add opset 9/10 declaration

* refactor code

* refactor code

* fix comment

* fix comment

* switch to scratched mem
2019-10-31 10:26:00 -07:00
Hariharan Seshadri
4bcd8bfca1
Fix CUDA Reduce ops (#2268)
* Add some tests for Reduction ops

* Exclude tensorrt for new tests

* Fix bug in CUDA Reduce ops

* Fix nit
2019-10-31 10:11:59 -07:00
Changming Sun
a5da5ff6f4 Remove onnxruntime_USE_EIGEN_THREADPOOL cmake option 2019-10-30 21:51:54 -07:00
KeDengMS
ff64d1f55b
Relax check for optimized model saving (#2291)
So user may save model with layout optimization.
2019-10-30 21:48:49 -07:00
Maik Riechert
ecfbb1bb99 Add missing guards to profiling calls (#1374)
* guard remaining profiler calls

* enforce proper usage of profile class
2019-10-31 14:28:49 +10:00
George Wu
aa041026e3
update Dockerfile.openvino (#2286)
* install miniconda before openvino installation. add networkx, defusedxml dependencies.

* apt-get update

* apt-get update

* merge Intel changes.
2019-10-30 13:58:24 -07:00
Tomasz Dołbniak
427e627805 Support for the Expand op with constant shape inputs (#2278)
* Disable the Expand op for non-const shape inputs

* Check if an input is constant with IsConstantInitializer
2019-10-30 13:22:45 -07:00
KeDengMS
e18c9582a8
[NupharEP] performance improvements (#2283)
* [Nuphar EP] performance improvements
1. Add new ops: Shape, Expand
2. Add support for steps in Slice
3. Simplify Gather
4. Always inline alias nodes
5. Transpose nodes with inner loop being symbolic fall back to CPU provider when vectorization is not possible
6. Add opt_inproj option to model_editor to extract MatMuls inside Scan for input projection to outside
2019-10-30 10:15:04 -07:00
zhijxu
63e9961637 fix typo 2019-10-30 09:57:56 -07:00
zhijxu
8dabe0502b merge two RUN to avoid making docker image too large 2019-10-30 09:57:56 -07:00
Changming Sun
7b11f05a97 Update version number 2019-10-30 08:13:09 -07:00