1583 commits

Author SHA1 Message Date
KeDengMS
b15e43a541
[NupharEP] Multiple optimizations (#2380)
Fuse transpose into MatMul
Implement Pow and constant scalar simplification
Vectorize ReduceMean
Improve symbolic shape inference
Minor updates for better debugging in fused function name
2019-11-14 10:40:33 -08:00
Pranav Sharma
7e164eaa6a
Fix reuse logic in allocation planner. (#2393)
* Fix reuse logic in allocation planner.

* PR comments

* Add helpful comments

* Don't allow reuse across string tensors.
2019-11-13 22:51:12 -08:00
Ilya Lavrenov
b90d55b7ea Fixed compilation with ngraph (#2388) 2019-11-13 17:49:00 -08:00
nihui
dde410e073 fix BUILD.md typo (#2375)
build.py: error: argument --config: invalid choice: 'RelWithDebugInfo' (choose from 'Debug', 'MinSizeRel', 'Release', 'RelWithDebInfo')
2019-11-13 17:48:08 -08:00
KeDengMS
51571030ef
Another try to stabilize CUDA CI (#2383)
The root cause seems to be failure in CUDA dealloc when tear down. cudaFree return code was ignored before, so should the debug check.
2019-11-13 15:58:15 -08:00
liuziyue
ffa2812587
Skip layer norm transform (#2350)
* skip layer normalization transformer
2019-11-13 13:46:09 -08:00
Yufeng Li
8ed2928dd5
Fuse Add + Gelu (#2360)
Implement the transformer to fuse add + gelu
Implement the accurate kernel
2019-11-13 09:26:00 -08:00
liuziyue
4b72fedbd5
Layer Norm Fusion Fix (#2379)
* layer norm fusion fix

* Add input shape check in code and unit tests
2019-11-12 17:19:51 -08:00
Scott McKay
8c733c8d82
Add opset 11 version of Split to CUDA ops (#2376)
Organize the CUDA ops definitions so all the opset 10 and 11 parts are together (same setup used for CPU ops)
2019-11-13 07:40:13 +10:00
Scott McKay
c0d23d5ffe
Fix bug with Slice. Need to pass in flattened input dimensions so the initial offset into the input is calculated correctly. (#2372) 2019-11-13 07:00:26 +10:00
KeDengMS
9e26f4de6f
Extend OneHot CPU kernel to support more types (#2311)
* Extend OneHot CPU kernel to support input int64_t, depth int32_t, output float

* Skip BERT before the test data fix is picked up
2019-11-12 11:54:06 -08:00
Ashwini Khade
437772d5bc
update output size calculation for resize (#2366)
* change how output size is calculated for resize op

* add tests for ver 10 resize
2019-11-12 10:06:17 -08:00
KeDengMS
192dcfaa8e
Fix a bug in TLS refcount that may have destabilized CUDA CI (#2374) 2019-11-12 00:48:31 -08:00
Yang Chen
41b9f01e4c
test bidaf with nuphar for avx target (#2370)
increase nuphar test coverage a bit
2019-11-12 00:47:13 -08:00
Changming Sun
fc6773a65b
Add Tracelogging for profiling (#1639)
Enabled only if onnxruntime_ENABLE_INSTRUMENT is ON
2019-11-11 21:34:10 -08:00
George Wu
0c6e9f94d0
fix builds enabling onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS (#2369)
* fix builds enabling onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS

* update
2019-11-11 15:26:18 -08:00
Scott McKay
53ed36a3da
Add helper to create output to minimize binary size. (#2365)
Add ConstEigenTensorMap typedef so we don't unnecessarily const_cast the const input Tensor.
2019-11-12 09:08:04 +10:00
Zhang Lei
aa37e2de8f
Directly use Python numpy array's memory if already contiguous. (#2355)
* Directly use a Python numpy array's memory if it is already contiguous. This
can greatly improve performance for sessions with large inputs,
such as a 1920x1080 image with fastrcnn, where a 30~40% speedup can be achieved.

* Add a test case that enforces contiguous/non-contiguous numpy arrays as inputs.
2019-11-11 13:46:55 -08:00
Zhang Lei
ed6da0d191
Implement cuda nonzero op. (#2056)
Implement cuda nonzero op.
2019-11-11 13:45:52 -08:00
avidiyal
3d3cf0e159 Openvino EP R3.1 onnxrt server (#2357)
* onnxrt server with OVEP

* onnxrt server with OVEP

* Update Dockerfile.server.openvino

* onnxrt server OVEP fix reviews

* onnxrt server OVEP fix reviews
2019-11-11 12:22:19 -08:00
Scott McKay
599d72a94f
Fix/test dim value of 0 handling in a couple of places (#2337)
* Update the CUDA Where implementation broadcasting logic to handle a dim with value of 0.
Add unit test
Also add unit test for unary op with dim value of 0

* Exclude ngraph from Where test with 0 dim.
2019-11-11 07:57:19 +10:00
Dmitri Smirnov
25b3c51661
Introduce PrimitiveType into a Type System along with an integer constant (#2307)
Improve perf by avoiding GetType<T>() calls. Introduce MLTypeCallDispatcher to switch on Input Type. Add Tensor IsType<T>() fast method.
2019-11-08 17:47:06 -08:00
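The commit above (25b3c51661) describes switching on an integer type constant rather than comparing type objects. The sketch below illustrates that general idea only; it is not ONNX Runtime's MLTypeCallDispatcher. `ElemType`, its numeric values, `SizeOfFn`, and `DispatchOnType` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical integer tags standing in for a per-type integer constant.
enum class ElemType : int { kFloat = 1, kInt32 = 6, kInt64 = 7 };

// Example functor: called with the concrete C++ type chosen at runtime.
struct SizeOfFn {
  template <typename T>
  std::size_t operator()() const { return sizeof(T); }
};

// Switch on the integer tag once and forward to the typed call.
template <typename Fn>
auto DispatchOnType(ElemType type, Fn fn)
    -> decltype(fn.template operator()<float>()) {
  switch (type) {
    case ElemType::kFloat: return fn.template operator()<float>();
    case ElemType::kInt32: return fn.template operator()<std::int32_t>();
    case ElemType::kInt64: return fn.template operator()<std::int64_t>();
  }
  throw std::invalid_argument("unsupported element type");
}
```

A single `switch` on a small integer compiles to a jump table or a few comparisons, which is presumably the kind of saving the commit message refers to when it mentions avoiding repeated GetType<T>() calls.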
jignparm
fa30b1e758
Set ElementType to String type of node metadata, instead of byte[] (#2348)
* Set ElementType to String type of node metadata, instead of byte[]

* Fix spacing
2019-11-08 14:52:56 -08:00
Zhang Lei
7fcd752393
Cuda Reverse Sequence Op, mapping types of same size using same template function. (#2281) 2019-11-08 13:52:26 -08:00
Changming Sun
080a0a3186
Nuget pipeline changes (#2305)
1. refactor the pipeline, remove some duplicated code
2. Move Windows_py_GPU_Wheels job to Win-GPU-CUDA10. We'll deprecate the "Win-GPU" pool
3. Delete cpu-nocontribops-esrp-pipeline.yml and cpu-nocontribops-pipeline.yml
4. In Linux nuget jobs, run "make install" before creating the package, so that extra RPATH info will be removed
2019-11-08 09:45:52 -08:00
Scott McKay
5a3ea7469a
Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanUnusedInitializers. (#2320)
* Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanUnusedInitializers.

This means initializers that have been replaced during graph optimizations are not left in the GraphProto when we save an optimized model.

* Handle edge case where a model has an unused initializer with matching graph input by also removing the graph input.

* Use non-const iterators in std::find_if calls to make centos build happy.
2019-11-08 16:29:50 +10:00
Yulong Wang
da3c0ba14b
implement CPU contrib OP Attention (#2333) 2019-11-07 17:14:59 -08:00
Tianlei Wu
b539cc74c7
Add FastGelu Cuda Op for Gelu and Add bias fusion (#2293)
* Add FastGelu cuda op

* Add AddBiasGelu for experiment

* Revert "Add AddBiasGelu for experiment"

This reverts commit 5c1ee019858c657e6bb75887265cb85675626e5b.

* Add bias

* Add unit tests

* update comment

* update script

* fix build error

* update coding style

* update for CR feedback
Enable half2 optimization only when cuda arch >= 7.0

* move _Tanh to common.cuh
2019-11-07 17:05:55 -08:00
liuziyue
259bff8cf1
Layer Normalization Fusion (#2319)
basic layer normalization transform
2019-11-07 12:00:08 -08:00
Hariharan Seshadri
553537ed52
Add CUDA GatherElements kernel (#2310)
* Updates

* Update test

* Update

* Updates

* nits

* PR feedback

* Update

* Update

* PR feedback

* PR comments

* Update

* Fix build

* Fix build

* Nits

* Fix
2019-11-07 10:54:20 -08:00
Yufeng Li
6651d2f662
Make elementwise op run 4 items per thread (#2335)
Make elementwise op run 4 items per thread
unroll for loop to leverage ILP
remove unnecessary N==0 check inside elementwise GPU kernel
This improves the performance of GPU elementwise ops: ~2% performance gain on a popular NLP BERT model.
2019-11-06 17:15:25 -08:00
George Wu
ba0e7daf20
update dockerfiles/README (#2336) 2019-11-06 16:54:10 -08:00
baowenlei
0f1e24f4a9 [NupharEP] tensorize int8 GEMM for avx (#2142)
* finish avx tensorization and save state

* split tests for better debug

* add missing avx option

* update configure for AVX

* update tensorize avx support

* Merged PR 5327: Fix llvm cross compilation

Fix llvm cross compilation

Related work items: #4080
2019-11-06 14:35:13 -08:00
KeDengMS
58e6aaa414
Fix crash in releasing TLS from CUDA EP dtor (#2329)
thread_local/global/static destruction order depends on implementation details of compilers and OS. The bug happens when thread_local is already out of scope while static EP being destructed, thus causing access violation in EP's destructor when accessing thread_local.

The fix is to maintain ownership inside EP with a mapping from tid to ThreadLocalContext, to avoid accessing thread_local in EP's destructor. This way, no matter what the destruction order is, no access violation would be triggered.
2019-11-06 13:00:17 -08:00
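The fix described in the commit above (58e6aaa414) keeps ownership of per-thread state inside the execution provider, keyed by thread id. The following is a minimal sketch of that ownership pattern, assuming a mutex-guarded map; the `Provider` class, its methods, and the `scratch` field are hypothetical and are not taken from the actual CUDA EP code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical per-thread state (for example, handles the provider must release).
struct ThreadLocalContext {
  int scratch = 0;
};

// Hypothetical provider that owns every per-thread context itself.
class Provider {
 public:
  // Returns the calling thread's context, creating it on first use.
  ThreadLocalContext& GetContext() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::unique_ptr<ThreadLocalContext>& slot =
        contexts_[std::this_thread::get_id()];
    if (!slot) slot.reset(new ThreadLocalContext());
    return *slot;
  }

  // Number of threads that have asked for a context so far.
  std::size_t ContextCount() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return contexts_.size();
  }

  // The implicit destructor frees contexts_ and never reads a thread_local,
  // so it does not depend on static vs. thread_local destruction order.

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::thread::id, std::unique_ptr<ThreadLocalContext>> contexts_;
};
```

Because each context is held through a `unique_ptr`, the reference returned by `GetContext()` stays valid when the map rehashes. The trade-off in this sketch is a lock on every lookup, which a real implementation might avoid by caching the pointer.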
Yulong Wang
c0b8926863
implement CPU contrib OP EmbedLayerNormalization (#2332) 2019-11-06 12:27:08 -08:00
George Wu
06a6d74a67
update ngraph dockerfile. add python lib location to LD_LIBRARY_PATH for cuda/tensorrt Dockerfiles. (#2330) 2019-11-06 11:29:55 -08:00
Vinitra Swamy
ace19129b9 MCR Docker Images v1.0.0 refresh (#2302)
* update dockerfile table with new MCR tags

* add new openvino dockerfiles to table
2019-11-05 22:06:47 -08:00
Patrick Foley
151075790d [OpenVINO-EP] Update to latest version: OpenVINO 2019 R3.1 (#2308)
* Updates OpenVINO EP to latest version: 2019 R3.1

* Reviews fixed

* Update Dockerfile.openvino

* Addressed PR comments and disabled model tests temporarily

* Update Dockerfile.ubuntu_openvino
2019-11-05 19:55:46 -08:00
Dwayne Robinson
db454beacf
TensorDesc::Placement test failure - cherry pick Vibranium fix. (#2328) 2019-11-05 18:18:31 -08:00
Scott McKay
67ec626d88
Copy blocks in Slice when possible (#2312)
* Add logic to try and flatten inner dimensions being copied by Slice and do a block copy if they can be.
Do a block copy for just the inner most dimension where possible (applies even if we don't flatten inner dimensions).
2019-11-06 10:53:30 +10:00
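The block-copy idea in the commit above (67ec626d88) can be illustrated on a 2-D row-major array. This is a sketch under simplifying assumptions (two dimensions, step 1 on both axes, `float` data); `SliceRowMajor` is a hypothetical helper, not the Slice kernel's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Slice rows [row_begin, row_end) and columns [col_begin, col_end) out of a
// row-major matrix with `cols` columns. Hypothetical helper, step 1 only.
std::vector<float> SliceRowMajor(const std::vector<float>& input, std::size_t cols,
                                 std::size_t row_begin, std::size_t row_end,
                                 std::size_t col_begin, std::size_t col_end) {
  const std::size_t out_cols = col_end - col_begin;
  std::vector<float> output((row_end - row_begin) * out_cols);
  const float* in = input.data();

  if (out_cols == cols) {
    // Whole rows are selected, so both dimensions flatten into one
    // contiguous range and a single block copy is enough.
    std::copy(in + row_begin * cols, in + row_end * cols, output.data());
    return output;
  }

  // Otherwise copy the innermost dimension as one contiguous block per row.
  for (std::size_t r = row_begin; r < row_end; ++r) {
    const float* src = in + r * cols + col_begin;
    std::copy(src, src + out_cols, output.data() + (r - row_begin) * out_cols);
  }
  return output;
}
```

For a 3x4 input holding 0 through 11, slicing rows 1 to 3 and columns 1 to 3 takes the per-row path, while slicing rows 0 to 2 with all four columns takes the single flattened copy.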
Changming Sun
104f3b2a59 Exclude candy from CUDA tests 2019-11-05 15:22:09 -08:00
Changming Sun
143ae98a37
Fix a bug in onnxruntime_pybind_state.cc when TENSORRT is enabled (#2326) 2019-11-05 15:04:50 -08:00
George
8a102c6e99 apply eigen patch only for ACL. 2019-11-05 13:53:53 -08:00
Changming Sun
5ce4d4fc49 Fix a test failure when it runs on FreeBSD 2019-11-04 23:47:37 -08:00
Yufeng Li
035913d42f
Support int32_t for Reduction (#2317) 2019-11-04 20:52:01 -08:00
manashgoswami
d5c36bfff2 Updated links in docs (#2303)
* Update README.md

* Update README.md

* Update README.md
2019-11-03 09:10:56 -08:00
Faith Xu
556bae17a5 Fix versions table (#2309)
* Update table values

* Fix onnxml opset version
2019-11-03 08:58:21 -08:00
Yulong Wang
cba93f7c8d fix Gelu CPU: remove MayInplace() declaration (#2306) 2019-11-01 18:10:05 -07:00
Yulong Wang
204a6872d3
remove unused param 'input_count' in ConcatImpl (#2304) 2019-11-01 15:50:11 -07:00
Tianlei Wu
a6b2c9fc09
Fix mask in EmbedLayerNormalization (#2300) 2019-11-01 13:49:55 -07:00