Commit graph

11997 commits

Author SHA1 Message Date
Edward Chen
8647201ac7 Improved documentation for onnxruntime::utils::SwapByteOrderCopy(), added precondition check. 2019-11-19 11:03:40 -08:00
Scott McKay
be12cdc73f
Add CUDA If operator. (#2377)
* Add CUDA If operator.
Uses CPU operator for implementation.
By adding a CUDA version the inputs/outputs (with the exception of the 'cond' input) stay on GPU, and no other logic is required to avoid a copy to CPU across the control flow node.
2019-11-19 12:01:46 +10:00
Patrick Foley
1cb6bdc33c Added support for Pad-2 operator in OpenVINO-EP (#2405) 2019-11-18 15:57:27 -08:00
avidiyal
95e8c3377e onnxrt server documentation update (#2396) 2019-11-18 15:31:07 -08:00
Nick Groszewski
0e947bd328 feat(treeregressor): Update TreeEnsembleRegressor for type support (#2389)
Updates the `TreeEnsembleRegressor` to allow for `double`, `float`,
`int64`, and `int32` inputs to match the upstream specification.

Signed-off-by: Nick Groszewski <nicholas.groszewski@capitalone.com>
2019-11-18 13:07:38 -08:00
Hector Li
367361fc74
Fix the issue in matmul_add_fusion (#2407)
Fix the issue in matmul_add_fusion

If MatMul + Add has shape [K] * [K, N], resetting it to [1, K] * [K, N] makes the output shape [1, N], which also requires a reshape on the output.
Fix: just remove the shape reset to not fuse it.

Add a negative test case for matmul+add fusion
2019-11-18 10:50:44 -08:00
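The shape mismatch described in the matmul_add_fusion fix above can be illustrated with a small shape-inference sketch. The helper `matmul_shape` is hypothetical and only mimics NumPy-style MatMul rules for 1-D and 2-D operands; it is not ONNX Runtime code.

```python
def matmul_shape(a, b):
    """Return the output shape of a NumPy-style matmul for 1-D/2-D operands.

    Hypothetical helper, for illustration only.
    """
    a, b = tuple(a), tuple(b)
    a2 = (1,) + a if len(a) == 1 else a      # promote [K] to [1, K]
    b2 = b + (1,) if len(b) == 1 else b      # promote [K] to [K, 1]
    if a2[-1] != b2[0]:
        raise ValueError(f"inner dimensions differ: {a} x {b}")
    out = (a2[0], b2[1])
    # Drop the dimensions that were only added by promotion.
    if len(a) == 1:
        out = out[1:]
    if len(b) == 1:
        out = out[:-1]
    return out


K, N = 3, 5
original = matmul_shape((K,), (K, N))      # 1-D left operand
reshaped = matmul_shape((1, K), (K, N))    # left operand reset to [1, K]
print(original, reshaped)
```

The two results differ in rank, which is why resetting the input shape would also require a reshape on the output.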
KeDengMS
aa7c79eac9 [NupharEP] Update notebook and docker image (#2416)
Add BERT squad in Nuphar tutorial
Enhance speed comparison readability
2019-11-18 10:38:14 -08:00
Tianlei Wu
e1c17fd126
Add Reshape Fusion (#2395)
* Add reshape fusion

* Add some comments

* update comments

* update comment format

* update according to feedback

* update for recent logger change

* fix build error

* (1) Support both input and output edges in find path in graphutils
(2) Add a test case of only one constant initializer of Concat input.
(3) Refactor ReshapeFusion class to allow adding more subgraph fusions in the future.

* fix error

* (1) loosen constraint on initializer: non constant is allowed for reshape fusion.
(2) Change versions type to vector.
(3) Add logging.
(4) Return false when multiple output edges matched in FindPath. Add comments.

* only allow one direction (input or output) in FindPath
2019-11-18 10:07:10 -08:00
Pranav Sharma
f268e69c79
Minor optimization: if a node has already been placed, there's no need to find a kernel for it. (#2417) 2019-11-17 20:08:33 -08:00
baowenlei
5ab7041fa7 fix cross compile bug (#2415) 2019-11-16 01:32:57 -08:00
KeDengMS
1e03ce84eb
[NupharEP] force some low/zero cost ops to be inlined (#2409) 2019-11-15 16:03:35 -08:00
Scott McKay
c1d757a00b
Add opset 11 versions of the existing CUDA operators that had negative axis support explicitly added. (#2398)
* Add opset 11 versions of the existing CUDA operators that had negative axis support explicitly added.
2019-11-15 12:10:00 +10:00
Scott McKay
6cc57721f4
Change CUDA implementation of Transpose to support all fixed size tensor types (#2387)
* Change CUDA implementation of Transpose to not use a typed kernel so we can support more types with minimum binary size.
Add support for 8, 16, 32 and 64 bit types.
Add unit tests.
Add method so the implementation can be called directly (will be used by CUDA Scan very soon).

* Disable TensorRT for MLFloat16 and int8 unit tests.

* Address PR comment and add support for calling cublas implementation if type is mlfloat16.
2019-11-15 10:36:28 +10:00
Changming Sun
109b3cb450
Avoid using the default logger in the graph lib and optimizers (#2361)
1. Use the session logger if it is available.
2. Don't disable warning 4100 globally. We should fix the warnings instead of disabling them.
2019-11-14 13:23:28 -08:00
KeDengMS
b15e43a541
[NupharEP] Multiple optimizations (#2380)
Fuse transpose into MatMul
Implement Pow and constant scalar simplification
Vectorize ReduceMean
Improve symbolic shape inference
Minor updates for better debugging in fused function name
2019-11-14 10:40:33 -08:00
Pranav Sharma
7e164eaa6a
Fix reuse logic in allocation planner. (#2393)
* Fix reuse logic in allocation planner.

* PR comments

* Add helpful comments

* Don't allow reuse across string tensors.
2019-11-13 22:51:12 -08:00
Ilya Lavrenov
b90d55b7ea Fixed compilation with ngraph (#2388) 2019-11-13 17:49:00 -08:00
nihui
dde410e073 fix BUILD.md typo (#2375)
build.py: error: argument --config: invalid choice: 'RelWithDebugInfo' (choose from 'Debug', 'MinSizeRel', 'Release', 'RelWithDebInfo')
2019-11-13 17:48:08 -08:00
KeDengMS
51571030ef
Another try to stabilize CUDA CI (#2383)
The root cause seems to be a failure in CUDA dealloc during teardown. The cudaFree return code was ignored before, so the debug check should be ignored as well.
2019-11-13 15:58:15 -08:00
liuziyue
ffa2812587
Skip layer norm transform (#2350)
* skip layer normalization transformer
2019-11-13 13:46:09 -08:00
Yufeng Li
8ed2928dd5
Fuse Add + Gelu (#2360)
Implement the transformer to fuse add + gelu
Implement the accurate kernel
2019-11-13 09:26:00 -08:00
liuziyue
4b72fedbd5
Layer Norm Fusion Fix (#2379)
* layer norm fusion fix

* Add input shape check in code and unit tests
2019-11-12 17:19:51 -08:00
Scott McKay
8c733c8d82
Add opset 11 version of Split to CUDA ops (#2376)
Organize the CUDA ops definitions so all the opset 10 and 11 parts are together (same setup used for CPU ops)
2019-11-13 07:40:13 +10:00
Scott McKay
c0d23d5ffe
Fix bug with Slice. Need to pass in flattened input dimensions so the initial offset into the input is calculated correctly. (#2372) 2019-11-13 07:00:26 +10:00
KeDengMS
9e26f4de6f
Extend OneHot CPU kernel to support more types (#2311)
* Extend OneHot CPU kernel to support input int64_t, depth int32_t, output float

* Skip BERT before the test data fix is picked up
2019-11-12 11:54:06 -08:00
Ashwini Khade
437772d5bc
update output size calculation for resize (#2366)
* change how output size is calculated for resize op

* add tests for ver 10 resize
2019-11-12 10:06:17 -08:00
KeDengMS
192dcfaa8e
Fix a bug in TLS refcount that may have destabilized CUDA CI (#2374) 2019-11-12 00:48:31 -08:00
Yang Chen
41b9f01e4c
test bidaf with nuphar for avx target (#2370)
increase nuphar test coverage a bit
2019-11-12 00:47:13 -08:00
Changming Sun
fc6773a65b
Add Tracelogging for profiling (#1639)
Enabled only if onnxruntime_ENABLE_INSTRUMENT is ON
2019-11-11 21:34:10 -08:00
George Wu
0c6e9f94d0
fix builds enabling onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS (#2369)
* fix builds enabling onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS

* update
2019-11-11 15:26:18 -08:00
Scott McKay
53ed36a3da
Add helper to create output to minimize binary size. (#2365)
Add ConstEigenTensorMap typedef so we don't unnecessarily const_cast the const input Tensor.
2019-11-12 09:08:04 +10:00
Zhang Lei
aa37e2de8f
Directly use Python numpy array's memory if already contiguous. (#2355)
* Directly use the Python numpy array's memory if it is already contiguous. This
could greatly improve performance for sessions with large inputs,
such as a 1920x1080 image for fastrcnn, where a 30~40% speedup could be achieved.

* Add test case to enforce contiguous/non-contiguous numpy arrays as inputs.
2019-11-11 13:46:55 -08:00
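The idea of reusing memory only when it is already contiguous can be sketched with the standard library's `memoryview`, which exposes a `c_contiguous` flag. `as_contiguous` is a hypothetical helper; the actual change works on NumPy arrays inside the Python binding, which is not shown here.

```python
def as_contiguous(view):
    """Return (buffer, copied): reuse the memory if C-contiguous, else copy."""
    if view.c_contiguous:
        return view, False                    # zero-copy path
    return memoryview(view.tobytes()), True   # fall back to a packed copy


data = bytearray(range(8))
whole = memoryview(data)     # contiguous view over the whole buffer
strided = whole[::2]         # every second byte: not contiguous

buf1, copied1 = as_contiguous(whole)
buf2, copied2 = as_contiguous(strided)
print(copied1, copied2, bytes(buf2))
```

Only the strided view pays for a copy; the contiguous view is handed through unchanged, which is where a saving on large inputs would come from.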
Zhang Lei
ed6da0d191
Implement cuda nonzero op. (#2056)
Implement cuda nonzero op.
2019-11-11 13:45:52 -08:00
avidiyal
3d3cf0e159 Openvino EP R3.1 onnxrt server (#2357)
* onnxrt server with OVEP

* onnxrt server with OVEP

* Update Dockerfile.server.openvino

* onnxrt server OVEP fix reviews

* onnxrt server OVEP fix reviews
2019-11-11 12:22:19 -08:00
Scott McKay
599d72a94f
Fix/test dim value of 0 handling in a couple of places (#2337)
* Update the CUDA Where implementation broadcasting logic to handle a dim with value of 0.
Add unit test
Also add unit test for unary op with dim value of 0

* Exclude ngraph from Where test with 0 dim.
2019-11-11 07:57:19 +10:00
Dmitri Smirnov
25b3c51661
Introduce PrimitiveType into a Type System along with an integer constant (#2307)
Improve perf by avoiding GetType<T>() calls. Introduce MLTypeCallDispatcher to switch on Input Type. Add Tensor IsType<T>() fast method.
2019-11-08 17:47:06 -08:00
jignparm
fa30b1e758
Set ElementType to String type of node metadata, instead of byte[] (#2348)
* Set ElementType to String type of node metadata, instead of byte[]

* Fix spacing
2019-11-08 14:52:56 -08:00
Zhang Lei
7fcd752393
Cuda Reverse Sequence Op, mapping types of same size using same template function. (#2281) 2019-11-08 13:52:26 -08:00
Changming Sun
080a0a3186
Nuget pipeline changes (#2305)
1. refactor the pipeline, remove some duplicated code
2. Move Windows_py_GPU_Wheels job to Win-GPU-CUDA10. We'll deprecate the "Win-GPU" pool
3. Delete cpu-nocontribops-esrp-pipeline.yml and cpu-nocontribops-pipeline.yml
4. In Linux nuget jobs, run "make install" before creating the package, so that extra RPATH info will be removed
2019-11-08 09:45:52 -08:00
Scott McKay
5a3ea7469a
Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanUnusedInitializers. (#2320)
* Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanupUnusedInitializers.

This means initializers that have been replaced during graph optimizations are not left in the GraphProto when we save an optimized model.

* Handle edge case where a model has an unused initializer with matching graph input by also removing the graph input.

* Use non-const iterators in std::find_if calls to make centos build happy.
2019-11-08 16:29:50 +10:00
Yulong Wang
da3c0ba14b
implement CPU contrib OP Attention (#2333) 2019-11-07 17:14:59 -08:00
Tianlei Wu
b539cc74c7
Add FastGelu Cuda Op for Gelu and Add bias fusion (#2293)
* Add FastGelu cuda op

* Add AddBiasGelu for experiment

* Revert "Add AddBiasGelu for experiment"

This reverts commit 5c1ee019858c657e6bb75887265cb85675626e5b.

* Add bias

* Add unit tests

* update comment

* update script

* fix build error

* update coding style

* update for CR feedback
Enable half2 optimization only when cuda arch >= 7.0

* move _Tanh to common.cuh
2019-11-07 17:05:55 -08:00
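The commit above does not state which formula FastGelu uses. As a rough sketch, assuming it is the widely used tanh approximation (the mention of a `_Tanh` helper hints at this but does not confirm it), the approximation and the bias add can be compared against the erf-based definition of Gelu. All function names below are illustrative.

```python
import math


def gelu_exact(x):
    """Gelu via the error function: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh approximation of Gelu (assumed, not confirmed)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def add_bias_gelu(xs, bias, gelu=gelu_tanh):
    """Apply the bias add and the activation in a single pass."""
    return [gelu(x + b) for x, b in zip(xs, bias)]


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
bias = [0.1] * len(xs)
fused = add_bias_gelu(xs, bias)
# Largest gap between the approximation and the erf-based form on xs.
worst = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in xs)
print(fused, worst)
```

Doing the add and the activation in one pass avoids materializing the intermediate sum, which is the usual motivation for this kind of fusion.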
liuziyue
259bff8cf1
Layer Normalization Fusion (#2319)
basic layer normalization transform
2019-11-07 12:00:08 -08:00
Hariharan Seshadri
553537ed52
Add CUDA GatherElements kernel (#2310)
* Updates

* Update test

* Update

* Updates

* nits

* PR feedback

* Update

* Update

* PR feedback

* PR comments

* Update

* Fix build

* Fix build

* Nits

* Fix
2019-11-07 10:54:20 -08:00
Yufeng Li
6651d2f662
Make elementwise op run 4 items per thread (#2335)
Make elementwise op run 4 items per thread
Unroll the for loop to leverage ILP
Remove unnecessary N==0 check inside elementwise GPU kernel
It can improve the performance of GPU elementwise ops. ~2% performance gain on popular NLP bert model.
2019-11-06 17:15:25 -08:00
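The "4 items per thread" scheme can be simulated on the CPU to show the index arithmetic. This is only a sketch: the strided layout within a block and the names `elementwise_add` and `threads_per_block` are assumptions for illustration, not taken from the actual kernel.

```python
ITEMS_PER_THREAD = 4  # value taken from the commit title


def elementwise_add(a, b, threads_per_block=4):
    """Simulate a 1-D GPU launch where each thread handles several items."""
    n = len(a)
    out = [None] * n
    items_per_block = threads_per_block * ITEMS_PER_THREAD
    num_blocks = (n + items_per_block - 1) // items_per_block  # ceil division
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            start = block * items_per_block + thread
            # On a GPU this inner loop would be unrolled by the compiler.
            for i in range(ITEMS_PER_THREAD):
                idx = start + i * threads_per_block
                if idx < n:               # guard the tail of the array
                    out[idx] = a[idx] + b[idx]
    return out


a = list(range(10))
b = [10] * 10
print(elementwise_add(a, b))
```

In this sketch an empty input produces zero blocks, so no per-element check for an empty array is needed inside the loop.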
George Wu
ba0e7daf20
update dockerfiles/README (#2336) 2019-11-06 16:54:10 -08:00
baowenlei
0f1e24f4a9 [NupharEP] tensorize int8 GEMM for avx (#2142)
* finish avx tensorization and save state

* split tests for better debug

* add missing avx option

* update configure for AVX

* update tensorize avx support

* Merged PR 5327: Fix llvm cross compilation

Fix llvm cross compilation

Related work items: #4080
2019-11-06 14:35:13 -08:00
KeDengMS
58e6aaa414
Fix crash in releasing TLS from CUDA EP dtor (#2329)
thread_local/global/static destruction order depends on implementation details of compilers and OS. The bug happens when the thread_local is already out of scope while the static EP is being destructed, thus causing an access violation in the EP's destructor when it accesses the thread_local.

The fix is to maintain ownership inside EP with a mapping from tid to ThreadLocalContext, to avoid accessing thread_local in EP's destructor. This way, no matter what the destruction order is, no access violation would be triggered.
2019-11-06 13:00:17 -08:00
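The ownership pattern described in the fix above, a map from thread id to per-thread context held by the provider itself, can be sketched as follows. This is a Python illustration of the idea, not the C++ implementation; `ExecutionProvider`, `context` and `close` are hypothetical names.

```python
import threading


class ThreadLocalContext:
    """Stand-in for per-thread resources (handles, scratch buffers, ...)."""

    def __init__(self, tid):
        self.tid = tid
        self.released = False

    def release(self):
        self.released = True


class ExecutionProvider:
    """Owns every per-thread context, keyed by thread id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}

    def context(self):
        tid = threading.get_ident()
        with self._lock:
            ctx = self._contexts.get(tid)
            if ctx is None:
                ctx = self._contexts[tid] = ThreadLocalContext(tid)
            return ctx

    def close(self):
        # Cleanup touches only the owned map, never thread-local storage.
        with self._lock:
            contexts = list(self._contexts.values())
            self._contexts.clear()
        for ctx in contexts:
            ctx.release()
        return contexts


ep = ExecutionProvider()
barrier = threading.Barrier(3)  # keep workers alive together so ids stay distinct


def work():
    ep.context()
    barrier.wait()


workers = [threading.Thread(target=work) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
main_ctx = ep.context()
released = ep.close()
print(len(released), all(c.released for c in released))
```

Because the provider holds the contexts directly, cleanup does not depend on whether any thread-local storage still exists at teardown time.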
Yulong Wang
c0b8926863
implement CPU contrib OP EmbedLayerNormalization (#2332) 2019-11-06 12:27:08 -08:00
George Wu
06a6d74a67
update ngraph dockerfile. add python lib location to LD_LIBRARY_PATH for cuda/tensorrt Dockerfiles. (#2330) 2019-11-06 11:29:55 -08:00