onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-08 00:23:03 +00:00

Author	SHA1	Message	Date
Scott McKay	6cc57721f4	Change CUDA implementation of Transpose to support all fixed size tensor types (#2387 ) * Change CUDA implementation of Transpose to not use a typed kernel so we can support more types with minimum binary size. Add support for 8, 16, 32 and 64 bit types. Add unit tests. Add method so the implementation can be called directly (will be used by CUDA Scan very soon). * Disable TensorRT for MLFloat16 and int8 unit tests. * Address PR comment and add support for calling cublas implementation if type is mlfloat16.	2019-11-15 10:36:28 +10:00
Changming Sun	109b3cb450	Avoid using the default logger in the graph lib and optimizers (#2361 ) 1. Use the session logger if it is available. 2. Don't disable warning 4100 globally. We should fix the warnings instead of disabling them.	2019-11-14 13:23:28 -08:00
KeDengMS	b15e43a541	[NupharEP] Multiple optimizations (#2380 ) * Fuse transpose into MatMul * Implement Pow and constant scalar simplification * Vectorize ReduceMean * Improve symbolic shape inference * Minor updates for better debugging in fused function name	2019-11-14 10:40:33 -08:00
Pranav Sharma	7e164eaa6a	Fix reuse logic in allocation planner. (#2393 ) * Fix reuse logic in allocation planner. * PR comments * Add helpful comments * Don't allow reuse across string tensors.	2019-11-13 22:51:12 -08:00
Ilya Lavrenov	b90d55b7ea	Fixed compilation with ngraph (#2388 )	2019-11-13 17:49:00 -08:00
nihui	dde410e073	fix BUILD.md typo (#2375 ) build.py: error: argument --config: invalid choice: 'RelWithDebugInfo' (choose from 'Debug', 'MinSizeRel', 'Release', 'RelWithDebInfo')	2019-11-13 17:48:08 -08:00
KeDengMS	51571030ef	Another try to stabilize CUDA CI (#2383 ) The root cause seems to be a failure in CUDA dealloc during teardown. The cudaFree return code was ignored before, so the debug check should ignore it as well.	2019-11-13 15:58:15 -08:00
liuziyue	ffa2812587	Skip layer norm transform (#2350 ) * skip layer normalization transformer	2019-11-13 13:46:09 -08:00
Yufeng Li	8ed2928dd5	Fuse Add + Gelu (#2360 ) * Implement the transformer to fuse add + gelu * Implement the accurate kernel	2019-11-13 09:26:00 -08:00
liuziyue	4b72fedbd5	Layer Norm Fusion Fix (#2379 ) * layer norm fusion fix * Add input shape check in code and unit tests	2019-11-12 17:19:51 -08:00
Scott McKay	8c733c8d82	Add opset 11 version of Split to CUDA ops (#2376 ) Organize the CUDA ops definitions so all the opset 10 and 11 parts are together (same setup used for CPU ops)	2019-11-13 07:40:13 +10:00
Scott McKay	c0d23d5ffe	Fix bug with Slice. Need to pass in flattened input dimensions so the initial offset into the input is calculated correctly. (#2372 )	2019-11-13 07:00:26 +10:00
KeDengMS	9e26f4de6f	Extend OneHot CPU kernel to support more types (#2311 ) * Extend OneHot CPU kernel to support input int64_t, depth int32_t, output float * Skip BERT before the test data fix is picked up	2019-11-12 11:54:06 -08:00
Ashwini Khade	437772d5bc	update output size calculation for resize (#2366 ) * change how output size is calculated for resize op * add tests for ver 10 resize	2019-11-12 10:06:17 -08:00
KeDengMS	192dcfaa8e	Fix a bug in TLS refcount that may have destabilized CUDA CI (#2374 )	2019-11-12 00:48:31 -08:00
Yang Chen	41b9f01e4c	test bidaf with nuphar for avx target (#2370 ) increase nuphar test coverage a bit	2019-11-12 00:47:13 -08:00
Changming Sun	fc6773a65b	Add Tracelogging for profiling (#1639 ) Enabled only if onnxruntime_ENABLE_INSTRUMENT is ON	2019-11-11 21:34:10 -08:00
George Wu	0c6e9f94d0	fix builds enabling onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS (#2369 ) * fix builds enabling onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS * update	2019-11-11 15:26:18 -08:00
Scott McKay	53ed36a3da	Add helper to create output to minimize binary size. (#2365 ) Add ConstEigenTensorMap typedef so we don't unnecessarily const_cast the const input Tensor.	2019-11-12 09:08:04 +10:00
Zhang Lei	aa37e2de8f	Directly use the Python numpy array's memory if it is already contiguous. (#2355 ) * Directly use the Python numpy array's memory if it is already contiguous. This can greatly improve performance for sessions with large inputs, such as a big 1920x1080 image with fastrcnn, where a 30~40% speed-up could be achieved. * Add test cases enforcing contiguous/non-contiguous numpy arrays as inputs.	2019-11-11 13:46:55 -08:00
Zhang Lei	ed6da0d191	Implement cuda nonzero op. (#2056 ) Implement cuda nonzero op.	2019-11-11 13:45:52 -08:00
avidiyal	3d3cf0e159	Openvino EP R3.1 onnxrt server (#2357 ) * onnxrt server with OVEP * onnxrt server with OVEP * Update Dockerfile.server.openvino * onnxrt server OVEP fix reviews * onnxrt server OVEP fix reviews	2019-11-11 12:22:19 -08:00
Scott McKay	599d72a94f	Fix/test dim value of 0 handling in a couple of places (#2337 ) * Update the CUDA Where implementation broadcasting logic to handle a dim with value of 0. Add unit test Also add unit test for unary op with dim value of 0 * Exclude ngraph from Where test with 0 dim.	2019-11-11 07:57:19 +10:00
Dmitri Smirnov	25b3c51661	Introduce PrimitiveType into a Type System along with an integer constant (#2307 ) Improve perf by avoiding GetType<T>() calls. Introduce MLTypeCallDispatcher to switch on Input Type. Add Tensor IsType<T>() fast method.	2019-11-08 17:47:06 -08:00
jignparm	fa30b1e758	Set ElementType to String type of node metadata, instead of byte[] (#2348 ) * Set ElementType to String type of node metadata, instead of byte[] * Fix spacing	2019-11-08 14:52:56 -08:00
Zhang Lei	7fcd752393	Cuda Reverse Sequence Op, mapping types of the same size to the same template function. (#2281 )	2019-11-08 13:52:26 -08:00
Changming Sun	080a0a3186	Nuget pipeline changes (#2305 ) 1. Refactor the pipeline and remove some duplicated code 2. Move the Windows_py_GPU_Wheels job to Win-GPU-CUDA10. We'll deprecate the "Win-GPU" pool 3. Delete cpu-nocontribops-esrp-pipeline.yml and cpu-nocontribops-pipeline.yml 4. In Linux nuget jobs, run "make install" before creating the package, so that extra RPATH info is removed	2019-11-08 09:45:52 -08:00
Scott McKay	5a3ea7469a	Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanUnusedInitializers. (#2320 ) * Remove unused initializer from GraphProto as well as name_to_initial_tensor_ in CleanupUnusedInitializers. This means initializers that have been replaced during graph optimizations are not left in the GraphProto when we save an optimized model. * Handle edge case where a model has an unused initializer with matching graph input by also removing the graph input. * Use non-const iterators in std::find_if calls to make centos build happy.	2019-11-08 16:29:50 +10:00
Yulong Wang	da3c0ba14b	implement CPU contrib OP Attention (#2333 )	2019-11-07 17:14:59 -08:00
Tianlei Wu	b539cc74c7	Add FastGelu Cuda Op for Gelu and Add bias fusion (#2293 ) * Add FastGelu cuda op * Add AddBiasGelu for experiment * Revert "Add AddBiasGelu for experiment" This reverts commit 5c1ee019858c657e6bb75887265cb85675626e5b. * Add bias * Add unit tests * update comment * update script * fix build error * update coding style * update for CR feedback Enable half2 optimization only when cuda arch >= 7.0 * move _Tanh to common.cuh	2019-11-07 17:05:55 -08:00
liuziyue	259bff8cf1	Layer Normalization Fusion (#2319 ) basic layer normalization transform	2019-11-07 12:00:08 -08:00
Hariharan Seshadri	553537ed52	Add CUDA GatherElements kernel (#2310 ) * Updates * Update test * Update * Updates * nits * PR feedback * Update * Update * PR feedback * PR comments * Update * Fix build * Fix build * Nits * Fix	2019-11-07 10:54:20 -08:00
Yufeng Li	6651d2f662	Make elementwise op run 4 items per thread (#2335 ) Description: * Make elementwise op run 4 items per thread * Unroll for loop to leverage ILP * Remove unnecessary N==0 check inside elementwise GPU kernel. Motivation and Context: It can improve the performance of GPU elementwise ops, with a ~2% performance gain on the popular NLP BERT model.	2019-11-06 17:15:25 -08:00
George Wu	ba0e7daf20	update dockerfiles/README (#2336 )	2019-11-06 16:54:10 -08:00
baowenlei	0f1e24f4a9	[NupharEP] tensorize int8 GEMM for avx (#2142 ) * finish avx tensorization and save state * split tests for better debug * add missing avx option * update configure for AVX * update tensorize avx support * Merged PR 5327: Fix llvm cross compilation Fix llvm cross compilation Related work items: #4080	2019-11-06 14:35:13 -08:00
KeDengMS	58e6aaa414	Fix crash in releasing TLS from CUDA EP dtor (#2329 ) thread_local/global/static destruction order depends on implementation details of compilers and OS. The bug happens when the thread_local is already out of scope while the static EP is being destructed, causing an access violation in the EP's destructor when it accesses the thread_local. The fix is to maintain ownership inside the EP with a mapping from tid to ThreadLocalContext, to avoid accessing thread_local in the EP's destructor. This way, no matter what the destruction order is, no access violation is triggered.	2019-11-06 13:00:17 -08:00
Yulong Wang	c0b8926863	implement CPU contrib OP EmbedLayerNormalization (#2332 )	2019-11-06 12:27:08 -08:00
George Wu	06a6d74a67	update ngraph dockerfile. add python lib location to LD_LIBRARY_PATH for cuda/tensorrt Dockerfiles. (#2330 )	2019-11-06 11:29:55 -08:00
Vinitra Swamy	ace19129b9	MCR Docker Images v1.0.0 refresh (#2302 ) * update dockerfile table with new MCR tags * add new openvino dockerfiles to table	2019-11-05 22:06:47 -08:00
Patrick Foley	151075790d	[OpenVINO-EP] Update to latest version: OpenVINO 2019 R3.1 (#2308 ) * Updates OpenVINO EP to latest version: 2019 R3.1 * Reviews fixed * Update Dockerfile.openvino * Addressed PR comments and disabled model tests temporarily * Update Dockerfile.ubuntu_openvino	2019-11-05 19:55:46 -08:00
Dwayne Robinson	db454beacf	TensorDesc::Placement test failure - cherry pick Vibranium fix. (#2328 )	2019-11-05 18:18:31 -08:00
Scott McKay	67ec626d88	Copy blocks in Slice when possible (#2312 ) * Add logic to try and flatten inner dimensions being copied by Slice and do a block copy if they can be. Do a block copy for just the inner most dimension where possible (applies even if we don't flatten inner dimensions).	2019-11-06 10:53:30 +10:00
Changming Sun	104f3b2a59	Exclude candy from CUDA tests	2019-11-05 15:22:09 -08:00
Changming Sun	143ae98a37	Fix a bug in onnxruntime_pybind_state.cc when TENSORRT is enabled (#2326 )	2019-11-05 15:04:50 -08:00
George	8a102c6e99	apply eigen patch only for ACL.	2019-11-05 13:53:53 -08:00
Changming Sun	5ce4d4fc49	Fix a test failure when it runs on FreeBSD	2019-11-04 23:47:37 -08:00
Yufeng Li	035913d42f	Support int32_t for Reduction (#2317 )	2019-11-04 20:52:01 -08:00
manashgoswami	d5c36bfff2	Updated links in docs (#2303 ) * Update README.md * Update README.md * Update README.md	2019-11-03 09:10:56 -08:00
Faith Xu	556bae17a5	Fix versions table (#2309 ) * Update table values * Fix onnxml opset version	2019-11-03 08:58:21 -08:00
Yulong Wang	cba93f7c8d	fix Gelu CPU: remove MayInplace() declaration (#2306 )	2019-11-01 18:10:05 -07:00


1585 commits