onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-16 18:31:27 +00:00

Author	SHA1	Message	Date
Vincent Wang	1fa1c51544	bug fix for name of gradient constant (#5626 ) Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>	2020-10-30 07:08:19 +08:00
KeDengMS	b4869926d3	[CUDA EP] remove per-thread allocator (#5415) Now that we are using the legacy default stream, which is shared among all inference threads, there is no need for a per-thread allocator. In the past, a race could happen when two threads ran concurrently on the GPU: thread1: allocA->copyA->computeA->freeA; thread2: allocB->copyB->computeB->freeB. Note that freeA/B only means the buffer is ready to be allocated on CPU, while the corresponding operation on GPU is not finished yet. It is possible for thread1 and thread2 to use the same buffer when the alloc/free pairs are not interleaved (note that alloc/free is thread-safe). If the GPU commands run in separate per-thread default streams, there is a chance that copyA/computeA are interleaved with copyB/computeB, even when the order of CPU execution is not interleaved. This would cause incorrect results if computeB uses copyA's results. By using one legacy default stream, the CPU execution order matches the GPU execution order, so if A and B use the same buffer from alloc, the corresponding copy/compute won't be interleaved. If the copy/compute is indeed interleaved, then allocA and allocB return different buffers, so there is no race either.	2020-10-29 11:33:05 -07:00
Sergii Dymchenko	2e1fa3ccb7	Fix GeluRecompute for 2 inputs case. (#5573 ) * Add test for FastGelu + GeluRecompute. * Fix GeluRecompute for 2 inputs case. * Fix test for BiasGelu + GeluRecompute. * Copy all inputs to Gelu, not just 2. * Move GeluRecompute test to training-specific file.	2020-10-29 00:07:13 -07:00
Dwayne Robinson	b85e7a19ea	isalnum is not defined - include cctype (#5623 )	2020-10-28 23:31:34 -07:00
Changming Sun	e6956be40c	Publish no-openmp python packages to test pypi (#5610)	2020-10-28 19:49:53 -07:00
Tracy Sharpe	b68e98e0b0	optimize QLinearConv depthwise convolutions (#5605 )	2020-10-28 16:42:53 -07:00
liqunfu	5129b4d5bc	batch size tests (#5508 ) * batch size tests Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-28 15:55:40 -07:00
Rohith_Kvsp	50582abe93	Fix IS_ANDROID Issue (#5599) Fixed static IS_ANDROID detection. The final static IS_ANDROID field was causing the error "Unsupport arch:aarch64", so IS_ANDROID was removed and replaced with isAndroid().	2020-10-28 14:42:33 -07:00
Ryan Lai	bbfd914d72	Skip new model test additions (#5611 )	2020-10-28 13:27:49 -07:00
Juliana Franco	27c6d1eeb2	move variable declaration to avoid unused variable error (#5603 ) Co-authored-by: Juliana <jufranc@microsoft.com>	2020-10-28 09:23:58 -07:00
George Wu	0dbf3e8893	enable arena for arm64 (#5613 )	2020-10-28 08:40:43 -07:00
Tim Harris	5e8952ef89	ThreadPool clean up: mm_pause in loops, correctly spin-then-wait, and adopt static methods consistently in the API (#5590) Description: This change makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses mm_pause intrinsics in spin loops, helping avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than to start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases. Motivation and Context: The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used, they will run work sequentially.	2020-10-28 09:49:18 +00:00
liqunfu	92662659ba	Liqun/remove number matching (#5606 ) replace number matching with relaxed comparison in frontend tests Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-27 21:27:37 -07:00
Ryan Hill	e90b6f06d1	Factor out IAllocator so that it can be shared with shared providers (#5567 ) * Factor out IAllocator so shared providers can use it directly.	2020-10-27 17:28:17 -07:00
Suffian Khan	e5b0d192f4	pin transformers dependence to sentencepiece==0.1.92 due to ci fail (#5607 )	2020-10-27 16:21:40 -07:00
Maajid khan	ddf83d1ace	Maajid/multi threading 2 (#5568 ) * Enabled multi-threading for OpenVino EP ->Enabled support for concurrent_session_runs Run UEP using concurrent_session_runs > 1 Enabled support for ORT_PARALLEL ExecutionMode ->Documentation Added for Enabling MultiThreading Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com> * Minor Fixes added Configure the value of nireq during Runtime Documentation typos rectified and details added for Multi_Threaded Inference Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com> * Some checks added for this fix Added checks to invalidate wrong nireq value and assigned it to default value of 8 Added new config options for enable_vpu_fast_compile which were changed w.r.t OpenVINO_2021.1 Release Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>	2020-10-27 14:48:12 -07:00
Weixing Zhang	b851973f22	pipeline_worker_pool_.JoinAll() should be called in pipeline code path (#5604 ) Co-authored-by: Weixing Zhang <wezhan@microsoft.com>	2020-10-27 11:57:46 -07:00
ytaous	6f824c25e5	Dropout op elimination - enable for ORT training (#5588 ) * dropout elimination * per comments * fix build * fix build Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-27 11:51:23 -07:00
Dmitri Smirnov	3433576fd3	Support for Sparse Initializers (#5540) Introduce sparse_initializers support. Convert them to dense on model load and prune graph_proto_ so they don't consume space. Convert back to sparse on ORT Format model save. Implement serializing sparse initializers to OrtFormat. Fix Model::ToProto() to return original sparse initializers. Set a flag that graph_sync is needed when loading a simple ORT Format model; otherwise nothing is resolved. Add ORT Format history to README.md. ifdef MINIMAL build for DenseToSparseTensorInitializer. Allow duplicate initializers to support existing models; issue a warning instead of aborting. * Revert "Remove SparseTensor support from minimal build. (#5114)" This reverts commit `59ee8ffb17`. Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>	2020-10-27 10:32:06 -07:00
Yufeng Li	30cdc74bc0	Enable prepacking in subgraph (#5433) Prepacking in subgraphs is not currently supported. We see more and more models with subgraphs that contain MatMul, MatMulInteger, and other ops. Prepacking can speed up those models significantly.	2020-10-26 22:22:31 -07:00
Changming Sun	564da960ce	Fix nuphar docker file build break	2020-10-26 20:08:07 -07:00
Hariharan Seshadri	6c310858e3	Support opset-13 Resize kernels (#5575 )	2020-10-26 17:26:06 -07:00
Ramakrishnan Sivakumar	5bcb5f5a3d	MLAS: Add support for AVXVNNI (#5592 ) Adds Gemm kernels with AVXVNNI support for Int8 acceleration	2020-10-26 16:27:48 -07:00
Sherlock	694a4d6413	Add more loggings for GradientBuilder (#5556 ) * Add more loggings for GradientBuilder Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-26 15:15:52 -07:00
edgchen1	68fe722691	GatherGrad optimization (#5524 ) The existing implementation of the GatherGrad CUDA kernel does not do work in a very parallel manner for certain inputs which can lead to poor performance. The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output. Previously, each sum was computed by a single thread. If there is an instance of a summation of a large number of values, it can significantly impact the overall kernel execution time. The updated version has an alternate implementation which splits the sums into partial sums which get accumulated together later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead. The implementation was adapted from PyTorch (`b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu`).	2020-10-26 12:53:53 -07:00
Sergii Dymchenko	8224718f8f	Enable CommonSubexpressionElimination in training. (#5504) * Add test for CommonSubexpressionElimination in training. * Enable CommonSubexpressionElimination in training. * Add CommonSubexpressionEliminationApplyOnce for training.	2020-10-26 11:25:15 -07:00
Hariharan Seshadri	44773c60e3	Add a CUDA based IOBinding test (#5572 )	2020-10-26 10:57:36 -07:00
Xavier Dupré	f4cee22b9b	Handle -inf in ReduceSumLogExp, fix regression introduced in PR #5370 (#5583 ) * Handle -inf in ReduceSumLogExp operator * Update reduction_ops_test.cc * Remove a case which has a different behaviour CPU/GPU	2020-10-26 09:58:02 +01:00
Tracy Sharpe	502f67ba58	MLAS: implement u8x8 GEMM for aarch32 (#5580 )	2020-10-25 23:05:12 -07:00
Andrew McDowell	b2da700e4d	Allow Upper case letters in RHS of einsum equations. (#5569 ) Co-authored-by: Andrew McDowell <andrew@neva-labs.com>	2020-10-25 18:11:12 -07:00
Ye Wang	51af108af5	Support older version of slice in reshape fusion (#5574 ) * support older version of slice in reshape fusion * fix * review partial comments * add test * add gen file	2020-10-24 14:48:18 -07:00
Du Li	860cb22260	Bug fix for C API (#5520 ) * remove if_def from C api * Fix CI issues. * revert change for symbols.txt	2020-10-24 13:37:58 -07:00
Pranav Sharma	3f3b202e36	Optimize GatherElements further, add threshold for parallelizing Scaler. (#5579 ) * Optimize GatherElements more. * Optimize GatherElements further, add threshold for parallelizing Scaler. * Add basic tests to exercises the parallel path	2020-10-24 12:38:31 -07:00
Guoyu Wang	3f06286154	Add Flatten support for NNAPI (#5545 ) * Add flatten support for NNAPI, correct some typo in NNAPI code files * Address review comments * Update CanSkipReshape * Add test for verify NNAPI is actually running for a supported model * Adding test for reshape/flatten test for NNAPI * Add one extra verbose log for skipping reshape * Fix Android CI failure * Correct test file name to fix Android CI failure	2020-10-22 18:15:53 -07:00
ytaous	7da5949279	NVTX label change (#5562 ) * label change * more info on label Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-22 10:34:20 -07:00
Andrews548	20bc83400b	ACL/ArmNN update (#5515 ) * Build ACL and ArmNN with custom library path * Define import to tensor as a separate function for maintenance and readability * Enabled optimized depthwise convolution for ACL v20.02 * Check operation status for ACL and ArmNN Execution Providers * Enabled fused operation for convolution-activation Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>	2020-10-22 09:29:44 -07:00
Ryan Lai	98538580c8	give more tolerance to DirectML runs (#5564 )	2020-10-21 23:14:51 -07:00
Tianlei Wu	1f304fbee7	Attention with past and no unidirectional mask (#5557 ) * Update fusions to support shared node, and mask of all ones	2020-10-21 20:12:02 -07:00
ashbhandare	0a9b83a313	Add zero test (#5476 )	2020-10-21 17:12:00 -07:00
Scott McKay	6d35be215f	Add `--skip_tests` to example command line as the included ops are being reduced. (#5554 )	2020-10-22 08:55:42 +10:00
RandySheriffH	d220c9f950	Resolve crash in MatMul optimization (#5551 ) * check pointer before referencing * add test case * switch to ASSERT_EQ	2020-10-21 13:18:19 -07:00
Changming Sun	5802fe1699	Remove MKLML build config (#5559)	2020-10-21 13:11:25 -07:00
Ryan Hill	82c7a9756e	Fix shared provider unload crash (#5553 )	2020-10-21 13:01:21 -07:00
Hariharan Seshadri	4291c57322	[C# and Python APIs] Expose knobs to enable/disable platform telemetry collection (#5481 )	2020-10-21 10:32:13 -07:00
Ashwini Khade	df22611026	Update ONNX commit (#5487 ) * update ONNX * update onnx + register kernels for reduction ops * bug fix kernel reg * update cgmanifests * revert unsqueeze op 13 registration * filter ops which are not implemented yet * filter some tests * update onnx commit to include conv transpose bug fix * update docker images * undo not required test changes * fix test failures	2020-10-21 07:22:20 -07:00
Vincent Wang	b48f596a91	GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel (#5511 ) * TopKGrad CPU kernel * use Scatter for GatherElementsGrad and TopKGrad. * rollback convgrad change. Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-10-21 09:29:29 +08:00
Yufeng Li	6c2162e97a	Fix quantization of Conv1D with bias (#5491 ) * Fix reshape for Conv with bias	2020-10-20 15:27:26 -07:00
Pranav Sharma	1038f9cc8b	Optimize GatherElements and Scaler. (#5543 ) * Optimize GatherElements and Scaler. * Address PR comments * Fix build	2020-10-20 10:36:20 -07:00
edgchen1	2f4fc83231	Add NVTX profiling range around kernel computation. (#5542 )	2020-10-20 09:58:58 -07:00
Tracy Sharpe	45483dcf1f	Add QLinearConv for activations=u8, weights=s8 (#5510 )	2020-10-20 08:45:13 -07:00


3627 commits