onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-29 03:30:52 +00:00

Author	SHA1	Message	Date
Tim Harris	2e09d9921a	"Sticky" allocation of worker threads (#7551 ) [ PR previously merged as https://github.com//pull/7372, then reverted pending investigation of lost-wake-up issue seen with ParallelExecutor. Issue was a missing test for new work pushed to thread concurrent with a worker blocking. Change from 7372 is the addition of: https://github.com/microsoft/onnxruntime/blob/tiharr/dev-sticky-4/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h#L1473-L1492 ] Description: This change updates the heuristics used when a thread selects which worker threads to push work to on entering a parallel loop. Previously, worker threads would maintain a best-effort bitmap of "good worker hints" indicating the threads that were likely to be spinning waiting for work. This change uses a simpler heuristic where a thread records which workers ran its previous loop, and then re-submits its next loop to those same workers. The aim is to retain affinity between a thread and a set of workers, and to avoid maintaining the "good worker hints" bitmaps. Motivation and Context: Profiling suggested that maintaining the "good worker hints" was taking unexpected time, particularly on NUMA systems. In addition, when running many concurrent workloads, the hints did not provide a way to help retain locality of workers and hence data in caches. Testing to confirm no regressions on microbenchmark (./build/Linux/Release/onnxruntime_benchmark --benchmark_filter=BM_ThreadPoolParallelFor) and on Linux mobilenet_v1_1.0_224.onnx, comparing p50 and p99 with vs without this change: 1 concurrent: p50 0.0172s vs 0.0181s p99 0.0204s vs 0.0216s 2 concurrent: p50 0.0172s vs 0.0181s p99 0.0213s vs 0.0221s	2021-05-03 18:28:13 +01:00
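The "sticky" heuristic described in 2e09d9921a can be illustrated with a small single-threaded sketch. This is not the ORT implementation: `StickyHints` and `PickWorkers` are hypothetical names, and the real thread pool also handles concurrency and wake-ups. The sketch shows only the selection rule, which is to reuse the workers that ran the previous loop and fill any remaining slots from the rest of the pool.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-caller state: the workers that ran this caller's previous loop.
struct StickyHints {
  std::vector<unsigned> preferred;
};

// Pick up to `wanted` distinct workers from a pool of `pool_size`.
// Previously used workers are chosen first; remaining slots are filled in index order.
std::vector<unsigned> PickWorkers(StickyHints& hints, unsigned pool_size, std::size_t wanted) {
  std::vector<unsigned> picked;
  std::vector<bool> used(pool_size, false);
  for (unsigned w : hints.preferred) {
    if (picked.size() == wanted) break;
    if (w < pool_size && !used[w]) {  // skip stale hints if the pool shrank
      used[w] = true;
      picked.push_back(w);
    }
  }
  for (unsigned w = 0; w < pool_size && picked.size() < wanted; ++w) {
    if (!used[w]) {
      used[w] = true;
      picked.push_back(w);
    }
  }
  hints.preferred = picked;  // remember the choice for the next loop
  return picked;
}
```

Calling `PickWorkers` repeatedly with the same `hints` object returns the same workers as long as the pool size and requested count do not change, which is the affinity property the commit message aims for.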
Tim Harris	9c1900866a	Revert ""Sticky" allocation of worker threads (#7372 )" This reverts commit `3d92723d1c`.	2021-04-30 14:39:58 -07:00
Tim Harris	3d92723d1c	"Sticky" allocation of worker threads (#7372 ) * Sticky thread allocation * Test sticky thread assignment * Test sticky thread assignment * Test sticky thread assignment * Expose control over additional worker assignment stats * Sticky thread allocation * Test sticky thread assignment * Test sticky thread assignment * Test sticky thread assignment * Expose control over additional worker assignment stats * Merge * Merge * Merge * Fix Windows build * Fix windows build 2 * Build Python 3.8 Windows CPU only * Add env var to override binding * Build Python 3.8 Windows CPU only * Fix windows build * Remove thread affinity override * Remove goodworker * Remove Python build settings * Remove unneeded changes * Remove unneeded changes * Remove unneeded changes * Remove unneeded changes * Remove unneeded changes * Remove unneeded changes * Tidy * Tidy * Avoid race on preferred_worker vector * Improve assertions * Improve assertions * Enum for PushBackWithTag result * Remove unused field * Update comments * Extra debugging * Extra debugging * Extra debugging * Support varying thread pool sizes * Improve comments * Remove requirement for thread local to be trivially destructible * Use unsigned consistently for thread counts, removing casting * Remove debug code * Fix webassembly build * Merge * Merge * Merge * Remove unused code * Fix build * Extra test case for varying loop sizes * Clean variable names * Clean variable names * Clean variable names * Remove unneeded include, fix build * Fix profiling * Update from review comments	2021-04-29 20:42:14 -07:00
Changming Sun	1012535dab	Change onnxruntime::make_unique to std::make_unique (#7502 ) 1. Change onnxruntime::make_unique to std::make_unique 2. Add "-std=c++14" to ROCM EP's build flags.	2021-04-29 17:04:53 -07:00
RandySheriffH	40568d8821	Wait for dispatch done in RunParallelSection to fix random TP UT crash (#7443 ) * wait for dispatch done in RunParallelSection * pass worker_fn by value * cancel move * only move work_fn when it is lastly referred Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2021-04-26 14:12:10 -07:00
Changming Sun	b5592856a7	Remove thread pool's cancel method and suppress some warnings (#7411 )	2021-04-26 09:33:48 -07:00
RandySheriffH	afe912d47c	Reduce perf gap between thread pool and omp (#7333 ) * add async dispatch * minor renamings * build py38 * restore yml * fix sync up issue between dispatch thread and main * fix comments * refactor SummonWorker and rename to RunInParallelInternal	2021-04-23 18:36:36 -07:00
RandySheriffH	865c67611c	Exclude profiler from minimal build (#7115 ) * Exclude TP profiler from minimum build * fix typo * remove Clock * fix comments Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2021-03-25 09:06:14 -07:00
RandySheriffH	529da3b003	Thread pool profiler (#6748 ) * add profiler * add thread id * refactoring * switch to vector * add override keyword * fix comments * renaming * add revoke time * restore statics * restore enable flag * fix end error * fix comments * add comment * add comments * make profiler thread-safe * switch to shared_lock * switch to shared_timed_mutex * switch to OrtMutex * add per child thread counters * switch to vector * refactor LogCore * fix comments * cancel spin and block counter to reduce overhead * fix minor format issue Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2021-03-22 10:49:57 -07:00
Tim Harris	b491d7c179	Avoid false sharing on thread pool data structures (#6298 ) Description: This change adds alignment and padding to avoid false sharing on fields in the thread pool. It also adds a new microbenchmark to profile thread-pool performance over short loops. Motivation and Context: MobileNet on a 2*12-core system showed a performance gap between the ORT thread pool and OpenMP. One cause appeared to be false sharing on fields in the thread pool: ThreadPoolParallelSection::tasks_finished (which the main thread spins on waiting for workers to complete a loop), and the RunQueue::front_ and back_ fields (used respectively by the worker thread and the main thread). The additional micro-benchmark BM_ThreadPoolSimpleParallelFor tests performance of loops of different sizes at different thread counts. The results below are on a machine with 2*14-core processors (E5-2690 v4) running with 1, 14, 15, and 28 threads. For each test, the microbenchmark has N threads run a loop with N iterations; hence a perfect result is for the time taken to be constant as additional threads are added (although we will also see power management effects helping at very low thread counts). The loop durations (100000, 10000, 1000) correspond roughly to 200us, 20us, and 2us on this machine. 
Before change:
BM_ThreadPoolSimpleParallelFor/1/1/100000/real_time 17153 us 17154 us 32
BM_ThreadPoolSimpleParallelFor/14/14/100000/real_time 22553 us 22553 us 30
BM_ThreadPoolSimpleParallelFor/15/15/100000/real_time 21521 us 21521 us 29
BM_ThreadPoolSimpleParallelFor/28/28/100000/real_time 24111 us 24111 us 24
BM_ThreadPoolSimpleParallelFor/1/1/10000/real_time 1719 us 1719 us 407
BM_ThreadPoolSimpleParallelFor/14/14/10000/real_time 3409 us 3409 us 200
BM_ThreadPoolSimpleParallelFor/15/15/10000/real_time 3541 us 3541 us 201
BM_ThreadPoolSimpleParallelFor/28/28/10000/real_time 4576 us 4576 us 151
BM_ThreadPoolSimpleParallelFor/1/1/1000/real_time 174 us 174 us 4017
BM_ThreadPoolSimpleParallelFor/14/14/1000/real_time 1586 us 1586 us 402
BM_ThreadPoolSimpleParallelFor/15/15/1000/real_time 1586 us 1586 us 397
BM_ThreadPoolSimpleParallelFor/28/28/1000/real_time 2864 us 2864 us 232
After change:
BM_ThreadPoolSimpleParallelFor/1/1/100000/real_time 17160 us 17160 us 33
BM_ThreadPoolSimpleParallelFor/14/14/100000/real_time 20989 us 20989 us 31
BM_ThreadPoolSimpleParallelFor/15/15/100000/real_time 22286 us 22286 us 31
BM_ThreadPoolSimpleParallelFor/28/28/100000/real_time 24631 us 24631 us 25
BM_ThreadPoolSimpleParallelFor/1/1/10000/real_time 1718 us 1718 us 407
BM_ThreadPoolSimpleParallelFor/14/14/10000/real_time 2868 us 2868 us 242
BM_ThreadPoolSimpleParallelFor/15/15/10000/real_time 2907 us 2907 us 240
BM_ThreadPoolSimpleParallelFor/28/28/10000/real_time 3872 us 3872 us 186
BM_ThreadPoolSimpleParallelFor/1/1/1000/real_time 175 us 175 us 3938
BM_ThreadPoolSimpleParallelFor/14/14/1000/real_time 933 us 933 us 659
BM_ThreadPoolSimpleParallelFor/15/15/1000/real_time 912 us 912 us 591
BM_ThreadPoolSimpleParallelFor/28/28/1000/real_time 1976 us 1976 us 317	2021-01-12 19:58:41 +00:00
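The alignment-and-padding idea from b491d7c179 can be sketched as follows. The struct names are hypothetical and only loosely modeled on the `front_` and `back_` fields mentioned in the commit message; the 64-byte cache-line size is an assumption that holds on typical x86-64 parts but not everywhere.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache-line size (typical for x86-64, not universal).
constexpr std::size_t kCacheLine = 64;

// Both indices share one cache line, so a writer of `front` invalidates
// the line that a different thread is using for `back`.
struct UnpaddedIndices {
  std::atomic<unsigned> front{0};
  std::atomic<unsigned> back{0};
};

// Each index is aligned to its own cache line, so the two threads
// no longer contend for the same line.
struct PaddedIndices {
  alignas(kCacheLine) std::atomic<unsigned> front{0};
  alignas(kCacheLine) std::atomic<unsigned> back{0};
};

static_assert(alignof(PaddedIndices) == kCacheLine, "struct must be cache-line aligned");
static_assert(sizeof(PaddedIndices) == 2 * kCacheLine, "each field occupies its own line");
```

The cost is memory: the padded struct is 128 bytes instead of 8, which is usually acceptable for a handful of hot fields per queue.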
Tim Harris	48b14b52b8	Remove Env::Task wrapper around std::function (#5753 ) This is a small perf / clean-up change. It removes the Env::Task abstraction which wraps a single std::function field, and adds at least one virtual method call overhead when creating a Task and when executing it. The POSIX and Windows implementations are now identical.	2020-11-10 20:22:07 +00:00
Tim Harris	5e44d25c5a	Support multi-loop parallel sections, use multi-loop sections in GRU (#5602 ) This PR updates the ThreadPool API to support multi-loop parallel sections. As with the OpenMP "parallel" construct, this allows per-loop work to be amortized over a series of loops. For ORT, it also promotes locality between successive loops in the sense that iteration X of one loop will tend to run on the same worker thread as iteration X of preceding loops. The change was developed while optimizing the implementation of a model that performed better with OpenMP. Profiling indicated that OpenMP was providing lower loop entry/exit costs and that, via OpenMP's static scheduling, it was leading to a lower L2 miss rate in the series of parallel loops used in GRU. The main changes are: - Addition of ThreadPool::ParallelSection and underlying support in the modified Eigen thread pool. - In EigenNonBlockingThreadPool.h, refactoring the RunInParallel method to support two variants: one that takes an existing parallel section object created by the caller, and another (used by default) that creates its own parallel section. - Simplify ThreadPool::LoopCounter (used by worker threads to claim loop iterations), basing it on an ID supplied by the underlying Eigen thread pool for affinity in a series of loops. - Fix a possible perf issue where a loop with iterations scheduled in batches would have more threads than batches available. - Use of parallel sections in the GRU operator. - Additional test cases in threadpool_test.h. - Additional comments at the top of threadpool.h and EigenNonBlockingThreadPool.h.	2020-11-10 12:24:57 +00:00
Tim Harris	5e8952ef89	ThreadPool clean up : mm_pause in loops, correctly spin-then-wait, and adopt static methods consistently in the API (#5590 ) Description: This change makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses mm_pause intrinsics in spin loops, helping avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than to start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases. Motivation and Context The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.	2020-10-28 09:49:18 +00:00
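The spin-then-wait pattern mentioned in 5e8952ef89 can be sketched with a small flag type. `SpinThenBlockFlag` and `SpinPause` are hypothetical names, not ORT code; the sketch assumes `_mm_pause` is only used on x86-64 and falls back to a no-op elsewhere. A waiter polls the flag for a bounded number of iterations, pausing between polls, and only then blocks on a condition variable.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
inline void SpinPause() { _mm_pause(); }  // hint to the CPU that this is a spin loop
#else
inline void SpinPause() {}  // no-op fallback on other architectures
#endif

class SpinThenBlockFlag {
 public:
  void Set() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      ready_.store(true, std::memory_order_release);
    }
    cv_.notify_all();
  }

  // Returns true if the flag was observed while spinning,
  // false if the caller had to take the blocking path.
  bool Wait(int spin_limit) {
    for (int i = 0; i < spin_limit; ++i) {
      if (ready_.load(std::memory_order_acquire)) return true;
      SpinPause();
    }
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return ready_.load(std::memory_order_acquire); });
    return false;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<bool> ready_{false};
};
```

Setting the flag under the mutex before notifying avoids a lost wake-up for a waiter that has finished spinning but not yet blocked.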
Sunghoon	645d978589	Sunghcho/denormals (#5391 ) * Add session option and global thread pool option to set denormal as zero. * Revert unnecessary changes. * Add cpuinfo submodule * Add more comments * Remove cpuinfo submodule dependency and check only SSE3 support for ftz and daz inspired by Tensorflow * Preserve API order in C api * Clean up and utilize SSE3 detection logic from existing cpuid_info.h * Keep the same order with header file * Fix build issue with Linux pipeline, which has old g++ compiler * Fix broken build on Linux and remove a duplicated unit test * Remove reformatting at eigen thread pool * Remove flatbuffers which is not intentionally added * Revert "Remove flatbuffers which is not intentionally added" This reverts commit 9f509a9aaaa3c7832d88854c82fd26b234770b7f. * Remove flatbuffers which is not intentionally added * Resolve comments - Put details on APIs - Add a log for ftz/daz initialization - Add clang - Fix typo * Remove unnecessary header include * Resolve comments	2020-10-15 12:47:42 -07:00
Tim Harris	9cec98ec1b	Honor allow_spinning at barrier at end of parallel sections (#4767 ) This commit means that when the thread pool is configured to spin, then we spin at the barrier at the end of parallel sections in the main thread, in addition to having workers spin waiting for work. The change updates Barrier.h to take an additional boolean to select spin/block, and passes this in based on the thread pool configuration. It adds an additional test case for barriers, although no problems were identified by the test case.	2020-08-13 09:40:40 +01:00
Tim Harris	4bd9e8d05c	Stress-test and fix thread pool when work queues are full (#4690 ) While investigating an unrelated issue, I noticed that the thread pool may drop tasks when a burst of 1024+ tasks is submitted by a thread from inside the pool. Today, in general, we execute work synchronously in this case. However, there is a bug where work submitted by a thread already inside the pool will be discarded instead of executed. Currently the only scenario where I can see this occurring is when the parallel executor is used with a model in which such a large number of nodes become eligible to run all at once. This PR fixes the underlying issue and adds a test case for burst-submission of work.	2020-08-04 10:19:49 +01:00
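The behavior that 4bd9e8d05c restores, running work synchronously rather than dropping it when a queue is full, can be sketched in a single-threaded form. `BoundedTaskQueue` is a hypothetical class for illustration only; the real thread pool uses per-worker lock-free queues, and the capacity here is a constructor parameter rather than a fixed value.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical bounded queue: when full, the submitter runs the task itself
// instead of discarding it.
class BoundedTaskQueue {
 public:
  explicit BoundedTaskQueue(std::size_t capacity) : capacity_(capacity) {}

  // Returns true if the task was queued, false if it was run inline.
  bool SubmitOrRun(std::function<void()> task) {
    if (tasks_.size() < capacity_) {
      tasks_.push_back(std::move(task));
      return true;
    }
    task();  // queue full: execute synchronously on the submitting thread
    return false;
  }

  // Run all queued tasks in FIFO order.
  void Drain() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::size_t capacity_;
  std::deque<std::function<void()>> tasks_;
};
```

With this rule every submitted task runs exactly once, whether through the queue or inline, which is the invariant a burst-submission test would check.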
Tim Harris	3fc68cb150	Remove non-trivially-destructible thread-local from thread pool state, blocking ARM64 builds (#4336 ) - Move thread hint vectors from thread-local struct - Add static_assert that the per-thread state in the thread pool is trivially-destructible - Rename "thread_data" to "worker_data" (only allocated for workers in the pool, not threads calling into the pool)	2020-06-25 19:04:31 +01:00
Tim Harris	9e3b5c62fb	Use OpenMP-like synchronization patterns in Eigen thread pool (#4236 ) Updates the thread pool implementation to make work distribution over the Eigen thread pool more closely resemble techniques used in OpenMP. In particular: (1) A thread entering a parallel loop works on the iterations itself, rather than requiring a thread switch to/from a thread in the pool, if called from outside the thread pool. (2) To support this, work items pushed to the thread pool run a loop to claim iterations from a shared counter via atomic-fetch-and-add, as opposed to having work items themselves represent individual batches of iterations. This means that any thread working on the loop can execute any batch of iterations, including having the main thread run through all of the batches itself if the loop turns out to be short-running. (3) As with OpenMP active scheduling, the worker loop spins waiting for work prior to blocking. This avoids OS blocking / wake-up paths in workloads with series of short-running parallel sections.	2020-06-22 10:04:53 +01:00
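Point (2) of 9e3b5c62fb, claiming batches of iterations from a shared counter with atomic fetch-and-add, can be sketched with standard C++ threads. `ParallelForSketch` is a hypothetical helper, not the ORT API; it spawns fresh threads for simplicity where a real pool would reuse workers. As in point (1), the calling thread takes part in the loop itself.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs fn(i) once for every i in [0, total). The caller and `num_helpers`
// helper threads all claim batches from the same shared counter.
void ParallelForSketch(std::size_t total, std::size_t batch, unsigned num_helpers,
                       const std::function<void(std::size_t)>& fn) {
  if (batch == 0) batch = 1;  // avoid claiming empty batches forever
  std::atomic<std::size_t> next{0};

  auto claim_and_run = [&] {
    for (;;) {
      // Atomically reserve the next batch; each batch is claimed by exactly one thread.
      std::size_t begin = next.fetch_add(batch, std::memory_order_relaxed);
      if (begin >= total) return;
      std::size_t end = std::min(begin + batch, total);
      for (std::size_t i = begin; i < end; ++i) fn(i);
    }
  };

  std::vector<std::thread> helpers;
  for (unsigned t = 0; t < num_helpers; ++t) helpers.emplace_back(claim_and_run);
  claim_and_run();  // the calling thread works on the loop too
  for (std::thread& h : helpers) h.join();
}
```

Because any thread can claim any batch, a short loop may be finished entirely by the caller before the helpers start, which matches the commit's observation about short-running loops.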
Changming Sun	bd78364411	Parallel all the activations ops (#3722 ) 1. Parallelize all the activation ops. 2. Parallelize the performance-critical path of the LRN op, which makes the ONNX model zoo googlenet model run 60% faster (latency reduced from 21ms to 13ms). 3. Make the Gemm-Activation fusion support all the activation ops. Before this change, it only supported LeakyRelu/Relu/Sigmoid/Tanh. 4. Delete onnxruntime/test/framework/op_kernel_test.cc because the file is almost empty. 5. Remove the logging in KernelRegistry::TryFindKernel and return a Status with an error message instead.	2020-05-05 01:18:17 -07:00
Changming Sun	06fc9506fd	Thread pool changes (#3153 ) 1. Copy tensorflow's thread pool class to ORT, so that we can get a better implementation of thread pool based parallelfor 2. Copy Eigen's thread pool class to ORT 3. Support thread affinity 4. Remove RNN kernel’s private thread pool 5. Modify pool kernels to use the thread pool when openmp is disabled.	2020-03-30 12:18:40 -07:00

20 commits