onnxruntime/include/onnxruntime/core
Tim Harris 2e09d9921a
"Sticky" allocation of worker threads (#7551)
[ PR previously merged as https://github.com//pull/7372, then reverted pending investigation of a lost-wake-up issue seen with ParallelExecutor. The issue was a missing check for new work pushed to a thread concurrently with that worker blocking. The change from 7372 is the addition of: https://github.com/microsoft/onnxruntime/blob/tiharr/dev-sticky-4/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h#L1473-L1492 ]
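
The lost-wake-up hazard mentioned in the note above can be illustrated generically. The sketch below is not the code from EigenNonBlockingThreadPool.h; `WorkerMailbox`, `Push`, and `PopBlocking` are hypothetical names, and it assumes a plain mutex and condition variable rather than the pool's own blocking protocol. It only shows the general rule that a worker must re-check for pending work before it commits to blocking:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical single-worker mailbox, for illustration only.
class WorkerMailbox {
 public:
  // Called by a submitting thread.
  void Push(int task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push_back(task);
    }
    cv_.notify_one();
  }

  // Called by the worker. Blocks until a task is available.
  int PopBlocking() {
    std::unique_lock<std::mutex> lock(mu_);
    // The predicate is evaluated before the first wait and after every
    // wake-up, so a task pushed just before the worker blocks is still seen.
    cv_.wait(lock, [this] { return !tasks_.empty(); });
    int task = tasks_.front();
    tasks_.pop_front();
    return task;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<int> tasks_;
};
```

Without the predicate (a bare `cv_.wait(lock)`), a notification that arrives before the worker starts waiting would be lost and the worker could sleep with work queued.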

Description: This change updates the heuristics used when a thread selects which worker threads to push work to on entering a parallel loop. Previously, worker threads would maintain a best-effort bitmap of "good worker hints" indicating the threads that were likely to be spinning waiting for work. This change uses a simpler heuristic where a thread records which workers ran its previous loop, and then re-submits its next loop to those same workers. The aim is to retain affinity between a thread and a set of workers, and to avoid maintaining the "good worker hints" bitmaps.
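
A minimal sketch of the "sticky" heuristic described above, under stated assumptions: `StickyHints` and `SelectWorkers` are hypothetical names, not identifiers from the ONNX Runtime thread pool, and the fallback of filling any shortfall with the lowest-numbered unused workers is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-caller state: the workers that ran this thread's
// previous parallel loop.
struct StickyHints {
  std::vector<std::size_t> preferred_workers;
};

// Choose `needed` distinct workers out of `num_workers`.
// Workers used for the previous loop are chosen first; any shortfall is
// filled with the lowest-numbered workers not yet chosen (an assumption).
inline std::vector<std::size_t> SelectWorkers(StickyHints& hints,
                                              std::size_t num_workers,
                                              std::size_t needed) {
  assert(needed <= num_workers);
  std::vector<bool> taken(num_workers, false);
  std::vector<std::size_t> chosen;
  chosen.reserve(needed);

  // First pass: re-use the workers remembered from the previous loop.
  for (std::size_t w : hints.preferred_workers) {
    if (chosen.size() == needed) break;
    if (w < num_workers && !taken[w]) {
      taken[w] = true;
      chosen.push_back(w);
    }
  }

  // Second pass: top up with unused workers if more are needed.
  for (std::size_t w = 0; w < num_workers && chosen.size() < needed; ++w) {
    if (!taken[w]) {
      taken[w] = true;
      chosen.push_back(w);
    }
  }

  // Remember this selection so the next loop goes to the same workers.
  hints.preferred_workers = chosen;
  return chosen;
}
```

Calling `SelectWorkers` twice with the same `StickyHints` returns the same workers both times, which is the affinity the description aims for, and no shared bitmap has to be kept up to date.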

Motivation and Context: Profiling suggested that maintaining the "good worker hints" was taking an unexpected amount of time, particularly on NUMA systems. In addition, when running many concurrent workloads, the hints did not provide a way to retain locality of workers, and hence of data in caches. Tested to confirm no regressions on the microbenchmark (./build/Linux/Release/onnxruntime_benchmark --benchmark_filter=BM_ThreadPoolParallelFor) and on mobilenet_v1_1.0_224.onnx on Linux, comparing p50 and p99 with vs without this change:

1 concurrent:
p50 0.0172s vs 0.0181s
p99 0.0204s vs 0.0216s

2 concurrent:
p50 0.0172s vs 0.0181s
p99 0.0213s vs 0.0221s
2021-05-03 18:28:13 +01:00
..
common Set CMAKE_CUDA_STANDARD to 14 because we are using std::make_unique (#7534) 2021-04-30 20:20:00 -07:00
eager kernel invoker api for eager mode (#7473) 2021-04-30 13:33:58 -07:00
framework Change onnxruntime::make_unique to std::make_unique (#7502) 2021-04-29 17:04:53 -07:00
graph Change onnxruntime::make_unique to std::make_unique (#7502) 2021-04-29 17:04:53 -07:00
optimizer Enable NHWC transformer when generating ORT format model (#7126) 2021-03-29 18:39:48 +10:00
platform "Sticky" allocation of worker threads (#7551) 2021-05-03 18:28:13 +01:00
providers rename cuda_mem_limit and hip_mem_limit to gpu_mem_limit for both CUDA EP and ROCm EP (#7226) 2021-04-05 09:04:04 -07:00
session Lifhuan/force trt sequential (#7440) 2021-04-28 13:59:37 -07:00