onnxruntime/docs/NotesOnThreading.md
Tim Harris 5e44d25c5a
Support multi-loop parallel sections, use multi-loop sections in GRU (#5602)
This PR updates the ThreadPool API to support multi-loop parallel sections. As with the OpenMP "parallel" construct, this allows per-loop work to be amortized over a series of loops. For ORT, it also promotes locality between successive loops in the sense that iteration X of one loop will tend to run on the same worker thread as iteration X of preceding loops.

The change was developed while optimizing the implementation of a model that performed better with OpenMP. Profiling indicated that OpenMP was providing lower loop entry/exit costs and that, via OpenMP's static scheduling, it was leading to a lower L2 miss rate in the series of parallel loops used in GRU.

The main changes are:

- Addition of ThreadPool::ParallelSection and underlying support in the modified Eigen thread pool.

- In EigenNonBlockingThreadPool.h, refactoring the RunInParallel method to support two variants: one that takes an existing parallel section object created by the caller, and another (used by default) that creates its own parallel section.

- Simplify ThreadPool::LoopCounter (used by worker threads to claim loop iterations), basing it on an ID supplied by the underlying Eigen thread pool for affinity in a series of loops.

- Fix a possible perf issue where a loop with iterations scheduled in batches would have more threads than batches available.

- Use of parallel sections in the GRU operator.

- Additional test cases in threadpool_test.h.

- Additional comments at the top of threadpool.h and EigenNonBlockingThreadPool.h.
2020-11-10 12:24:57 +00:00


# Notes on Threading in ORT
This document is intended for ORT developers.
ORT can execute on either OpenMP threads or its own non-OpenMP (ORT) threads. Threadpool management
is abstracted behind (1) the ThreadPool class in [threadpool.h](https://github.com/microsoft/onnxruntime/blob/master/include/onnxruntime/core/platform/threadpool.h) and (2) the functions in [thread_utils.h](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/util/thread_utils.h).
When developing an op, please use these abstractions to parallelize your code. They centralize two decisions:
when OpenMP is enabled, they use OpenMP; when OpenMP is disabled, they run sequentially if the threadpool pointer is NULL and schedule the tasks on the threadpool otherwise.
Examples of these abstractions (see [threadpool.h](https://github.com/microsoft/onnxruntime/blob/master/include/onnxruntime/core/platform/threadpool.h) for more documentation):
* TryParallelFor
* TrySimpleParallelFor
* TryBatchParallelFor
* ShouldParallelize
* DegreeOfParallelism

These static methods abstract over the different implementation choices. They can run over the ORT thread pool, or run over OpenMP, or run sequentially.
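
The dispatch these static methods perform can be pictured with a small self-contained sketch. It is not ORT code: `MiniPool`, `TryParallelForSketch`, and `ScaleSketch` are hypothetical names invented for illustration, the pool simply spawns helper threads per call, and the OpenMP branch is left out. See threadpool.h for the real signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Loop body over a half-open range [begin, end).
using RangeFn = std::function<void(std::ptrdiff_t begin, std::ptrdiff_t end)>;

// Hypothetical stand-in for a thread pool. This is NOT onnxruntime's
// ThreadPool; it spawns helper threads on every call to stay short.
class MiniPool {
 public:
  explicit MiniPool(int num_threads) : num_threads_(std::max(1, num_threads)) {}

  int NumThreads() const { return num_threads_; }

  // Splits [0, total) into contiguous blocks and runs fn on each block.
  // The calling thread runs block 0 itself, then joins the helpers.
  void ParallelFor(std::ptrdiff_t total, const RangeFn& fn) const {
    if (total <= 0) return;
    const std::ptrdiff_t blocks = std::min<std::ptrdiff_t>(num_threads_, total);
    const std::ptrdiff_t chunk = (total + blocks - 1) / blocks;
    std::vector<std::thread> helpers;
    for (std::ptrdiff_t b = 1; b < blocks; ++b) {
      const std::ptrdiff_t begin = b * chunk;
      const std::ptrdiff_t end = std::min(total, begin + chunk);
      if (begin < end) helpers.emplace_back([&fn, begin, end] { fn(begin, end); });
    }
    fn(0, std::min(total, chunk));
    for (auto& t : helpers) t.join();
  }

 private:
  int num_threads_;
};

// Hypothetical helper with the same shape of decision as the "Try..." methods:
// a null pool pointer means "run sequentially on the calling thread".
inline void TryParallelForSketch(const MiniPool* pool, std::ptrdiff_t total,
                                 const RangeFn& fn) {
  assert(fn && "loop body must be callable");
  if (total <= 0) return;
  if (pool == nullptr || pool->NumThreads() == 1) {
    fn(0, total);  // sequential fallback
    return;
  }
  pool->ParallelFor(total, fn);
}

// Operator-style code: no #ifdef, no knowledge of which backend runs the loop.
inline std::vector<float> ScaleSketch(const MiniPool* pool,
                                      const std::vector<float>& in, float factor) {
  std::vector<float> out(in.size());
  TryParallelForSketch(pool, static_cast<std::ptrdiff_t>(in.size()),
                       [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
                         for (std::ptrdiff_t i = begin; i < end; ++i) {
                           const auto idx = static_cast<std::size_t>(i);
                           out[idx] = in[idx] * factor;
                         }
                       });
  return out;
}
```

Calling `ScaleSketch(nullptr, in, 2.0f)` and `ScaleSketch(&pool, in, 2.0f)` produces the same output; only the execution strategy differs, which is the property the abstraction is meant to give operator authors.
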
In addition, ThreadPool::ParallelSection allows a series of loops to
be grouped together in a single parallel section. This allows an
operator to amortize loop entry/exit costs in cases where it is
impractical to refactor code into a single large loop.
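
The idea behind a parallel section can be sketched without ORT. The code below is an assumption-laden illustration, not the ThreadPool::ParallelSection implementation: `BarrierSketch`, `LoopSketch`, `RunLoopsInOneSectionSketch`, and `TwoLoopDemoSketch` are hypothetical names. One team of threads is started once, runs every loop in the series with the same static partitioning, and waits at a barrier between loops, so thread start-up and tear-down are paid once per section rather than once per loop.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using LoopBody = std::function<void(std::ptrdiff_t begin, std::ptrdiff_t end)>;

// One loop in the series: iteration count plus a body over [begin, end).
struct LoopSketch {
  std::ptrdiff_t total;
  LoopBody body;
};

// Minimal reusable barrier (C++17-compatible), hypothetical helper.
class BarrierSketch {
 public:
  explicit BarrierSketch(int n) : n_(n) {}
  void Wait() {
    std::unique_lock<std::mutex> lock(m_);
    const std::size_t gen = generation_;
    if (++arrived_ == n_) {
      arrived_ = 0;
      ++generation_;  // release everyone waiting on this generation
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return generation_ != gen; });
    }
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  int n_;
  int arrived_ = 0;
  std::size_t generation_ = 0;
};

// Starts the worker team once, then runs all loops inside that one "section".
// Worker `id` always takes the same block of the iteration space, so
// iteration X of each loop runs on the same thread.
inline void RunLoopsInOneSectionSketch(int num_workers,
                                       const std::vector<LoopSketch>& loops) {
  num_workers = std::max(1, num_workers);
  BarrierSketch barrier(num_workers);
  auto worker = [&](int id) {
    for (const LoopSketch& loop : loops) {
      assert(loop.body && "loop body must be callable");
      const std::ptrdiff_t chunk = (loop.total + num_workers - 1) / num_workers;
      const std::ptrdiff_t begin = std::min(loop.total, id * chunk);
      const std::ptrdiff_t end = std::min(loop.total, begin + chunk);
      if (begin < end) loop.body(begin, end);
      barrier.Wait();  // every loop finishes before the next one starts
    }
  };
  std::vector<std::thread> team;
  for (int id = 1; id < num_workers; ++id) team.emplace_back(worker, id);
  worker(0);  // the calling thread participates as worker 0
  for (auto& t : team) t.join();
}

// Demo: the second loop reads what the first loop wrote at the same index.
inline std::vector<int> TwoLoopDemoSketch(int num_workers, std::ptrdiff_t n) {
  std::vector<int> a(static_cast<std::size_t>(n), 0);
  std::vector<int> b(static_cast<std::size_t>(n), 0);
  std::vector<LoopSketch> loops;
  loops.push_back(LoopSketch{n, [&a](std::ptrdiff_t begin, std::ptrdiff_t end) {
    for (std::ptrdiff_t i = begin; i < end; ++i)
      a[static_cast<std::size_t>(i)] = static_cast<int>(i) * 2;
  }});
  loops.push_back(LoopSketch{n, [&a, &b](std::ptrdiff_t begin, std::ptrdiff_t end) {
    for (std::ptrdiff_t i = begin; i < end; ++i)
      b[static_cast<std::size_t>(i)] = a[static_cast<std::size_t>(i)] + 1;
  }});
  RunLoopsInOneSectionSketch(num_workers, loops);
  return b;
}
```

The static partitioning here is a simplification chosen to make the locality argument visible; how ORT actually assigns iterations to workers inside a section is documented in the comments at the top of threadpool.h and EigenNonBlockingThreadPool.h.
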

**Please do not write OpenMP-specific `#ifdef` or `#pragma omp` code in operators**.

For intra-op parallelism, ORT users can use either OpenMP or the ORT threadpool. OpenMP is selected by building ORT with the ```--use_openmp``` switch. For inter-op parallelism, however, ORT always uses its own threadpool.