The following simple script shows how the runtime of matrix multiplication changes with the number of threads:

.. code-block:: python

    import timeit

    import torch

    runtimes = []
    threads = [1] + [t for t in range(2, 49, 2)]
    for t in threads:
        # Set the intra-op thread count used by the timed matrix multiplication.
        torch.set_num_threads(t)
        r = timeit.timeit(setup="import torch; x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)", stmt="torch.mm(x, y)", number=100)
        runtimes.append(r)
    # ... plotting (threads, runtimes) ...

Running the script on a system with 24 physical CPU cores (Xeon E5-2680, MKL and OpenMP based build) results in the following runtimes:

.. image:: cpu_threading_runtimes.svg
   :width: 75%

The following considerations should be taken into account when tuning the number of intra- and inter-op threads:

* When choosing the number of threads, avoid `oversubscription` (using too many threads, which degrades performance). For example, in an application that uses a large application thread pool or relies heavily on
  inter-op parallelism, disabling intra-op parallelism may be an option (i.e. by calling ``set_num_threads(1)``);
* In a typical application, there is a trade-off between `latency` (the time spent processing an inference request) and `throughput` (the amount of work done per unit of time). Tuning the number of threads can be a useful
  tool for shifting this trade-off in either direction. For example, in latency-critical applications one might want to increase the number of intra-op threads to process each request as fast as possible. At the same time, parallel implementations
  of ops may add extra overhead that increases the amount of work done per request and thus reduces the overall throughput.
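
As a rough illustration of the oversubscription point above, the sketch below splits the available cores evenly across application worker threads and uses the result as the intra-op thread count. The helper ``intra_op_threads_per_worker`` is hypothetical and not part of PyTorch, and an even split is only one possible heuristic. Note that ``os.cpu_count()`` reports logical CPUs, which may exceed the number of physical cores.

```python
import os

try:
    import torch  # optional: the arithmetic below runs without it
except ImportError:
    torch = None


def intra_op_threads_per_worker(num_workers, num_cores=None):
    """Split the available cores evenly across application workers.

    Hypothetical helper, not a PyTorch API. Always returns at least 1.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    if num_cores is None:
        # Logical CPU count; may be larger than the physical core count.
        num_cores = os.cpu_count() or 1
    return max(1, num_cores // num_workers)


# 24 cores shared by 8 request-handling threads: 3 intra-op threads each.
budget = intra_op_threads_per_worker(num_workers=8, num_cores=24)
print(budget)

# More workers than cores: fall back to one intra-op thread per worker,
# which matches the set_num_threads(1) suggestion above.
print(intra_op_threads_per_worker(num_workers=48, num_cores=24))

if torch is not None:
    torch.set_num_threads(budget)
```

A latency-critical service with few concurrent requests would pass a small ``num_workers`` and get a larger per-request budget. A throughput-oriented service with many workers ends up at or near one intra-op thread per worker.
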

.. warning::
    OpenMP does not guarantee that a single per-process intra-op thread
    pool is going to be used in the application. On the contrary, two different application or inter-op
    threads may use different OpenMP thread pools for intra-op work.
    This might result in a large number of threads used by the application.
    Extra care in tuning the number of threads is needed to avoid
    oversubscription in multi-threaded applications in the OpenMP case.

.. note::
    Pre-built PyTorch releases are compiled with OpenMP support.
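
To check which parallel backend a given installation uses, one option is to print the report returned by ``torch.__config__.parallel_info()``. The sketch below wraps that call in a hypothetical helper, ``describe_parallel_backend``, that returns ``None`` when PyTorch is not installed. The exact contents of the report depend on the build.

```python
def describe_parallel_backend():
    """Return PyTorch's parallel-backend report, or None if torch is absent.

    Hypothetical convenience wrapper around torch.__config__.parallel_info().
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.__config__.parallel_info()


info = describe_parallel_backend()
print(info if info is not None else "PyTorch is not installed")
```
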