
.. _cpu-threading-torchscript-inference:

CPU threading and TorchScript inference
=================================================

PyTorch allows using multiple CPU threads during TorchScript model inference.
The following figure shows different levels of parallelism one would find in a
typical application:

.. image:: cpu_threading_torchscript_inference.svg
   :width: 75%

One or more inference threads execute a model's forward pass on the given inputs.
Each inference thread invokes a JIT interpreter that executes the ops
of a model inline, one by one. A model can use the ``fork`` TorchScript
primitive to launch an asynchronous task. Forking several operations at once
results in tasks that are executed in parallel. The ``fork`` operator returns a
``Future`` object that can later be used to synchronize on the result, for example:

.. code-block:: python

    import torch

    @torch.jit.script
    def compute_z(x, w_z):
        return torch.mm(x, w_z)

    @torch.jit.script
    def forward(x, w_y, w_z):
        # launch compute_z asynchronously:
        fut = torch.jit.fork(compute_z, x, w_z)
        # execute the next operation in parallel to compute_z:
        y = torch.mm(x, w_y)
        # wait for the result of compute_z:
        z = torch.jit.wait(fut)
        return y + z
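
Several tasks can also be forked in a loop and joined afterwards. The following
is a minimal sketch rather than a definitive pattern: ``branch`` and
``forward_many`` are hypothetical names introduced for illustration, and the
sketch assumes a PyTorch release that exposes the public ``torch.jit.fork`` and
``torch.jit.wait`` functions:

.. code-block:: python

    from typing import List

    import torch

    @torch.jit.script
    def branch(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # hypothetical unit of work executed as one forked task
        return torch.mm(x, w)

    @torch.jit.script
    def forward_many(x: torch.Tensor, weights: List[torch.Tensor]) -> torch.Tensor:
        # launch one asynchronous task per weight matrix
        futures: List[torch.jit.Future[torch.Tensor]] = []
        for w in weights:
            futures.append(torch.jit.fork(branch, x, w))
        # wait for every task and collect the results
        results: List[torch.Tensor] = []
        for fut in futures:
            results.append(torch.jit.wait(fut))
        return torch.stack(results).sum(dim=0)

    x = torch.randn(4, 8)
    weights = [torch.randn(8, 3) for _ in range(3)]
    out = forward_many(x, weights)  # tensor of shape (4, 3)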


PyTorch uses a single thread pool for inter-op parallelism. This thread pool
is shared by all inference tasks that are forked within the application process.

In addition to the inter-op parallelism, PyTorch can also utilize multiple threads
within the ops (`intra-op parallelism`). This can be useful in many cases,
including element-wise ops on large tensors, convolutions, GEMMs, embedding
lookups and others.


Build options
-------------

PyTorch uses an internal ATen library to implement ops. In addition to that,
PyTorch can also be built with support for external libraries, such as MKL_ and MKL-DNN_,
to speed up computations on CPU.

ATen, MKL and MKL-DNN support intra-op parallelism and depend on the
following parallelization libraries to implement it:

* OpenMP_ - a standard (and a library, usually shipped with a compiler), widely used in external libraries;
* TBB_ - a newer parallelization library optimized for task-based parallelism and concurrent environments.

OpenMP has historically been used by a large number of libraries. It is known
for its relative ease of use and its support for loop-based parallelism and other primitives.

TBB is used to a lesser extent in external libraries but, at the same time,
is optimized for concurrent environments. PyTorch's TBB backend guarantees that
there is a separate, single, per-process intra-op thread pool used by all of the
ops running in the application.

Depending on the use case, one or the other parallelization library might be
the better choice for an application.

PyTorch allows selecting the parallelization backend used by ATen and other
libraries at build time with the following build options:

+------------+------------------------+-----------------------------+----------------------------------------+
| Library    | Build Option           | Values                      | Notes                                  |
+============+========================+=============================+========================================+
| ATen       | ``ATEN_THREADING``     | ``OMP`` (default), ``TBB``  |                                        |
+------------+------------------------+-----------------------------+----------------------------------------+
| MKL        | ``MKL_THREADING``      | (same)                      | To enable MKL use ``BLAS=MKL``         |
+------------+------------------------+-----------------------------+----------------------------------------+
| MKL-DNN    | ``MKLDNN_CPU_RUNTIME`` | (same)                      | To enable MKL-DNN use ``USE_MKLDNN=1`` |
+------------+------------------------+-----------------------------+----------------------------------------+

It is recommended not to mix OpenMP and TBB within one build.

Any of the ``TBB`` values above requires the ``USE_TBB=1`` build setting (default: OFF).
A separate setting, ``USE_OPENMP=1`` (default: ON), is required for OpenMP parallelism.
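
Whether a given build includes these libraries can be checked from Python. The
following is a small sketch, assuming a PyTorch release that provides the
``torch.backends.openmp``, ``torch.backends.mkl`` and ``torch.backends.mkldnn``
modules; the ``build_support`` dictionary is a name introduced here for illustration:

.. code-block:: python

    import torch

    # collect the availability flags reported by the installed build
    build_support = {
        "OpenMP": torch.backends.openmp.is_available(),
        "MKL": torch.backends.mkl.is_available(),
        "MKL-DNN": torch.backends.mkldnn.is_available(),
    }
    for name, available in build_support.items():
        print(f"{name}: {'available' if available else 'not available'}")

    # full build configuration string reported by the installed build
    print(torch.__config__.show())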

Runtime API
-----------

The following APIs are used to control thread settings:

+------------------------+-----------------------------------------------------------+---------------------------------------------------------+
| Type of parallelism    | Settings                                                  | Notes                                                   |
+========================+===========================================================+=========================================================+
| Inter-op parallelism   | ``at::set_num_interop_threads``,                          | Default number of threads: number of CPU cores.         |
|                        | ``at::get_num_interop_threads`` (C++)                     |                                                         |
|                        |                                                           |                                                         |
|                        | ``set_num_interop_threads``,                              |                                                         |
|                        | ``get_num_interop_threads`` (Python, :mod:`torch` module) |                                                         |
+------------------------+-----------------------------------------------------------+                                                         |
| Intra-op parallelism   | ``at::set_num_threads``,                                  |                                                         |
|                        | ``at::get_num_threads`` (C++)                             |                                                         |
|                        | ``set_num_threads``,                                      |                                                         |
|                        | ``get_num_threads`` (Python, :mod:`torch` module)         |                                                         |
|                        |                                                           |                                                         |
|                        | Environment variables:                                    |                                                         |
|                        | ``OMP_NUM_THREADS`` and ``MKL_NUM_THREADS``               |                                                         |
+------------------------+-----------------------------------------------------------+---------------------------------------------------------+

For the intra-op parallelism settings, ``at::set_num_threads`` and ``torch.set_num_threads`` always take precedence
over environment variables, and the ``MKL_NUM_THREADS`` variable takes precedence over ``OMP_NUM_THREADS``.
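
A minimal sketch of the Python side of this API follows. The thread counts are
illustrative values only. The ``try``/``except`` reflects an assumption worth
checking against the installed release: ``torch.set_num_interop_threads`` is
expected to raise a ``RuntimeError`` if it is called after inter-op work has
already started or after it has already been called once:

.. code-block:: python

    import torch

    # intra-op threads: used inside individual ops
    torch.set_num_threads(2)  # illustrative value

    # inter-op threads: used for forked tasks; assumed to be settable only
    # once and only before any inter-op parallel work has started
    try:
        torch.set_num_interop_threads(2)  # illustrative value
    except RuntimeError:
        # the inter-op pool was already configured or is already in use
        pass

    print("intra-op threads:", torch.get_num_threads())
    print("inter-op threads:", torch.get_num_interop_threads())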

Tuning the number of threads
----------------------------

The following simple script shows how the runtime of a matrix multiplication changes with the number of threads:

.. code-block:: python

    import timeit
    import torch

    runtimes = []
    threads = [1] + list(range(2, 49, 2))
    for t in threads:
        # set the number of intra-op threads used by the timed statement
        torch.set_num_threads(t)
        r = timeit.timeit(
            setup="import torch; x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)",
            stmt="torch.mm(x, y)",
            number=100,
        )
        runtimes.append(r)
    # ... plotting (threads, runtimes) ...

Running the script on a system with 24 physical CPU cores (Xeon E5-2680, MKL and OpenMP based build) results in the following runtimes:

.. image:: cpu_threading_runtimes.svg
   :width: 75%

The following considerations should be taken into account when tuning the number of intra- and inter-op threads:

* When choosing the number of threads, one needs to avoid `oversubscription` (using too many threads, which leads to performance degradation). For example, in an application that uses a large application thread pool or heavily relies on
  inter-op parallelism, disabling intra-op parallelism might be a viable option (i.e. by calling ``set_num_threads(1)``);

* In a typical application one might encounter a trade-off between `latency` (time spent on processing an inference request) and `throughput` (amount of work done per unit of time). Tuning the number of threads can be a useful
  tool to adjust this trade-off in one way or another. For example, in latency-critical applications one might want to increase the number of intra-op threads to process each request as fast as possible. At the same time, parallel implementations
  of ops may add extra overhead that increases the amount of work done per request and thus reduces the overall throughput.
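
As an illustration of the first point, the following sketch serves requests from
an application thread pool with intra-op parallelism disabled. ``handle_request``,
the pool size and the tensor sizes are hypothetical, and whether this configuration
is actually faster depends on the workload and the hardware:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    import torch

    # disable intra-op parallelism so that each application thread
    # runs its ops on a single thread
    torch.set_num_threads(1)

    def handle_request(x):
        # hypothetical per-request work: one matrix multiplication
        return torch.mm(x, x).sum().item()

    requests = [torch.ones(64, 64) for _ in range(8)]

    # the application-level pool provides the parallelism across requests
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(handle_request, requests))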

.. warning::
    OpenMP does not guarantee that a single per-process intra-op thread
    pool is going to be used in the application. On the contrary, two different application or inter-op
    threads may use different OpenMP thread pools for intra-op work.
    This might result in a large number of threads used by the application.
    Extra care in tuning the number of threads is needed to avoid
    oversubscription in multi-threaded applications in the OpenMP case.

.. note::
    Pre-built PyTorch releases are compiled with OpenMP support.

.. note::
    The ``parallel_info`` utility prints information about thread settings and can be used for debugging.
    Similar output can also be obtained in Python with a ``torch.__config__.parallel_info()`` call.
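
For example, a minimal sketch of the Python call (the exact output depends on the
build and on the environment, so none is shown here):

.. code-block:: python

    import torch

    # returns a string describing the current threading configuration
    info = torch.__config__.parallel_info()
    print(info)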

.. _OpenMP: https://www.openmp.org/
.. _TBB: https://github.com/intel/tbb
.. _MKL: https://software.intel.com/en-us/mkl
.. _MKL-DNN: https://github.com/intel/mkl-dnn