diff --git a/docs/source/notes/cpu_threading_torchscript_inference.rst b/docs/source/notes/cpu_threading_torchscript_inference.rst
new file mode 100644
index 00000000000..600f6813770
--- /dev/null
+++ b/docs/source/notes/cpu_threading_torchscript_inference.rst
@@ -0,0 +1,124 @@
+.. _cpu-threading-torchscript-inference:
+
+CPU threading and TorchScript inference
+=================================================
+
+PyTorch allows using multiple CPU threads during TorchScript model inference.
+The following figure shows the different levels of parallelism one would find in a
+typical application:
+
+.. image:: cpu_threading_torchscript_inference.svg
+   :width: 75%
+
+One or more inference threads execute a model's forward pass on the given inputs.
+Each inference thread invokes a JIT interpreter that executes the ops
+of a model inline, one by one. A model can use the ``fork`` TorchScript
+primitive to launch an asynchronous task. Forking several operations at once
+results in tasks that are executed in parallel. The ``fork`` operator returns a
+``Future`` object that can be used to synchronize on later, for example:
+
+.. code-block:: python
+
+    @torch.jit.script
+    def compute_z(x, w_z):
+        return torch.mm(x, w_z)
+
+    @torch.jit.script
+    def forward(x, w_y, w_z):
+        # launch compute_z asynchronously:
+        fut = torch.jit._fork(compute_z, x, w_z)
+        # execute the next operation in parallel to compute_z:
+        y = torch.mm(x, w_y)
+        # wait for the result of compute_z:
+        z = torch.jit._wait(fut)
+        return y + z
+
+PyTorch uses a single thread pool for inter-op parallelism; this thread pool
+is shared by all inference tasks that are forked within the application process.
+
+In addition to inter-op parallelism, PyTorch can also utilize multiple threads
+within ops (`intra-op parallelism`). This can be useful in many cases,
+including element-wise ops on large tensors, convolutions, GEMMs, embedding
+lookups and others.
+
+
+Build options
+-------------
+
+PyTorch uses an internal ATen library to implement ops. In addition to that,
+PyTorch can also be built with support for external libraries, such as MKL_ and MKL-DNN_,
+to speed up computations on CPU.
+
+ATen, MKL and MKL-DNN support intra-op parallelism and depend on the
+following parallelization libraries to implement it:
+
+ * OpenMP_ - a standard (and a library, usually shipped with a compiler), widely used in external libraries;
+ * TBB_ - a newer parallelization library optimized for task-based parallelism and concurrent environments.
+
+OpenMP has historically been used by a large number of libraries. It is known
+for its relative ease of use and its support for loop-based parallelism and other primitives.
+At the same time, OpenMP is not known for good interoperability with other threading
+libraries used by the application. In particular, OpenMP does not guarantee that a single per-process intra-op thread
+pool is going to be used in the application. On the contrary, two different inter-op
+threads will likely use different OpenMP thread pools for intra-op work.
+This might result in a large number of threads used by the application.
+
+TBB is used to a lesser extent in external libraries but is optimized for
+concurrent environments. PyTorch's TBB backend guarantees that
+there's a separate, single, per-process intra-op thread pool used by all of the
+ops running in the application.
+
+Depending on the use case, one might find one or the other parallelization
+library a better choice in their application.
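+As an illustrative sketch (not a tested recipe), a source build that uses TBB
+throughout could be configured with the build options listed in the next section:

```shell
# Hypothetical source-build invocation selecting TBB for all backends;
# adjust paths and options to your environment:
export USE_TBB=1 USE_OPENMP=0
export ATEN_THREADING=TBB MKL_THREADING=TBB MKLDNN_THREADING=TBB
export BLAS=MKL USE_MKLDNN=1
python setup.py install
```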
+
+PyTorch allows selecting the parallelization backend used by ATen and other
+libraries at build time with the following build options:
+
++------------+-----------------------+-----------------------------+----------------------------------------+
+| Library    | Build Option          | Values                      | Notes                                  |
++============+=======================+=============================+========================================+
+| ATen       | ``ATEN_THREADING``    | ``OMP`` (default), ``TBB``  |                                        |
++------------+-----------------------+-----------------------------+----------------------------------------+
+| MKL        | ``MKL_THREADING``     | (same)                      | To enable MKL use ``BLAS=MKL``         |
++------------+-----------------------+-----------------------------+----------------------------------------+
+| MKL-DNN    | ``MKLDNN_THREADING``  | (same)                      | To enable MKL-DNN use ``USE_MKLDNN=1`` |
++------------+-----------------------+-----------------------------+----------------------------------------+
+
+It is strongly recommended not to mix OpenMP and TBB within one build.
+
+Any of the ``TBB`` values above requires the ``USE_TBB=1`` build setting (default: OFF).
+A separate setting, ``USE_OPENMP=1`` (default: ON), is required for OpenMP parallelism.
+
+Runtime API
+-----------
+
+The following API is used to control thread settings:
+
++------------------------+-----------------------------------------------------------+---------------------------------------------------------+
+| Type of parallelism    | Settings                                                  | Notes                                                   |
++========================+===========================================================+=========================================================+
+| Inter-op parallelism   | ``at::set_num_interop_threads``,                          | ``set*`` functions can only be called once,             |
+|                        | ``at::get_num_interop_threads`` (C++)                     | during startup, before any operators run;               |
+|                        |                                                           |                                                         |
+|                        | ``set_num_interop_threads``,                              | Default number of threads: number of CPU cores.         |
+|                        | ``get_num_interop_threads`` (Python, :mod:`torch` module) |                                                         |
++------------------------+-----------------------------------------------------------+                                                         |
+| Intra-op parallelism   | ``at::set_num_threads``,                                  |                                                         |
+|                        | ``at::get_num_threads`` (C++)                             |                                                         |
+|                        |                                                           |                                                         |
+|                        | ``set_num_threads``,                                      |                                                         |
+|                        | ``get_num_threads`` (Python, :mod:`torch` module)         |                                                         |
+|                        |                                                           |                                                         |
+|                        | Environment variables:                                    |                                                         |
+|                        | ``OMP_NUM_THREADS`` and ``MKL_NUM_THREADS``               |                                                         |
++------------------------+-----------------------------------------------------------+---------------------------------------------------------+
+
+For the intra-op parallelism settings, ``at::set_num_threads`` and ``torch.set_num_threads`` always take precedence
+over the environment variables, and the ``MKL_NUM_THREADS`` variable takes precedence over ``OMP_NUM_THREADS``.
+
+.. note::
+
+    The ``parallel_info`` utility prints information about thread settings and can be used for debugging.
+    Similar output can also be obtained in Python with a ``torch.__config__.parallel_info()`` call.
+
+.. _OpenMP: https://www.openmp.org/
+.. _TBB: https://github.com/intel/tbb
+.. _MKL: https://software.intel.com/en-us/mkl
+.. _MKL-DNN: https://github.com/intel/mkl-dnn
diff --git a/docs/source/notes/cpu_threading_torchscript_inference.svg b/docs/source/notes/cpu_threading_torchscript_inference.svg
new file mode 100644
index 00000000000..67f8ec884a3
--- /dev/null
+++ b/docs/source/notes/cpu_threading_torchscript_inference.svg
@@ -0,0 +1,681 @@
+(SVG markup omitted: diagram of an inference thread forking ops onto the application thread pool, illustrating inter-op vs. intra-op parallelism, with ATen/Parallel (e.g. at::parallel_for), MKL and MKL-DNN layered on top of OpenMP or TBB)