[DOCS][CUDA] Update TF32 docs for sm90 (#111337)

For #110252.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111337
Approved by: https://github.com/msaroufim
eqy 2023-10-19 09:36:09 +00:00 committed by PyTorch MergeBot
parent 503f44fbb8
commit 894b9957c8
2 changed files with 15 additions and 11 deletions


@@ -56,13 +56,13 @@ Below you can find a small example showcasing this::
.. _tf32_on_ampere:
TensorFloat-32(TF32) on Ampere devices
--------------------------------------
TensorFloat-32 (TF32) on Ampere (and later) devices
---------------------------------------------------
Starting in PyTorch 1.7, there is a new flag called `allow_tf32`. This flag
defaults to True in PyTorch 1.7 to PyTorch 1.11, and False in PyTorch 1.12 and later.
This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores,
available on new NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies
available on NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies
and batched matrix multiplies) and convolutions.
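As a quick orientation, the two backend flags discussed here can be inspected directly. This is a minimal sketch (not part of the original snippet), and the printed defaults assume PyTorch 1.12 or later:

.. code:: python

    import torch

    # Defaults in PyTorch 1.12+: TF32 matmuls disabled, TF32 cuDNN convolutions enabled.
    print(torch.backends.cuda.matmul.allow_tf32)  # False
    print(torch.backends.cudnn.allow_tf32)        # True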
TF32 tensor cores are designed to achieve better performance on matmul and convolutions on
@@ -80,11 +80,12 @@ matmuls and convolutions are controlled separately, and their corresponding flag
# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True
The precision of matmuls can also be set more broadly (not limited just to CUDA) via :meth:`~torch.set_float32_matmul_precision`.
Note that besides matmuls and convolutions themselves, functions and nn modules that internally use
matmuls or convolutions are also affected. These include `nn.Linear`, `nn.Conv*`, cdist, tensordot,
affine grid and grid sample, adaptive log softmax, GRU and LSTM.
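The broader precision API can be exercised as follows. This is an illustrative sketch rather than part of the diff; the string levels (``"highest"``, ``"high"``, ``"medium"``) are the documented values accepted by :meth:`~torch.set_float32_matmul_precision`:

.. code:: python

    import torch

    # "highest" keeps full FP32 matmuls; "high" permits TF32 (or a comparably
    # fast reduced-precision path); "medium" additionally permits BF16-level precision.
    torch.set_float32_matmul_precision("high")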
To get an idea of the precision and speed, see the example code below:
To get an idea of the precision and speed, see the example code and benchmark data (on A100) below:
.. code:: python
@@ -108,9 +109,12 @@ To get an idea of the precision and speed, see the example code below:
error = (ab_fp32 - ab_full).abs().max() # 0.0031
relative_error = error / mean # 0.000039
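Most of the example's body is elided by this hunk. The following is a reconstruction sketch of the kind of comparison it runs; names such as ``ab_tf32`` are assumptions chosen to match the visible lines, and an Ampere-or-later GPU is assumed:

.. code:: python

    import torch

    a = torch.randn(10240, 10240, device="cuda")
    b = torch.randn(10240, 10240, device="cuda")

    # FP64 reference result and its mean magnitude, used to normalize errors.
    ab_full = (a.double() @ b.double()).float()
    mean = ab_full.abs().mean()

    # Matmul with TF32 enabled.
    torch.backends.cuda.matmul.allow_tf32 = True
    ab_tf32 = a @ b
    error = (ab_tf32 - ab_full).abs().max()    # larger than the FP32 error
    relative_error = error / mean

    # Matmul with TF32 disabled (full FP32).
    torch.backends.cuda.matmul.allow_tf32 = False
    ab_fp32 = a @ b
    error = (ab_fp32 - ab_full).abs().max()    # 0.0031 in the visible lines
    relative_error = error / mean              # 0.000039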
From the above example, we can see that with TF32 enabled, the speed is ~7x faster, and the relative
error compared to double precision is approximately 2 orders of magnitude larger. If full FP32 precision
is needed, users can disable TF32 by:
From the above example, we can see that with TF32 enabled, the speed is ~7x faster on A100, and that
the relative error compared to double precision is approximately 2 orders of magnitude larger. Note that
the exact ratio of TF32 to single-precision speed depends on the hardware generation, since properties
such as the ratio of memory bandwidth to compute and the ratio of TF32 to FP32 matmul throughput
vary from generation to generation and from model to model.
If full FP32 precision is needed, users can disable TF32 by:
.. code:: python
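
    # The body of this snippet is cut off in the diff view; disabling TF32 uses
    # the two backend flags introduced earlier (existing PyTorch API):
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False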


@@ -86,10 +86,10 @@ Analyzing the spectrum of the inputs via :func:`torch.linalg.svdvals` or their c
may help to detect these issues.
TensorFloat-32(TF32) on Nvidia Ampere devices
---------------------------------------------
TensorFloat-32 (TF32) on Nvidia Ampere (and later) devices
----------------------------------------------------------
On Ampere Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
On Ampere (and later) Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
When an operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
This may reduce accuracy and produce surprising results (e.g., multiplying a matrix by the identity matrix may produce results that are different from the input).
By default, TF32 tensor cores are disabled for matrix multiplications and enabled for convolutions, although most neural network workloads have the same convergence behavior when using TF32 as they do with fp32.
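The identity-matrix effect mentioned above can be observed directly. The following is an illustrative sketch (not part of the diff), assuming an Ampere-or-later GPU:

.. code:: python

    import torch

    torch.backends.cuda.matmul.allow_tf32 = True
    x = torch.randn(1024, 1024, device="cuda")
    eye = torch.eye(1024, device="cuda")

    # With TF32, inputs are rounded to 10 mantissa bits before the multiply,
    # so x @ I is generally not bit-identical to x.
    print(torch.equal(x @ eye, x))  # may print False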
@@ -98,7 +98,7 @@ If your network needs full float32 precision for both matrix multiplications and
For more information see :ref:`TensorFloat32<tf32_on_ampere>`.
Reduced Precision Reduction for FP16 and BF16 GEMMs
----------------------------------------------------
Half-precision GEMM operations are typically done with intermediate accumulations (reduction) in single-precision for numerical accuracy and improved resilience to overflow. For performance, certain GPU architectures, especially more recent ones, allow a few truncations of the intermediate accumulation results to the reduced precision (e.g., half-precision). This change is often benign from the perspective of model convergence, though it may lead to unexpected results (e.g., ``inf`` values when the final result should be representable in half-precision).
If reduced-precision reductions are problematic, they can be turned off with
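The snippet referenced here is cut off by the diff. For context, the controls in question are the following matmul backend flags, which are existing PyTorch API rather than something introduced by this change:

.. code:: python

    import torch

    # Keep intermediate GEMM accumulations in full precision.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
    torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False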