diff --git a/docs/source/notes/cuda.rst b/docs/source/notes/cuda.rst
index 0b49f581536..d3ba91c2277 100644
--- a/docs/source/notes/cuda.rst
+++ b/docs/source/notes/cuda.rst
@@ -56,13 +56,13 @@ Below you can find a small example showcasing this::

 .. _tf32_on_ampere:

-TensorFloat-32(TF32) on Ampere devices
---------------------------------------
+TensorFloat-32 (TF32) on Ampere (and later) devices
+---------------------------------------------------

 Starting in PyTorch 1.7, there is a new flag called `allow_tf32`. This flag
 defaults to True in PyTorch 1.7 to PyTorch 1.11, and False in PyTorch 1.12 and later.
 This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores,
-available on new NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies
+available on NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies
 and batched matrix multiplies) and convolutions.

 TF32 tensor cores are designed to achieve better performance on matmul and convolutions on
@@ -80,11 +80,12 @@ matmuls and convolutions are controlled separately, and their corresponding flag
     # The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
     torch.backends.cudnn.allow_tf32 = True

+The precision of matmuls can also be set more broadly (not limited to CUDA) via :func:`torch.set_float32_matmul_precision`.
-Note that besides matmuls and convolutions themselves, functions and nn modules that internally uses
+Note that besides matmuls and convolutions themselves, functions and nn modules that internally use
 matmuls or convolutions are also affected. These include `nn.Linear`, `nn.Conv*`, cdist, tensordot,
 affine grid and grid sample, adaptive log softmax, GRU and LSTM.

-To get an idea of the precision and speed, see the example code below:
+To get an idea of the precision and speed, see the example code and benchmark data (on A100) below:

 .. code:: python

@@ -108,9 +109,12 @@ To get an idea of the precision and speed, see the example code below:
       error = (ab_fp32 - ab_full).abs().max()  # 0.0031
       relative_error = error / mean  # 0.000039

-From the above example, we can see that with TF32 enabled, the speed is ~7x faster, relative error
-compared to double precision is approximately 2 orders of magnitude larger. If full FP32 precision
-is needed, users can disable TF32 by:
+From the above example, we can see that with TF32 enabled, the speed is ~7x faster on A100, and that the
+relative error compared to double precision is approximately 2 orders of magnitude larger. Note that
+the exact ratio of TF32 to single-precision speed depends on the hardware generation, as properties
+such as the ratio of memory bandwidth to compute as well as the ratio of TF32 to FP32 matmul throughput
+may vary from generation to generation or model to model.
+If full FP32 precision is needed, users can disable TF32 by:

 .. code:: python

diff --git a/docs/source/notes/numerical_accuracy.rst b/docs/source/notes/numerical_accuracy.rst
index 07b4c208138..f1ad3183087 100644
--- a/docs/source/notes/numerical_accuracy.rst
+++ b/docs/source/notes/numerical_accuracy.rst
@@ -86,10 +86,10 @@ Analyzing the spectrum of the inputs via :func:`torch.linalg.svdvals` or their c
 may help to detect these issues.

-TensorFloat-32(TF32) on Nvidia Ampere devices
----------------------------------------------
+TensorFloat-32 (TF32) on Nvidia Ampere (and later) devices
+----------------------------------------------------------

-On Ampere Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
+On Ampere (and later) Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
 When an operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
 This may reduce accuracy and produce surprising results (e.g., multiplying a matrix by the identity matrix may produce results that are different from the input).
 By default, TF32 tensor cores are disabled for matrix multiplications and enabled for convolutions, although most neural network workloads have the same convergence behavior when using TF32 as they have with fp32.
@@ -98,7 +98,7 @@ If your network needs full float32 precision for both matrix multiplications and
 For more information see :ref:`TensorFloat32`.

-Reduced Precision Reduction for FP16 and BF16 GEMMs 
+Reduced Precision Reduction for FP16 and BF16 GEMMs
 ----------------------------------------------------

-Half-precision GEMM operations are typically done with intermediate accumulations (reduction) in single-precision for numerical accuracy and improved resilience to overflow. For performance, certain GPU architectures, especially more recent ones, allow a few truncations of the intermediate accumulation results to the reduced precision (e.g., half-precision). This change is often benign from the perspective of model convergence, though it may lead to unexpected results (e.g., ``inf`` values when the final result should be be representable in half-precision). If reduced-precision reductions are problematic, they can be turned off with
+Half-precision GEMM operations are typically done with intermediate accumulations (reduction) in single-precision for numerical accuracy and improved resilience to overflow. For performance, certain GPU architectures, especially more recent ones, allow a few truncations of the intermediate accumulation results to the reduced precision (e.g., half-precision). This change is often benign from the perspective of model convergence, though it may lead to unexpected results (e.g., ``inf`` values when the final result should be representable in half-precision). If reduced-precision reductions are problematic, they can be turned off with
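
Both files in this patch hinge on the same numeric fact: TF32 keeps only the 10 most significant bits of a float32 mantissa. As a rough, self-contained illustration, the sketch below simulates that truncation in pure Python. The helper `round_to_tf32` is hypothetical (not part of PyTorch, and not how tensor cores are actually implemented); it assumes simple round-to-nearest on the discarded bits:

```python
import struct

def round_to_tf32(x: float) -> float:
    """Simulate TF32 input rounding: keep 10 of float32's 23 mantissa bits.

    Hypothetical illustration only; real tensor cores may differ in
    tie-breaking details. Valid for ordinary finite inputs.
    """
    # Reinterpret the value's float32 bit pattern as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # 23 - 10 = 13 low mantissa bits are discarded. Adding half of their
    # range first rounds the kept bits to nearest (a carry may correctly
    # propagate into the exponent field).
    bits = (bits + (1 << 12)) & ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(round_to_tf32(1.0))  # exactly representable in 10 mantissa bits: unchanged
print(round_to_tf32(0.1))  # rounded; relative error stays below 2**-11
```

For example, 1.0 survives unchanged (its mantissa bits beyond the tenth are zero), while 0.1 picks up a relative error of roughly 2**-12 per input, consistent with the roughly-two-orders-of-magnitude accuracy gap versus double precision quoted in cuda.rst.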