mirror of https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
[DOCS][CUDA] Update TF32 docs for sm90 (#111337)
For #110252.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111337
Approved by: https://github.com/msaroufim
This commit is contained in:

parent 503f44fbb8
commit 894b9957c8

2 changed files with 15 additions and 11 deletions
docs/source/notes/cuda.rst
@@ -56,13 +56,13 @@ Below you can find a small example showcasing this::

 .. _tf32_on_ampere:

-TensorFloat-32(TF32) on Ampere devices
---------------------------------------
+TensorFloat-32 (TF32) on Ampere (and later) devices
+---------------------------------------------------

 Starting in PyTorch 1.7, there is a new flag called `allow_tf32`. This flag
 defaults to True in PyTorch 1.7 to PyTorch 1.11, and False in PyTorch 1.12 and later.
 This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores,
-available on new NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies
+available on NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies
 and batched matrix multiplies) and convolutions.

 TF32 tensor cores are designed to achieve better performance on matmul and convolutions on
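
As context for readers of this diff: a minimal sketch of the two independent flags the paragraph above refers to, shown here with both enabled. This is an illustration, not part of the commit.

.. code:: python

    import torch

    # The two switches described above: one for matmuls, one for convolutions.
    torch.backends.cuda.matmul.allow_tf32 = True  # matmuls (and batched matmuls) may use TF32
    torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions may use TF32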

@@ -80,11 +80,12 @@ matmuls and convolutions are controlled separately, and their corresponding flag

     # The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
     torch.backends.cudnn.allow_tf32 = True

 The precision of matmuls can also be set more broadly (limited not just to CUDA) via :meth:`~torch.set_float32_matmul_precision`.
 Note that besides matmuls and convolutions themselves, functions and nn modules that internally use
 matmuls or convolutions are also affected. These include `nn.Linear`, `nn.Conv*`, cdist, tensordot,
 affine grid and grid sample, adaptive log softmax, GRU and LSTM.

-To get an idea of the precision and speed, see the example code below:
+To get an idea of the precision and speed, see the example code and benchmark data (on A100) below:

 .. code:: python
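
A brief illustration of the matmul-precision setter referenced in the hunk above; the value "high" is chosen here only as an example ("highest", "high", and "medium" are the accepted settings).

.. code:: python

    import torch

    # "highest" keeps full float32 precision; "high" and "medium" permit
    # lower-precision fast paths such as TF32 for float32 matmuls.
    torch.set_float32_matmul_precision("high")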

@@ -108,9 +109,12 @@ To get an idea of the precision and speed, see the example code below:

     error = (ab_fp32 - ab_full).abs().max()  # 0.0031
     relative_error = error / mean  # 0.000039

-From the above example, we can see that with TF32 enabled, the speed is ~7x faster, relative error
-compared to double precision is approximately 2 orders of magnitude larger. If full FP32 precision
-is needed, users can disable TF32 by:
+From the above example, we can see that with TF32 enabled, the speed is ~7x faster on A100, and that
+relative error compared to double precision is approximately 2 orders of magnitude larger. Note that
+the exact ratio of TF32 to single precision speed depends on the hardware generation, as properties
+such as the ratio of memory bandwidth to compute as well as the ratio of TF32 to FP32 matmul throughput
+may vary from generation to generation or model to model.
+If full FP32 precision is needed, users can disable TF32 by:

 .. code:: python
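
For readers who want to reproduce the kind of numbers quoted above, a hedged sketch along the lines of the docs' own example. The matrix size is an assumption; the ~7x speedup and error figures in the hunk were measured on A100, not by this snippet.

.. code:: python

    import torch

    # Requires a CUDA device with TF32-capable tensor cores (Ampere or later).
    a_full = torch.randn(10240, 10240, dtype=torch.double, device="cuda")
    b_full = torch.randn(10240, 10240, dtype=torch.double, device="cuda")
    ab_full = a_full @ b_full          # float64 reference result
    mean = ab_full.abs().mean()        # scale used for the relative error

    a = a_full.float()
    b = b_full.float()

    # TF32 enabled: faster, with reduced mantissa precision.
    torch.backends.cuda.matmul.allow_tf32 = True
    ab_tf32 = a @ b

    # TF32 disabled: full float32 precision.
    torch.backends.cuda.matmul.allow_tf32 = False
    ab_fp32 = a @ b

    error = (ab_fp32 - ab_full).abs().max()
    relative_error = error / mean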

docs/source/notes/numerical_accuracy.rst
@@ -86,10 +86,10 @@ Analyzing the spectrum of the inputs via :func:`torch.linalg.svdvals` or their c

 may help to detect these issues.

-TensorFloat-32(TF32) on Nvidia Ampere devices
----------------------------------------------
+TensorFloat-32(TF32) on Nvidia Ampere (and later) devices
+---------------------------------------------------------

-On Ampere Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
+On Ampere (and later) Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
 When an operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
 This may reduce accuracy and produce surprising results (e.g., multiplying a matrix by the identity matrix may produce results that are different from the input).
 By default, TF32 tensor cores are disabled for matrix multiplications and enabled for convolutions, although most neural network workloads have the same convergence behavior when using TF32 as they have with fp32.
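
The defaults described in this paragraph can be inspected directly; a small sketch, assuming PyTorch 1.12 or later (where the matmul flag defaults to False).

.. code:: python

    import torch

    # Inspect the TF32 defaults described above.
    print(torch.backends.cuda.matmul.allow_tf32)  # False: TF32 matmuls disabled by default
    print(torch.backends.cudnn.allow_tf32)        # True: TF32 convolutions enabled by default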

@@ -98,7 +98,7 @@ If your network needs full float32 precision for both matrix multiplications and

 For more information see :ref:`TensorFloat32<tf32_on_ampere>`.

 Reduced Precision Reduction for FP16 and BF16 GEMMs
 ----------------------------------------------------

 Half-precision GEMM operations are typically done with intermediate accumulations (reduction) in single-precision for numerical accuracy and improved resilience to overflow. For performance, certain GPU architectures, especially more recent ones, allow a few truncations of the intermediate accumulation results to the reduced precision (e.g., half-precision). This change is often benign from the perspective of model convergence, though it may lead to unexpected results (e.g., ``inf`` values when the final result should be representable in half-precision).
 If reduced-precision reductions are problematic, they can be turned off with
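
The hunk is cut off before showing the flags; for reference, the switches this sentence points to are the reduced-precision-reduction flags on `torch.backends.cuda.matmul`. A sketch, assuming a recent PyTorch where both flags exist:

.. code:: python

    import torch

    # Force full-precision (fp32) accumulation for half-precision GEMMs.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
    torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False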