Yu, Guangye c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## For `torch.amp.GradScaler`
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.
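For instance, a minimal sketch of the migration in a typical mixed-precision training step (assuming a CUDA device; the model, optimizer, and batch here are placeholders):

```python
import torch

# Deprecated spelling (now emits a deprecation warning):
#   scaler = torch.cuda.amp.GradScaler()
# Device-agnostic replacement recommended by this PR:
scaler = torch.amp.GradScaler("cuda")

model = torch.nn.Linear(8, 8).cuda()                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer
x = torch.randn(4, 8, device="cuda")                      # placeholder batch

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()

# The scaling workflow itself is unchanged by the deprecation.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```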

## For `custom_fwd` and `custom_bwd`
These decorators are a good mechanism for letting a custom autograd function run correctly, with or without autocast effects, even inside an autocast-enabled region, and the mechanism can be shared by other backends, like CPU and XPU.
So we generalize them to be device-agnostic, move them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.
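A minimal sketch of a custom autograd function using the device-agnostic decorators (mirroring the `MyMM` pattern from the AMP examples docs; note the `device_type` keyword required by the `torch.amp` spelling):

```python
import torch

class MyMM(torch.autograd.Function):
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda", cast_inputs=torch.float32)
    def forward(ctx, a, b):
        # In an autocast-enabled region, floating-point inputs are cast to
        # float32 and autocast is disabled for the body; with autocast off,
        # the decorator has no effect.
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)
```

The deprecated `torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)` spelling keeps working but now warns; for other backends, only the `device_type` argument changes (e.g. `"cpu"` or `"xpu"`).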

# Additional Context
Add a UT to cover the deprecation warning (a sketch follows below).
No additional UTs are needed for the functionality of `torch.amp.custom_fwd`/`custom_bwd`; the existing UTs that previously covered `torch.cuda.amp.custom_fwd`/`custom_bwd` already exercise them.
To facilitate review, we split these code changes into two PRs: the first covers `torch.amp.GradScaler`, and the follow-up covers `custom_fwd` and `custom_bwd`.
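A sketch of such a deprecation-warning UT (this assumes the deprecated constructors surface a `FutureWarning`, the usual category for PyTorch deprecations; the test class and method names are illustrative):

```python
import unittest
import torch

class TestGradScalerDeprecation(unittest.TestCase):
    def test_deprecated_constructor_warns(self):
        # Uses the CPU variant so the test does not require a GPU.
        with self.assertWarns(FutureWarning):
            torch.cpu.amp.GradScaler()

if __name__ == "__main__":
    unittest.main()
```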

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00