pytorch/test/distributed
Chirag Pandya 485497e727 [c10d][fr] flight recorder improvements (#143446)
Summary:
1. Flight recorder traces are now dumped by default upon timeout or
   exception. Users no longer need to opt in.
2. Change the default dump location to the `.cache/fr_trace` folder in
   the running user's home directory.

Test Plan:
1. Tested locally by running the crash program from the flight recorder
   tutorial page.
   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example
2. Confirmed that the flight recorder files were created:
❯ pwd
/home/cpio/.cache/fr_trace
❯ ls
nccl_trace_rank_0  nccl_trace_rank_1
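
The manual `pwd`/`ls` check above could also be scripted. The following is a minimal sketch, not part of the PR: it assumes the dump directory is `~/.cache/fr_trace` and that files are named `nccl_trace_rank_<N>`, as in the listing above. The helper name `find_fr_dumps` is hypothetical, and the sketch only inspects the filesystem without calling any PyTorch API.

```python
import re
from pathlib import Path

# File name pattern taken from the listing above: nccl_trace_rank_<N>.
_RANK_FILE = re.compile(r"nccl_trace_rank_(\d+)")


def find_fr_dumps(dump_dir=None):
    """Return {rank: path} for dump files found in dump_dir.

    Hypothetical helper. The default directory (~/.cache/fr_trace) is an
    assumption based on the path shown in the test plan.
    """
    if dump_dir is None:
        dump_dir = Path.home() / ".cache" / "fr_trace"
    dump_dir = Path(dump_dir)
    if not dump_dir.is_dir():
        # No dump directory means no dumps were written (or they went elsewhere).
        return {}
    dumps = {}
    for entry in dump_dir.iterdir():
        match = _RANK_FILE.fullmatch(entry.name)
        if match and entry.is_file():
            dumps[int(match.group(1))] = entry
    # Sort by rank so the output order is stable.
    return dict(sorted(dumps.items()))


if __name__ == "__main__":
    for rank, path in find_fr_dumps().items():
        print(f"rank {rank}: {path} ({path.stat().st_size} bytes)")
```

Passing an explicit `dump_dir` keeps the helper usable if the dump location is configured differently on a given machine.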

Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446
Approved by: https://github.com/d4l3k
2024-12-20 20:41:30 +00:00
_composable [FSDP2] Clamp reduce_dtype in lazy init (#143297) 2024-12-17 00:25:08 +00:00
_shard Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
_tensor Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
_tools [FSDP2] Move to public torch.distributed.fsdp (#141868) 2024-12-07 01:24:28 +00:00
algorithms Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
bin
checkpoint [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845) 2024-12-19 05:06:41 +00:00
elastic Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
flight_recorder [fr] recognize all_reduce_barrier as a valid op (#143354) 2024-12-17 21:09:18 +00:00
fsdp Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
launcher Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
nn/jit
optim
pipelining Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
rpc
tensor/parallel [fused_all_gather_matmul] introduce an argument to specify whether the all-gather result needs to be returned (#143159) 2024-12-17 01:07:27 +00:00
argparse_util_test.py
test_backends.py API to retrieve default distributed backend from device (#140536) 2024-11-22 11:01:53 +00:00
test_c10d_common.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_c10d_functional_native.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_c10d_gloo.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_c10d_logger.py [c10d] Switch all timer logging in c10d to wait_counter (#141154) 2024-11-21 01:10:11 +00:00
test_c10d_nccl.py [c10d][fr] flight recorder improvements (#143446) 2024-12-20 20:41:30 +00:00
test_c10d_object_collectives.py
test_c10d_ops_nccl.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_c10d_pypg.py DistributedDataParallel: add init_sync option to control collectives during initialization (#142824) 2024-12-11 20:28:38 +00:00
test_c10d_spawn.py
test_c10d_spawn_gloo.py
test_c10d_spawn_nccl.py
test_c10d_spawn_ucc.py
test_c10d_ucc.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_collective_utils.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_compute_comm_reordering.py
test_control_collectives.py
test_data_parallel.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_device_mesh.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_distributed_spawn.py
test_dynamo_distributed.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_fake_pg.py
test_functional_api.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_inductor_collectives.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_launcher.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_multi_threaded_pg.py
test_nccl.py
test_pg_wrapper.py
test_store.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_symmetric_memory.py [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160) 2024-12-17 01:07:27 +00:00