docs: replace torch.distributed.run by torchrun (#27528)

* docs: replace torch.distributed.run by torchrun

 `transformers` now officially supports PyTorch >= 1.10.
 The entrypoint `torchrun` is available from 1.10 onwards.

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

* Update src/transformers/trainer.py

with @ArthurZucker's suggestion

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Peter Pan 2023-11-28 00:26:33 +08:00 committed by GitHub
parent c832bcb812
commit ce31508134
25 changed files with 46 additions and 46 deletions
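
One practical difference when moving a script from `python -m torch.distributed.launch` to `torchrun`: the legacy launcher passes the local rank as a `--local_rank` command-line argument by default, whereas `torchrun` exposes it through the `LOCAL_RANK` environment variable. The sketch below is a hypothetical helper (the name `resolve_local_rank` is not part of `transformers` or PyTorch) showing one way a standalone script could accept both. Scripts built on `Trainer` normally do not need this.

```python
import argparse
import os


def resolve_local_rank(argv=None) -> int:
    """Return the local rank from LOCAL_RANK (torchrun) or --local_rank (legacy launcher).

    Hypothetical helper for illustration; returns -1 when neither source is set.
    """
    parser = argparse.ArgumentParser()
    # The legacy launcher appends --local_rank=<n> to the script arguments.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=-1)
    args, _ = parser.parse_known_args(argv)

    # torchrun sets LOCAL_RANK in the environment of every worker process.
    env_rank = os.environ.get("LOCAL_RANK")
    if env_rank is not None:
        return int(env_rank)
    return args.local_rank


if __name__ == "__main__":
    print(f"local rank: {resolve_local_rank()}")
```

Preferring the environment variable keeps the script working under `torchrun`, while the argument fallback keeps it working with older launch commands.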

View file

@ -152,7 +152,7 @@ You are not required to read the following guidelines before opening an issue. H
```bash ```bash
cd examples/seq2seq cd examples/seq2seq
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \ torchrun --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \ --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
--output_dir output_dir --overwrite_output_dir \ --output_dir output_dir --overwrite_output_dir \
--do_train --n_train 500 --num_train_epochs 1 \ --do_train --n_train 500 --num_train_epochs 1 \

View file

@ -130,7 +130,7 @@ Der [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) unt
- Legen Sie die Anzahl der zu verwendenden GPUs mit dem Argument `nproc_per_node` fest. - Legen Sie die Anzahl der zu verwendenden GPUs mit dem Argument `nproc_per_node` fest.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -287,7 +287,7 @@ The information in this section isn't not specific to the DeepSpeed integration
For the duration of this section let's assume that you have 2 nodes with 8 gpus each. And you can reach the first node with `ssh hostname1` and second node with `ssh hostname2`, and both must be able to reach each other via ssh locally without a password. Of course, you will need to rename these host (node) names to the actual host names you are working with. For the duration of this section let's assume that you have 2 nodes with 8 gpus each. And you can reach the first node with `ssh hostname1` and second node with `ssh hostname2`, and both must be able to reach each other via ssh locally without a password. Of course, you will need to rename these host (node) names to the actual host names you are working with.
#### The torch.distributed.run launcher #### The torch.distributed.run(torchrun) launcher
For example, to use `torch.distributed.run`, you could do: For example, to use `torch.distributed.run`, you could do:

View file

@ -206,7 +206,7 @@ Let's discuss how you can tell your program which GPUs are to be used and in wha
When using [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) to use only a subset of your GPUs, you simply specify the number of GPUs to use. For example, if you have 4 GPUs, but you wish to use the first 2 you can do: When using [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) to use only a subset of your GPUs, you simply specify the number of GPUs to use. For example, if you have 4 GPUs, but you wish to use the first 2 you can do:
```bash ```bash
python -m torch.distributed.launch --nproc_per_node=2 trainer-program.py ... torchrun --nproc_per_node=2 trainer-program.py ...
``` ```
if you have either [`accelerate`](https://github.com/huggingface/accelerate) or [`deepspeed`](https://github.com/microsoft/DeepSpeed) installed you can also accomplish the same by using one of: if you have either [`accelerate`](https://github.com/huggingface/accelerate) or [`deepspeed`](https://github.com/microsoft/DeepSpeed) installed you can also accomplish the same by using one of:
@ -233,7 +233,7 @@ If you have multiple GPUs and you'd like to use only 1 or a few of those GPUs, s
For example, let's say you have 4 GPUs: 0, 1, 2 and 3. To run only on the physical GPUs 0 and 2, you can do: For example, let's say you have 4 GPUs: 0, 1, 2 and 3. To run only on the physical GPUs 0 and 2, you can do:
```bash ```bash
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py ... CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
``` ```
So now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped to `cuda:0` and `cuda:1` correspondingly. So now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped to `cuda:0` and `cuda:1` correspondingly.
@ -241,7 +241,7 @@ So now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped
You can even change their order: You can even change their order:
```bash ```bash
CUDA_VISIBLE_DEVICES=2,0 python -m torch.distributed.launch trainer-program.py ... CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
``` ```
Here your physical GPUs 0 and 2 are mapped to `cuda:1` and `cuda:0` correspondingly. Here your physical GPUs 0 and 2 are mapped to `cuda:1` and `cuda:0` correspondingly.
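
The remapping described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, assuming `CUDA_VISIBLE_DEVICES` holds a comma-separated list of integer device ids (the variable can also hold GPU UUIDs, which this sketch does not handle). The function name `logical_device_map` is hypothetical and no GPU or PyTorch install is needed to run it.

```python
from typing import Dict


def logical_device_map(cuda_visible_devices: str) -> Dict[int, str]:
    """Map each physical GPU id to the logical device name the process would see.

    The position of an id in the list decides its logical index:
    the first listed GPU becomes cuda:0, the second cuda:1, and so on.
    """
    physical_ids = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return {phys: f"cuda:{idx}" for idx, phys in enumerate(physical_ids)}


print(logical_device_map("0,2"))  # {0: 'cuda:0', 2: 'cuda:1'}
print(logical_device_map("2,0"))  # {2: 'cuda:0', 0: 'cuda:1'}
```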
@ -263,7 +263,7 @@ As with any environment variable you can, of course, export those instead of add
```bash ```bash
export CUDA_VISIBLE_DEVICES=0,2 export CUDA_VISIBLE_DEVICES=0,2
python -m torch.distributed.launch trainer-program.py ... torchrun trainer-program.py ...
``` ```
but this approach can be confusing since you may forget you set up the environment variable earlier and not understand why the wrong GPUs are used. Therefore, it's a common practice to set the environment variable just for a specific run on the same command line as it's shown in most examples of this section. but this approach can be confusing since you may forget you set up the environment variable earlier and not understand why the wrong GPUs are used. Therefore, it's a common practice to set the environment variable just for a specific run on the same command line as it's shown in most examples of this section.

View file

@ -134,7 +134,7 @@ Here is the full benchmark code and outputs:
```bash ```bash
# DDP w/ NVLink # DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -143,7 +143,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch
# DDP w/o NVLink # DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -153,7 +153,7 @@ python examples/pytorch/language-modeling/run_clm.py \
``` ```
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -164,7 +164,7 @@ python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-
``` ```
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -130,7 +130,7 @@ The [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) sup
- Set the number of GPUs to use with the `nproc_per_node` argument. - Set the number of GPUs to use with the `nproc_per_node` argument.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -130,7 +130,7 @@ python examples/tensorflow/summarization/run_summarization.py \
- Establece la cantidad de GPU que se usará con el argumento `nproc_per_node`. - Establece la cantidad de GPU que se usará con el argumento `nproc_per_node`.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -134,7 +134,7 @@ Ecco il codice benchmark completo e gli output:
```bash ```bash
# DDP w/ NVLink # DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -143,7 +143,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch
# DDP w/o NVLink # DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -130,7 +130,7 @@ Il [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) supp
- Imposta un numero di GPU da usare con l'argomento `nproc_per_node`. - Imposta un numero di GPU da usare con l'argomento `nproc_per_node`.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -196,7 +196,7 @@ _python_、_numpy_、および _pytorch_ の RNG 状態は、そのチェック
[`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.Parallel.DistributedDataParallel.html) を使用して GPU のサブセットのみを使用する場合、使用する GPU の数を指定するだけです。 。たとえば、GPU が 4 つあるが、最初の 2 つを使用したい場合は、次のようにします。 [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.Parallel.DistributedDataParallel.html) を使用して GPU のサブセットのみを使用する場合、使用する GPU の数を指定するだけです。 。たとえば、GPU が 4 つあるが、最初の 2 つを使用したい場合は、次のようにします。
```bash ```bash
python -m torch.distributed.launch --nproc_per_node=2 trainer-program.py ... torchrun --nproc_per_node=2 trainer-program.py ...
``` ```
[`accelerate`](https://github.com/huggingface/accelerate) または [`deepspeed`](https://github.com/microsoft/DeepSpeed) がインストールされている場合は、次を使用して同じことを達成することもできます。の一つ: [`accelerate`](https://github.com/huggingface/accelerate) または [`deepspeed`](https://github.com/microsoft/DeepSpeed) がインストールされている場合は、次を使用して同じことを達成することもできます。の一つ:
@ -223,7 +223,7 @@ deepspeed --num_gpus 2 trainer-program.py ...
たとえば、4 つの GPU (0、1、2、3) があるとします。物理 GPU 0 と 2 のみで実行するには、次のようにします。 たとえば、4 つの GPU (0、1、2、3) があるとします。物理 GPU 0 と 2 のみで実行するには、次のようにします。
```bash ```bash
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py ... CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
``` ```
したがって、pytorch は 2 つの GPU のみを認識し、物理 GPU 0 と 2 はそれぞれ `cuda:0``cuda:1` にマッピングされます。 したがって、pytorch は 2 つの GPU のみを認識し、物理 GPU 0 と 2 はそれぞれ `cuda:0``cuda:1` にマッピングされます。
@ -231,7 +231,7 @@ CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py .
順序を変更することもできます。 順序を変更することもできます。
```bash ```bash
CUDA_VISIBLE_DEVICES=2,0 python -m torch.distributed.launch trainer-program.py ... CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
``` ```
ここでは、物理 GPU 0 と 2 がそれぞれ`cuda:1`と`cuda:0`にマッピングされています。 ここでは、物理 GPU 0 と 2 がそれぞれ`cuda:1`と`cuda:0`にマッピングされています。
@ -253,7 +253,7 @@ CUDA_VISIBLE_DEVICES= python trainer-program.py ...
```bash ```bash
export CUDA_VISIBLE_DEVICES=0,2 export CUDA_VISIBLE_DEVICES=0,2
python -m torch.distributed.launch trainer-program.py ... torchrun trainer-program.py ...
``` ```
ただし、この方法では、以前に環境変数を設定したことを忘れて、なぜ間違った GPU が使用されているのか理解できない可能性があるため、混乱を招く可能性があります。したがって、このセクションのほとんどの例で示されているように、同じコマンド ラインで特定の実行に対してのみ環境変数を設定するのが一般的です。 ただし、この方法では、以前に環境変数を設定したことを忘れて、なぜ間違った GPU が使用されているのか理解できない可能性があるため、混乱を招く可能性があります。したがって、このセクションのほとんどの例で示されているように、同じコマンド ラインで特定の実行に対してのみ環境変数を設定するのが一般的です。

View file

@ -139,7 +139,7 @@ NVLinkを使用すると、トレーニングが約23速く完了すること
```bash ```bash
# DDP w/ NVLink # DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -148,7 +148,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch
# DDP w/o NVLink # DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -143,7 +143,7 @@ python examples/pytorch/language-modeling/run_clm.py \
# DDP w/ NVlink # DDP w/ NVlink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -151,7 +151,7 @@ python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-
# DDP w/o NVlink # DDP w/o NVlink
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -140,7 +140,7 @@ python examples/tensorflow/summarization/run_summarization.py \
以下は提供されたBashコードです。このコードの日本語訳をMarkdown形式で記載します。 以下は提供されたBashコードです。このコードの日本語訳をMarkdown形式で記載します。
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -135,7 +135,7 @@ NVLink 사용 시 훈련이 약 23% 더 빠르게 완료됨을 확인할 수 있
```bash ```bash
# DDP w/ NVLink # DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -144,7 +144,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch
# DDP w/o NVLink # DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -145,7 +145,7 @@ python examples/pytorch/language-modeling/run_clm.py \
# DDP w/ NVlink # DDP w/ NVlink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -153,7 +153,7 @@ python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-
# DDP w/o NVlink # DDP w/o NVlink
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -141,7 +141,7 @@ python examples/tensorflow/summarization/run_summarization.py \
- `nproc_per_node` 인수를 추가해 사용할 GPU 개수를 설정합니다. - `nproc_per_node` 인수를 추가해 사용할 GPU 개수를 설정합니다.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -131,7 +131,7 @@ O [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) ofere
- Defina o número de GPUs a serem usadas com o argumento `nproc_per_node`. - Defina o número de GPUs a serem usadas com o argumento `nproc_per_node`.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -135,7 +135,7 @@ GPU1 PHB X 0-11 N/A
```bash ```bash
# DDP w/ NVLink # DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@ -144,7 +144,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch
# DDP w/o NVLink # DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

View file

@ -133,7 +133,7 @@ python examples/tensorflow/summarization/run_summarization.py \
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \ --fp16 \
--model_name_or_path t5-small \ --model_name_or_path t5-small \

View file

@ -18,7 +18,7 @@ in Huang et al. [Improve Transformer Models with Better Relative Position Embedd
```bash ```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \ torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \ --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
--dataset_name squad \ --dataset_name squad \
--do_train \ --do_train \
@ -46,7 +46,7 @@ gpu training leads to the f1 score of 90.71.
```bash ```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \ torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \ --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
--dataset_name squad \ --dataset_name squad \
--do_train \ --do_train \
@ -68,7 +68,7 @@ Training with the above command leads to the f1 score of 93.52, which is slightl
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1: Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash ```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \ torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path bert-large-uncased-whole-word-masking \ --model_name_or_path bert-large-uncased-whole-word-masking \
--dataset_name squad \ --dataset_name squad \
--do_train \ --do_train \

View file

@ -140,7 +140,7 @@ python finetune_trainer.py --help
For multi-gpu training use `torch.distributed.launch`, e.g. with 2 gpus: For multi-gpu training use `torch.distributed.launch`, e.g. with 2 gpus:
```bash ```bash
python -m torch.distributed.launch --nproc_per_node=2 finetune_trainer.py ... torchrun --nproc_per_node=2 finetune_trainer.py ...
``` ```
**At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.** **At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.**
@ -214,7 +214,7 @@ because it uses SortishSampler to minimize padding. You can also use it on 1 GPU
`{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs. `{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs.
```bash ```bash
python -m torch.distributed.launch --nproc_per_node=8 run_distributed_eval.py \ torchrun --nproc_per_node=8 run_distributed_eval.py \
--model_name sshleifer/distilbart-large-xsum-12-3 \ --model_name sshleifer/distilbart-large-xsum-12-3 \
--save_dir xsum_generations \ --save_dir xsum_generations \
--data_dir xsum \ --data_dir xsum \

View file

@ -98,7 +98,7 @@ the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
use the following command: use the following command:
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node number_of_gpu_you_have path_to_script.py \ --nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script --all_arguments_of_the_script
``` ```
@ -107,7 +107,7 @@ As an example, here is how you would fine-tune the BERT large model (with whole
classification MNLI task using the `run_glue` script, with 8 GPUs: classification MNLI task using the `run_glue` script, with 8 GPUs:
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 pytorch/text-classification/run_glue.py \ --nproc_per_node 8 pytorch/text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \ --model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \ --task_name mnli \

View file

@ -100,7 +100,7 @@ of **0.35**.
The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision. The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 run_speech_recognition_ctc.py \ --nproc_per_node 8 run_speech_recognition_ctc.py \
--dataset_name="common_voice" \ --dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \ --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
@ -147,7 +147,7 @@ However, the `--shuffle_buffer_size` argument controls how many examples we can
```bash ```bash
**python -m torch.distributed.launch \ **torchrun \
--nproc_per_node 4 run_speech_recognition_ctc_streaming.py \ --nproc_per_node 4 run_speech_recognition_ctc_streaming.py \
--dataset_name="common_voice" \ --dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \ --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
@ -404,7 +404,7 @@ If training on a different language, you should be sure to change the `language`
#### Multi GPU Whisper Training #### Multi GPU Whisper Training
The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 2 GPU devices in half-precision: The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 2 GPU devices in half-precision:
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 2 run_speech_recognition_seq2seq.py \ --nproc_per_node 2 run_speech_recognition_seq2seq.py \
--model_name_or_path="openai/whisper-small" \ --model_name_or_path="openai/whisper-small" \
--dataset_name="mozilla-foundation/common_voice_11_0" \ --dataset_name="mozilla-foundation/common_voice_11_0" \
@ -572,7 +572,7 @@ cross-entropy loss of **0.405** and word error rate of **0.0728**.
The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision. The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision.
```bash ```bash
python -m torch.distributed.launch \ torchrun \
--nproc_per_node 8 run_speech_recognition_seq2seq.py \ --nproc_per_node 8 run_speech_recognition_seq2seq.py \
--dataset_name="librispeech_asr" \ --dataset_name="librispeech_asr" \
--model_name_or_path="./" \ --model_name_or_path="./" \

View file

@ -1595,7 +1595,7 @@ class Trainer:
# references registered here no longer work on other gpus, breaking the module # references registered here no longer work on other gpus, breaking the module
raise ValueError( raise ValueError(
"Currently --debug underflow_overflow is not supported under DP. Please use DDP" "Currently --debug underflow_overflow is not supported under DP. Please use DDP"
" (torch.distributed.launch)." " (torchrun or torch.distributed.launch (deprecated))."
) )
else: else:
debug_overflow = DebugUnderflowOverflow(self.model) # noqa debug_overflow = DebugUnderflowOverflow(self.model) # noqa