Convert Trainer doc page to MarkDown (#14753)
* Convert Trainer doc page to MarkDown
* Fix repo consistency
* Fix the doc build test job
This commit is contained in:
parent
e926ea2bdd
commit
7533d30acd
4 changed files with 553 additions and 633 deletions
.github/workflows/build_doc_test.yml (vendored): 2 changes
@@ -46,4 +46,4 @@ jobs:
       - name: Make documentation
         run: |
-          doc-builder build transformers ./transformers/docs/source
+          doc-builder build transformers ./docs/source
docs/source/main_classes/trainer.mdx (new file): 550 additions
@@ -0,0 +1,550 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Trainer

The [`Trainer`] class provides an API for feature-complete training in PyTorch for most standard use cases. It's used in most of the [example scripts](../examples).

Before instantiating your [`Trainer`], create a [`TrainingArguments`] to access all the points of customization during training.
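
For example, a minimal sketch of a training setup might look like this (`train_ds` and `eval_ds` are placeholders for datasets you have already prepared and tokenized):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="output_dir",          # where checkpoints and logs are written
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # placeholder: your prepared training dataset
    eval_dataset=eval_ds,    # placeholder: your prepared evaluation dataset
)
trainer.train()
```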

The API supports distributed training on multiple GPUs/TPUs, mixed precision through [NVIDIA Apex](https://github.com/NVIDIA/apex) and Native AMP for PyTorch.

The [`Trainer`] contains the basic training loop which supports the above features. To inject custom behavior you can subclass it and override the following methods:

- **get_train_dataloader** -- Creates the training DataLoader.
- **get_eval_dataloader** -- Creates the evaluation DataLoader.
- **get_test_dataloader** -- Creates the test DataLoader.
- **log** -- Logs information on the various objects watching training.
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
  init. Note that you can also subclass or override the `create_optimizer` and `create_scheduler` methods
  separately.
- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
- **compute_loss** -- Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step.
- **evaluate** -- Runs an evaluation loop and returns metrics.
- **predict** -- Returns predictions (with metrics if labels are available) on a test set.

<Tip warning={true}>

The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors
when you use it on other models. When using it on your own model, make sure:

- your model always returns tuples or subclasses of [`~file_utils.ModelOutput`].
- your model can compute the loss if a `labels` argument is provided and that loss is returned as the first
  element of the tuple (if your model returns tuples)
- your model can accept multiple label arguments (use the `label_names` in your [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"`.

</Tip>

Here is an example of how to customize [`Trainer`] using a custom loss function for multi-label classification:

```python
from torch import nn
from transformers import Trainer


class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = nn.BCEWithLogitsLoss()
        loss = loss_fct(
            logits.view(-1, self.model.config.num_labels),
            labels.float().view(-1, self.model.config.num_labels),
        )
        return (loss, outputs) if return_outputs else loss
```

Another way to customize the training loop behavior for the PyTorch [`Trainer`] is to use [callbacks](callback) that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early stopping).
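
For example, here is a minimal sketch of a custom callback (the class name and the printed format are purely illustrative) that reports the loss every time the [`Trainer`] logs metrics:

```python
from transformers import TrainerCallback


class PrintLossCallback(TrainerCallback):
    # Runs every time the Trainer logs metrics (frequency is controlled by `logging_steps`).
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "loss" in logs:
            print(f"step {state.global_step}: loss = {logs['loss']:.4f}")
```

You would then pass it at init, e.g. `Trainer(..., callbacks=[PrintLossCallback()])`.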

## Trainer

[[autodoc]] Trainer
    - all

## Seq2SeqTrainer

[[autodoc]] Seq2SeqTrainer
    - evaluate
    - predict

## TrainingArguments

[[autodoc]] TrainingArguments
    - all

## Seq2SeqTrainingArguments

[[autodoc]] Seq2SeqTrainingArguments
    - all

## Checkpoints

By default, [`Trainer`] will save all checkpoints in the `output_dir` you set in the
[`TrainingArguments`] you are using. Those will go in a subfolder named `checkpoint-xxx`, with xxx
being the step the training was at.

Resuming training from a checkpoint can be done when calling [`Trainer.train`] with either:

- `resume_from_checkpoint=True` which will resume training from the latest checkpoint
- `resume_from_checkpoint=checkpoint_dir` which will resume training from the specific checkpoint in the directory
  passed.
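
For example (`checkpoint-500` is a hypothetical subfolder name; use one that actually exists in your `output_dir`):

```python
# Resume from the latest checkpoint found in `output_dir`:
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint:
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```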

In addition, you can easily save your checkpoints on the Model Hub when using `push_to_hub=True`. By default, all
the models saved in intermediate checkpoints are saved in different commits, but not the optimizer state. You can adapt
the `hub_strategy` value of your [`TrainingArguments`] to either:

- `"checkpoint"`: the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to
  resume training easily with `trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")`.
- `"all_checkpoints"`: all checkpoints are pushed like they appear in the output folder (so you will get one
  checkpoint folder per folder in your final repository)
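
As a sketch, assuming you are already authenticated with the Model Hub, the relevant arguments could look like this:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output_dir",
    push_to_hub=True,           # save checkpoints to the Model Hub
    hub_strategy="checkpoint",  # also push the latest checkpoint to `last-checkpoint`
)
```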

## Logging

By default [`Trainer`] will use `logging.INFO` for the main process and `logging.WARNING` for the replicas if any.

These defaults can be overridden to use any of the 5 `logging` levels with [`TrainingArguments`]'s
arguments:

- `log_level` - for the main process
- `log_level_replica` - for the replicas

Further, if [`TrainingArguments`]'s `log_on_each_node` is set to `False`, only the main node will
use the log level settings for its main process; all other nodes will use the log level settings for replicas.

Note that [`Trainer`] is going to set `transformers`'s log level separately for each node in its
[`Trainer.__init__`]. So you may want to set this sooner (see the next example) if you tap into other
`transformers` functionality before creating the [`Trainer`] object.

Here is an example of how this can be used in an application:

```python
[...]
logger = logging.getLogger(__name__)

# Setup logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

# set the main code and the modules it uses to the same log-level according to the node
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)

trainer = Trainer(...)
```

Then, if you only want to see warnings on the main node and want all other nodes to not print any most likely duplicated
warnings, you could run it as:

```bash
my_app.py ... --log_level warning --log_level_replica error
```

In a multi-node environment, if you also don't want the logs to repeat for each node's main process, you will want to
change the above to:

```bash
my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0
```

and then only the main process of the first node will log at the "warning" level, and all other processes on the main
node and all processes on other nodes will log at the "error" level.

If you need your application to be as quiet as possible you could do:

```bash
my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0
```

(add `--log_on_each_node 0` if in a multi-node environment)

## Randomness

When resuming from a checkpoint generated by [`Trainer`] all efforts are made to restore the
_python_, _numpy_ and _pytorch_ RNG states to the same states as they were at the moment of saving that checkpoint,
which should make the "stop and resume" style of training as close as possible to non-stop training.

However, due to various default non-deterministic pytorch settings this might not fully work. If you want full
determinism please refer to [Controlling sources of randomness](https://pytorch.org/docs/stable/notes/randomness). As explained in that document, some of the settings
that make things deterministic (e.g., `torch.backends.cudnn.deterministic`) may slow things down, therefore this
can't be done by default, but you can enable those yourself if needed.
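
If you do want to opt in, a minimal sketch following the PyTorch notes linked above would be:

```python
import torch

# Ask cuDNN to pick deterministic kernels (may slow training down):
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Seed the PyTorch RNG (the Trainer's `seed` argument already seeds python/numpy/pytorch for you):
torch.manual_seed(42)
```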

## Trainer Integrations

The [`Trainer`] has been extended to support libraries that may dramatically improve your training
time and fit much bigger models.

Currently it supports third party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FairScale](https://github.com/facebookresearch/fairscale/), which implement parts of the paper [ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He](https://arxiv.org/abs/1910.02054).

This provided support is new and experimental as of this writing.

<a id='zero-install-notes'></a>

### CUDA Extension Installation Notes

As of this writing, both FairScale and DeepSpeed require compilation of CUDA C++ code before they can be used.

While all installation issues should be dealt with through the corresponding GitHub Issues of [FairScale](https://github.com/facebookresearch/fairscale/issues) and [DeepSpeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.

Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:

```bash
pip install fairscale
pip install deepspeed
```

please read the following notes first.

In these notes we give examples for what to do when `pytorch` has been built with CUDA `10.2`. If your situation is
different, remember to adjust the version number to the one you are after.

#### Possible problem #1

While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.

For example, if you installed `pytorch` with `cudatoolkit==10.2` in the Python environment, you also need to have
CUDA `10.2` installed system-wide.
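
To check which CUDA version your PyTorch build was compiled against, you can run:

```python
import torch

# Prints the CUDA version PyTorch was built with, e.g. "10.2";
# the system-wide CUDA toolkit you use to build the extensions should match it.
print(torch.version.cuda)
```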

The exact location may vary from system to system, but `/usr/local/cuda-10.2` is the most common location on many
Unix systems. When CUDA is correctly set up and added to the `PATH` environment variable, one can find the
installation location by doing:

```bash
which nvcc
```

If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
search engine. For example, if you're on Ubuntu you may want to search for: [ubuntu cuda 10.2 install](https://www.google.com/search?q=ubuntu+cuda+10.2+install).

#### Possible problem #2

Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
may have:

```bash
/usr/local/cuda-10.2
/usr/local/cuda-11.0
```

Now, in this situation you need to make sure that your `PATH` and `LD_LIBRARY_PATH` environment variables contain
the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the
last installed version was. If you encounter the problem where the package build fails because it can't find the right
CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned
environment variables.

First, you may look at their contents:

```bash
echo $PATH
echo $LD_LIBRARY_PATH
```

so you get an idea of what is inside.

It's possible that `LD_LIBRARY_PATH` is empty.

`PATH` lists the locations where executables can be found and `LD_LIBRARY_PATH` is where shared libraries
are looked for. In both cases, earlier entries have priority over the later ones. `:` is used to separate multiple
entries.

Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
doing:

```bash
export PATH=/usr/local/cuda-10.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
```

Note that we aren't overwriting the existing values, but prepending instead.

Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
exist. The `lib64` sub-directory is where the various CUDA `.so` objects, like `libcudart.so`, reside. It's unlikely
that your system will have it named differently, but if it does, adjust it to reflect your reality.

#### Possible problem #3

Some older CUDA versions may refuse to build with newer compilers. For example, you may have `gcc-9` but CUDA wants
`gcc-7`.

There are various ways to go about it.

If you can install the latest CUDA toolkit, it typically should support the newer compiler.

Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
already have it but it's not the default one, so the build system can't see it. If you have `gcc-7` installed but the
build system complains it can't find it, the following might do the trick:

```bash
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
```

Here, we are making a symlink to `gcc-7` from `/usr/local/cuda-10.2/bin/gcc` and since
`/usr/local/cuda-10.2/bin/` should be in the `PATH` environment variable (see the previous problem's solution), it
should find `gcc-7` (and `g++-7`) and then the build will succeed.

As always, make sure to edit the paths in the example to match your situation.

### FairScale

By integrating [FairScale](https://github.com/facebookresearch/fairscale/) the [`Trainer`]
provides support for the following features from [the ZeRO paper](https://arxiv.org/abs/1910.02054):

1. Optimizer State Sharding
2. Gradient Sharding
3. Model Parameters Sharding (new and very experimental)
4. CPU offload (new and very experimental)

You will need at least two GPUs to use this feature.

**Installation**:

Install the library via pypi:

```bash
pip install fairscale
```

or via `transformers`' `extras`:

```bash
pip install transformers[fairscale]
```

(available starting from `transformers==4.6.0`), or find more details on [FairScale's GitHub page](https://github.com/facebookresearch/fairscale/#installation).

If you're still struggling with the build, first make sure to read [CUDA Extension Installation Notes](#zero-install-notes).

If that still didn't resolve the build issue, here are a few more ideas.

`fairscale` seems to have an issue with pip's recently introduced build-isolation feature. If you have a problem
with it, you may want to try one of:

```bash
pip install fairscale --no-build-isolation
```

or:

```bash
git clone https://github.com/facebookresearch/fairscale/
cd fairscale
rm -r dist build
python setup.py bdist_wheel
pip uninstall -y fairscale
pip install dist/fairscale-*.whl
```

`fairscale` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:

```bash
pip uninstall -y fairscale; pip install fairscale --pre \
    -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html \
    --no-cache --no-build-isolation
```

or:

```bash
pip install -v --disable-pip-version-check . \
    -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html --pre
```

Of course, adjust the URLs to match the CUDA version you use.

If after trying everything suggested you still encounter build issues, please proceed with the GitHub Issue of
[FairScale](https://github.com/facebookresearch/fairscale/issues).

**Usage**:

To use the first version of Sharded data-parallelism, add `--sharded_ddp simple` to the command line arguments, and
make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.

For example, here is how you could use it for `run_translation.py` with 2 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--fp16 --sharded_ddp simple
```

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with `--fp16` too, to make things even faster.
- One of the main benefits of enabling `--sharded_ddp simple` is that it uses a lot less GPU memory, so you should be
  able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
  significantly shorter training time.
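
If you set up [`TrainingArguments`] in code rather than on the command line, a sketch of the equivalent configuration (values are illustrative) would be:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output_dir",
    fp16=True,             # same as passing --fp16
    sharded_ddp="simple",  # same as passing --sharded_ddp simple
)
```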

To use the second version of Sharded data-parallelism, add `--sharded_ddp zero_dp_2` or `--sharded_ddp zero_dp_3` to the command line arguments, and make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.

For example, here is how you could use it for `run_translation.py` with 2 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--fp16 --sharded_ddp zero_dp_2
```

`zero_dp_2` is an optimized version of the simple wrapper, while `zero_dp_3` fully shards model weights,
gradients and optimizer states.

Both are compatible with adding `cpu_offload` to enable ZeRO-offload (activate it like this: `--sharded_ddp "zero_dp_2 cpu_offload"`).

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with `--fp16` too, to make things even faster.
- The `cpu_offload` additional option requires `--fp16`.
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
  some bugs you encounter may have been fixed there already.

Known caveats:

- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script.
- Using `--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
  `FullyShardedDataParallel` of fairscale. It should be used with the option `auto_wrap` if you are not
  doing this yourself: `--sharded_ddp "zero_dp_3 auto_wrap"`.

### DeepSpeed

Moved to [Trainer DeepSpeed integration](deepspeed#trainer-deepspeed-integration).

#### Installation

Moved to [Installation](deepspeed#deepspeed-installation).

#### Deployment with multiple GPUs

Moved to [Deployment with multiple GPUs](deepspeed#deepspeed-multi-gpu).

#### Deployment with one GPU

Moved to [Deployment with one GPU](deepspeed#deepspeed-one-gpu).

#### Deployment in Notebooks

Moved to [Deployment in Notebooks](deepspeed#deepspeed-notebook).

#### Configuration

Moved to [Configuration](deepspeed#deepspeed-config).

#### Passing Configuration

Moved to [Passing Configuration](deepspeed#deepspeed-config-passing).

#### Shared Configuration

Moved to [Shared Configuration](deepspeed#deepspeed-config-shared).

#### ZeRO

Moved to [ZeRO](deepspeed#deepspeed-zero).

##### ZeRO-2 Config

Moved to [ZeRO-2 Config](deepspeed#deepspeed-zero2-config).

##### ZeRO-3 Config

Moved to [ZeRO-3 Config](deepspeed#deepspeed-zero3-config).

#### NVMe Support

Moved to [NVMe Support](deepspeed#deepspeed-nvme).

##### ZeRO-2 vs ZeRO-3 Performance

Moved to [ZeRO-2 vs ZeRO-3 Performance](deepspeed#deepspeed-zero2-zero3-performance).

##### ZeRO-2 Example

Moved to [ZeRO-2 Example](deepspeed#deepspeed-zero2-example).

##### ZeRO-3 Example

Moved to [ZeRO-3 Example](deepspeed#deepspeed-zero3-example).

#### Optimizer and Scheduler

##### Optimizer

Moved to [Optimizer](deepspeed#deepspeed-optimizer).

##### Scheduler

Moved to [Scheduler](deepspeed#deepspeed-scheduler).

#### fp32 Precision

Moved to [fp32 Precision](deepspeed#deepspeed-fp32).

#### Automatic Mixed Precision

Moved to [Automatic Mixed Precision](deepspeed#deepspeed-amp).

#### Batch Size

Moved to [Batch Size](deepspeed#deepspeed-bs).

#### Gradient Accumulation

Moved to [Gradient Accumulation](deepspeed#deepspeed-grad-acc).

#### Gradient Clipping

Moved to [Gradient Clipping](deepspeed#deepspeed-grad-clip).

#### Getting The Model Weights Out

Moved to [Getting The Model Weights Out](deepspeed#deepspeed-weight-extraction).

@@ -1,632 +0,0 @@
[deleted: the old reStructuredText version of this page, whose content duplicates the new MDX file above apart from the now-deprecated TFTrainer/TFTrainingArguments sections]

@@ -507,6 +507,8 @@ DEPRECATED_OBJECTS = [
     "xnli_output_modes",
     "xnli_processors",
     "xnli_tasks_num_labels",
+    "TFTrainer",
+    "TFTrainingArguments",
 ]

 # Exceptionally, some objects should not be documented after all rules passed.