From 593c0704351f35208e44dae1d85be8238209eb2a Mon Sep 17 00:00:00 2001 From: LysandreJik Date: Fri, 6 Sep 2019 12:00:12 -0400 Subject: [PATCH] Better examples --- docs/requirements.txt | 1 + docs/source/conf.py | 3 +- docs/source/examples.rst | 686 ---------------------------------- examples/README.md | 338 +++++++++++++++++ examples/run_lm_finetuning.py | 2 +- 5 files changed, 342 insertions(+), 688 deletions(-) delete mode 100644 docs/source/examples.rst create mode 100644 examples/README.md diff --git a/docs/requirements.txt b/docs/requirements.txt index 112beb3f7..0c2a31c09 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -26,3 +26,4 @@ sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.2 sphinxcontrib-serializinghtml==1.1.3 urllib3==1.25.3 +sphinx-markdown-tables==0.0.9 \ No newline at end of file diff --git a/docs/source/conf.py b/docs/source/conf.py index cdca1d82d..c847dee80 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -43,7 +43,8 @@ extensions = [ 'sphinx.ext.coverage', 'sphinx.ext.napoleon', 'recommonmark', - 'sphinx.ext.viewcode' + 'sphinx.ext.viewcode', + 'sphinx_markdown_tables' ] # Add any paths that contain templates here, relative to this directory. diff --git a/docs/source/examples.rst b/docs/source/examples.rst deleted file mode 100644 index d97845143..000000000 --- a/docs/source/examples.rst +++ /dev/null @@ -1,686 +0,0 @@ -examples.rst - -Examples -================================================ - -.. list-table:: - :header-rows: 1 - - * - Sub-section - - Description - * - `Training large models: introduction, tools and examples <#introduction>`_ - - How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models - * - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_ - - Running the examples in `examples `_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py`` - * - `Fine-tuning with OpenAI GPT, Transformer-XL, GPT-2 as well as BERT and RoBERTa <#fine-tuning>`_ - - Running the examples in `examples `_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py``, ``run_gpt2.py`` and ``run_lm_finetuning.py`` - * - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_ - - How to fine tune ``BERT large`` - - -.. _introduction: - -Training large models: introduction, tools and examples -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -BERT-base and BERT-large are respectively 110M and 340M parameters models and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most case a batch size of 32). - -To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts `run_bert_classifier.py `_ and `run_bert_squad.py `_\ : gradient-accumulation, multi-gpu training, distributed training and 16-bits training . For more details on how to use these techniques you can read `the tips on training large batches in PyTorch `_ that I published earlier this year. - -Here is how to use these techniques in our scripts: - - -* **Gradient Accumulation**\ : Gradient accumulation can be used by supplying a integer greater than 1 to the ``--gradient_accumulation_steps`` argument. The batch at each step will be divided by this integer and gradient will be accumulated over ``gradient_accumulation_steps`` steps. 
-* **Multi-GPU**\ : Multi-GPU is automatically activated when several GPUs are detected and the batches are splitted over the GPUs. -* **Distributed training**\ : Distributed training can be activated by supplying an integer greater or equal to 0 to the ``--local_rank`` argument (see below). -* **16-bits training**\ : 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to Mixed precision training can be found `here `__ and a full documentation is `here `__. In our scripts, this option can be activated by setting the ``--fp16`` flag and you can play with loss scaling using the ``--loss_scale`` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero in which case the scale is dynamically adjusted or a positive power of two in which case the scaling is static. - -To use 16-bits training and distributed training, you need to install NVIDIA's apex extension `as detailed here `__. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository `_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository `_. - -Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above mentioned blog post `_\ ) for more details): - -.. code-block:: bash - - python -m torch.distributed.launch \ - --nproc_per_node=4 \ - --nnodes=2 \ - --node_rank=$THIS_MACHINE_INDEX \ - --master_addr="192.168.1.1" \ - --master_port=1234 run_bert_classifier.py \ - (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script) - -Where ``$THIS_MACHINE_INDEX`` is an sequential index assigned to each of your machine (0, 1, 2...) and the machine with rank 0 has an IP address ``192.168.1.1`` and an open port ``1234``. - -.. _fine-tuning-bert-examples: - -Fine-tuning with BERT: running the examples -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -We showcase several fine-tuning examples based on (and extended from) `the original implementation `_\ : - - -* a *sequence-level classifier* on nine different GLUE tasks, -* a *token-level classifier* on the question answering dataset SQuAD, and -* a *sequence-level multiple-choice classifier* on the SWAG classification corpus. -* a *BERT language model* on another target corpus - -GLUE results on dev set -~~~~~~~~~~~~~~~~~~~~~~~ - -We get the following results on the dev set of GLUE benchmark with an uncased BERT base -model (`bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train batch size of 24. Some of -these tasks have a small dataset and training can lead to high variance in the results between different runs. -We report the median on 5 runs (with different seeds) for each of the metrics. - -.. list-table:: - :header-rows: 1 - - * - Task - - Metric - - Result - * - CoLA - - Matthew's corr. - - 55.75 - * - SST-2 - - accuracy - - 92.09 - * - MRPC - - F1/accuracy - - 90.48/86.27 - * - STS-B - - Pearson/Spearman corr. - - 89.03/88.64 - * - QQP - - accuracy/F1 - - 90.92/87.72 - * - MNLI - - matched acc./mismatched acc. 
- - 83.74/84.06 - * - QNLI - - accuracy - - 91.07 - * - RTE - - accuracy - - 68.59 - * - WNLI - - accuracy - - 43.66 - - -Some of these results are significantly different from the ones reported on the test set -of GLUE benchmark on the website. For QQP and WNLI, please refer to `FAQ #12 `_ on the webite. - -Before running anyone of these GLUE tasks you should download the -`GLUE data `_ by running -`this script `_ -and unpack it to some directory ``$GLUE_DIR``. - -.. code-block:: shell - - export GLUE_DIR=/path/to/glue - export TASK_NAME=MRPC - - python run_bert_classifier.py \ - --task_name $TASK_NAME \ - --do_train \ - --do_eval \ - --do_lower_case \ - --data_dir $GLUE_DIR/$TASK_NAME \ - --bert_model bert-base-uncased \ - --max_seq_length 128 \ - --train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/$TASK_NAME/ - -where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI. - -The dev set results will be present within the text file 'eval_results.txt' in the specified output_dir. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'. - -The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well, since the data processor for each task inherits from the base class DataProcessor. - -MRPC -~~~~ - -This example code fine-tunes BERT on the Microsoft Research Paraphrase -Corpus (MRPC) corpus and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed. - -Before running this example you should download the -`GLUE data `_ by running -`this script `_ -and unpack it to some directory ``$GLUE_DIR``. - -.. code-block:: shell - - export GLUE_DIR=/path/to/glue - - python run_bert_classifier.py \ - --task_name MRPC \ - --do_train \ - --do_eval \ - --do_lower_case \ - --data_dir $GLUE_DIR/MRPC/ \ - --bert_model bert-base-uncased \ - --max_seq_length 128 \ - --train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ - -Our test ran on a few seeds with `the original implementation hyper-parameters `__ gave evaluation results between 84% and 88%. - -**Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds!** -First install apex as indicated `here `__. -Then run - -.. code-block:: shell - - export GLUE_DIR=/path/to/glue - - python run_bert_classifier.py \ - --task_name MRPC \ - --do_train \ - --do_eval \ - --do_lower_case \ - --data_dir $GLUE_DIR/MRPC/ \ - --bert_model bert-base-uncased \ - --max_seq_length 128 \ - --train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ \ - --fp16 - -**Distributed training** -Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking model to reach a F1 > 92 on MRPC: - -.. 
code-block:: bash - - python -m torch.distributed.launch \ - --nproc_per_node 8 run_bert_classifier.py \ - --bert_model bert-large-uncased-whole-word-masking \ - --task_name MRPC \ - --do_train \ - --do_eval \ - --do_lower_case \ - --data_dir $GLUE_DIR/MRPC/ \ - --max_seq_length 128 \ - --train_batch_size 8 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ - -Training with these hyper-parameters gave us the following results: - -.. code-block:: bash - - acc = 0.8823529411764706 - acc_and_f1 = 0.901702786377709 - eval_loss = 0.3418912578906332 - f1 = 0.9210526315789473 - global_step = 174 - loss = 0.07231863956341798 - -Here is an example on MNLI: - -.. code-block:: bash - - python -m torch.distributed.launch \ - --nproc_per_node 8 run_bert_classifier.py \ - --bert_model bert-large-uncased-whole-word-masking \ - --task_name mnli \ - --do_train \ - --do_eval \ - --do_lower_case \ - --data_dir /datadrive/bert_data/glue_data//MNLI/ \ - --max_seq_length 128 \ - --train_batch_size 8 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir ../models/wwm-uncased-finetuned-mnli/ \ - --overwrite_output_dir - -.. code-block:: bash - - ***** Eval results ***** - acc = 0.8679706601466992 - eval_loss = 0.4911287787382479 - global_step = 18408 - loss = 0.04755385363816904 - - ***** Eval results ***** - acc = 0.8747965825874695 - eval_loss = 0.45516540421714036 - global_step = 18408 - loss = 0.04755385363816904 - -This is the example of the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model - -SQuAD -~~~~~ - -This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB. - -The data for SQuAD can be downloaded with the following links and should be saved in a ``$SQUAD_DIR`` directory. - - -* `train-v1.1.json `_ -* `dev-v1.1.json `_ -* `evaluate-v1.1.py `_ - -.. code-block:: shell - - export SQUAD_DIR=/path/to/SQUAD - - python run_bert_squad.py \ - --bert_model bert-base-uncased \ - --do_train \ - --do_predict \ - --do_lower_case \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 2.0 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ - -Training with the previous hyper-parameters gave us the following results: - -.. code-block:: bash - - python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/debug_squad/predictions.json - {"f1": 88.52381567990474, "exact_match": 81.22043519394512} - -**distributed training** - -Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD: - -.. code-block:: bash - - python -m torch.distributed.launch --nproc_per_node=8 \ - run_bert_squad.py \ - --bert_model bert-large-uncased-whole-word-masking \ - --do_train \ - --do_predict \ - --do_lower_case \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir ../models/wwm_uncased_finetuned_squad/ \ - --train_batch_size 24 \ - --gradient_accumulation_steps 12 - -Training with these hyper-parameters gave us the following results: - -.. 
code-block:: bash - - python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json - {"exact_match": 86.91579943235573, "f1": 93.1532499015869} - -This is the model provided as ``bert-large-uncased-whole-word-masking-finetuned-squad``. - -And here is the model provided as ``bert-large-cased-whole-word-masking-finetuned-squad``\ : - -.. code-block:: bash - - python -m torch.distributed.launch --nproc_per_node=8 run_bert_squad.py \ - --bert_model bert-large-cased-whole-word-masking \ - --do_train \ - --do_predict \ - --do_lower_case \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir ../models/wwm_cased_finetuned_squad/ \ - --train_batch_size 24 \ - --gradient_accumulation_steps 12 - -Training with these hyper-parameters gave us the following results: - -.. code-block:: bash - - python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json - {"exact_match": 84.18164616840113, "f1": 91.58645594850135} - -SWAG -~~~~ - -The data for SWAG can be downloaded by cloning the following `repository `_ - -.. code-block:: shell - - export SWAG_DIR=/path/to/SWAG - - python run_bert_swag.py \ - --bert_model bert-base-uncased \ - --do_train \ - --do_lower_case \ - --do_eval \ - --data_dir $SWAG_DIR/data \ - --train_batch_size 16 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --max_seq_length 80 \ - --output_dir /tmp/swag_output/ \ - --gradient_accumulation_steps 4 - -Training with the previous hyper-parameters on a single GPU gave us the following results: - -.. code-block:: - - eval_accuracy = 0.8062081375587323 - eval_loss = 0.5966546792367169 - global_step = 13788 - loss = 0.06423990014260186 - -LM Fine-tuning -~~~~~~~~~~~~~~ - -The data should be a text file in the same format as `sample_text.txt <./pytorch_transformers/tests/fixtures/sample_text.txt/sample_text.txt>`_ (one sentence per line, docs separated by empty line). -You can download an `exemplary training corpus `_ generated from wikipedia articles and split into ~500k sentences with spaCy. -Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``\ : - -Thank to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the `README `_ of the `examples/lm_finetuning/ `_ folder. - -.. _fine-tuning: - -OpenAI GPT, Transformer-XL and GPT-2: running the examples -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -We provide three examples of scripts for OpenAI GPT, Transformer-XL, OpenAI GPT-2, BERT and RoBERTa based on (and extended from) the respective original implementations: - - -* fine-tuning OpenAI GPT on the ROCStories dataset -* evaluating Transformer-XL on Wikitext 103 -* unconditional and conditional generation from a pre-trained OpenAI GPT-2 model -* fine-tuning GPT/GPT-2 on a causal language modeling task and BERT/RoBERTa on a masked language modeling task - -Fine-tuning OpenAI GPT on the RocStories dataset -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This example code fine-tunes OpenAI GPT on the RocStories dataset. 
- -Before running this example you should download the -`RocStories dataset `_ and unpack it to some directory ``$ROC_STORIES_DIR``. - -.. code-block:: shell - - export ROC_STORIES_DIR=/path/to/RocStories - - python run_openai_gpt.py \ - --model_name openai-gpt \ - --do_train \ - --do_eval \ - --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \ - --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \ - --output_dir ../log \ - --train_batch_size 16 \ - -This command runs in about 10 min on a single K-80 an gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%). - -Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This example code evaluate the pre-trained Transformer-XL on the WikiText 103 dataset. -This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed. - -.. code-block:: shell - - python run_transfo_xl.py --work_dir ../log - -This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code). - -Unconditional and conditional generation from OpenAI's GPT-2 model -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This example code is identical to the original unconditional and conditional generation codes. - -Conditional generation: - -.. code-block:: shell - - python run_gpt2.py - -Unconditional generation: - -.. code-block:: shell - - python run_gpt2.py --unconditional - -The same option as in the original scripts are provided, please refer to the code of the example and the original repository of OpenAI. - - -Causal LM fine-tuning on GPT/GPT-2, Masked LM fine-tuning on BERT/RoBERTa -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Before running the following examples you should download the `WikiText-2 dataset `__ and unpack it to some directory `$WIKITEXT_2_DATASET` -The following results were obtained using the `raw` WikiText-2 (no tokens were replaced before the tokenization). - -This example fine-tunes GPT-2 on the WikiText-2 dataset. The loss function is a causal language modeling loss (perplexity). - -.. code-block:: bash - - - export WIKITEXT_2_DATASET=/path/to/wikitext_dataset - - python run_lm_finetuning.py - --output_dir=output - --model_type=gpt2 - --model_name_or_path=gpt2 - --do_train - --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw - --do_eval - --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw - -This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. -It reaches a score of about 20 perplexity once fine-tuned on the dataset. - -This example fine-tunes RoBERTa on the WikiText-2 dataset. The loss function is a masked language modeling loss (masked perplexity). -The `--mlm` flag is necessary to fine-tune BERT/RoBERTa on masked language modeling. - -.. code-block:: bash - - - export WIKITEXT_2_DATASET=/path/to/wikitext_dataset - - python run_lm_finetuning.py - --output_dir=output - --model_type=roberta - --model_name_or_path=roberta-base - --do_train - --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw - --do_eval - --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw - --mlm - -.. 
_fine-tuning-BERT-large: - -Fine-tuning BERT-large on GPUs ------------------------------- - -The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation. - -For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 k-80 (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher): - -.. code-block:: bash - - {"exact_match": 84.56953642384106, "f1": 91.04028647786927} - -To get these results we used a combination of: - - -* multi-GPU training (automatically activated on a multi-GPU server), -* 2 steps of gradient accumulation and -* perform the optimization step on CPU to store Adam's averages in RAM. - -Here is the full list of hyper-parameters for this run: - -.. code-block:: bash - - export SQUAD_DIR=/path/to/SQUAD - - python ./run_bert_squad.py \ - --bert_model bert-large-uncased \ - --do_train \ - --do_predict \ - --do_lower_case \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ \ - --train_batch_size 24 \ - --gradient_accumulation_steps 2 - -If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16). - -Here is an example of hyper-parameters for a FP16 run we tried: - -.. code-block:: bash - - export SQUAD_DIR=/path/to/SQUAD - - python ./run_bert_squad.py \ - --bert_model bert-large-uncased \ - --do_train \ - --do_predict \ - --do_lower_case \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ \ - --train_batch_size 24 \ - --fp16 \ - --loss_scale 128 - -The results were similar to the above FP32 results (actually slightly higher): - -.. code-block:: bash - - {"exact_match": 84.65468306527909, "f1": 91.238669287002} - -Here is an example with the recent ``bert-large-uncased-whole-word-masking``\ : - -.. code-block:: bash - - python -m torch.distributed.launch --nproc_per_node=8 \ - run_bert_squad.py \ - --bert_model bert-large-uncased-whole-word-masking \ - --do_train \ - --do_predict \ - --do_lower_case \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ \ - --train_batch_size 24 \ - --gradient_accumulation_steps 2 - -Fine-tuning XLNet ------------------ - -STS-B -~~~~~ - -This example code fine-tunes XLNet on the STS-B corpus. - -Before running this example you should download the -`GLUE data `_ by running -`this script `_ -and unpack it to some directory ``$GLUE_DIR``. - -.. code-block:: shell - - export GLUE_DIR=/path/to/glue - - python run_xlnet_classifier.py \ - --task_name STS-B \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/STS-B/ \ - --max_seq_length 128 \ - --train_batch_size 8 \ - --gradient_accumulation_steps 1 \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ - -Our test ran on a few seeds with `the original implementation hyper-parameters `__ gave evaluation results between 84% and 88%. - -**Distributed training** -Here is an example using distributed training on 8 V100 GPUs to reach XXXX: - -.. 
code-block:: bash - - python -m torch.distributed.launch --nproc_per_node 8 \ - run_xlnet_classifier.py \ - --task_name STS-B \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/STS-B/ \ - --max_seq_length 128 \ - --train_batch_size 8 \ - --gradient_accumulation_steps 1 \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ - -Training with these hyper-parameters gave us the following results: - -.. code-block:: bash - - acc = 0.8823529411764706 - acc_and_f1 = 0.901702786377709 - eval_loss = 0.3418912578906332 - f1 = 0.9210526315789473 - global_step = 174 - loss = 0.07231863956341798 - -Here is an example on MNLI: - -.. code-block:: bash - - python -m torch.distributed.launch --nproc_per_node 8 run_bert_classifier.py \ - --bert_model bert-large-uncased-whole-word-masking \ - --task_name mnli \ - --do_train \ - --do_eval \ - --data_dir /datadrive/bert_data/glue_data//MNLI/ \ - --max_seq_length 128 \ - --train_batch_size 8 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir ../models/wwm-uncased-finetuned-mnli/ \ - --overwrite_output_dir - -.. code-block:: bash - - ***** Eval results ***** - acc = 0.8679706601466992 - eval_loss = 0.4911287787382479 - global_step = 18408 - loss = 0.04755385363816904 - - ***** Eval results ***** - acc = 0.8747965825874695 - eval_loss = 0.45516540421714036 - global_step = 18408 - loss = 0.04755385363816904 - -This is the example of the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model. diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 000000000..46ff9270d --- /dev/null +++ b/examples/README.md @@ -0,0 +1,338 @@ +# Examples + +In this section a few examples are put together. All of these examples work for several models, making use of the very +similar API between the different models. + +## Language model fine-tuning + +Based on the script `run_lm_finetuning.py`. + +Fine-tuning the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT +to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa +are fine-tuned using a masked language modeling (MLM) loss. + +Before running the following example, you should get a file that contains text on which the language model will be +fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/). + +We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains +text that will be used for evaluation. + +### GPT-2/GPT and causal language modeling + +The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before +the tokenization). The loss here is that of causal language modeling. + +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_lm_finetuning.py \ + --output_dir=output \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE +``` + +This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches +a score of ~20 perplexity once fine-tuned on the dataset. + +### RoBERTa/BERT and masked language modeling + +The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. 
The loss is different
+as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
+pre-training: masked language modeling.
+
+In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may therefore
+converge more slowly, but over-fitting takes more epochs.
+
+We pass the `--mlm` flag so that the script uses the masked language modeling loss.
+
+```bash
+export TRAIN_FILE=/path/to/dataset/wiki.train.raw
+export TEST_FILE=/path/to/dataset/wiki.test.raw
+
+python run_lm_finetuning.py \
+    --output_dir=output \
+    --model_type=roberta \
+    --model_name_or_path=roberta-base \
+    --do_train \
+    --train_data_file=$TRAIN_FILE \
+    --do_eval \
+    --eval_data_file=$TEST_FILE \
+    --mlm
+```
+
+## Language generation
+
+Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.
+A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
+can try out the different models available in the library.
+
+Example usage:
+
+```bash
+python run_generation.py \
+    --model_type=gpt2 \
+    --model_name_or_path=gpt2
+```
+
+## GLUE
+
+Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
+Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
+
+GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
+uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train
+batch size of 24. Some of these tasks have a small dataset and training can lead to high variance in the results
+between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
+
+| Task  | Metric                       | Result      |
+|-------|------------------------------|-------------|
+| CoLA  | Matthew's corr               | 55.75       |
+| SST-2 | Accuracy                     | 92.09       |
+| MRPC  | F1/Accuracy                  | 90.48/86.27 |
+| STS-B | Pearson/Spearman corr.       | 89.03/88.64 |
+| QQP   | Accuracy/F1                  | 90.92/87.72 |
+| MNLI  | Matched acc./Mismatched acc. | 83.74/84.06 |
+| QNLI  | Accuracy                     | 91.07       |
+| RTE   | Accuracy                     | 68.59       |
+| WNLI  | Accuracy                     | 43.66       |
+
+Some of these results are significantly different from the ones reported on the test set
+of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
+
+Before running any of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```bash
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MRPC
+
+python run_bert_classifier.py \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --bert_model bert-base-uncased \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
+
+The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`.
+In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
+output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
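+
+To run several of these tasks back to back, the command above can simply be wrapped in a shell loop. Below is a minimal
+sketch (the task list matches the one above and the output directories are illustrative; note that the table above
+reports medians over 5 seeds, so a single run per task will not reproduce it exactly):
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+for TASK_NAME in CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI; do
+  python run_bert_classifier.py \
+    --task_name $TASK_NAME \
+    --do_train \
+    --do_eval \
+    --do_lower_case \
+    --data_dir $GLUE_DIR/$TASK_NAME \
+    --bert_model bert-base-uncased \
+    --max_seq_length 128 \
+    --train_batch_size 32 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir /tmp/$TASK_NAME/
+done
+```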
+
+The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
+CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
+said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
+since the data processor for each task inherits from the base class DataProcessor.
+
+### MRPC
+
+#### Fine-tuning example
+
+The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
+than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
+
+Before running any of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python run_bert_classifier.py \
+  --task_name MRPC \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/MRPC/ \
+  --bert_model bert-base-uncased \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/mrpc_output/
+```
+
+Our tests ran on a few seeds with
+[the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks)
+and gave evaluation results between 84% and 88%.
+
+#### Using Apex and mixed-precision
+
+Using Apex and 16-bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
+[apex](https://github.com/NVIDIA/apex), then run the following example:
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python run_bert_classifier.py \
+  --task_name MRPC \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/MRPC/ \
+  --bert_model bert-base-uncased \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/mrpc_output/ \
+  --fp16
+```
+
+#### Distributed training
+
+Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model and
+it reaches an F1 > 92 on MRPC.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python -m torch.distributed.launch \
+    --nproc_per_node 8 run_bert_classifier.py \
+    --bert_model bert-large-uncased-whole-word-masking \
+    --task_name MRPC \
+    --do_train \
+    --do_eval \
+    --do_lower_case \
+    --data_dir $GLUE_DIR/MRPC/ \
+    --max_seq_length 128 \
+    --train_batch_size 8 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir /tmp/mrpc_output/
+```
+
+Training with these hyper-parameters gave us the following results:
+
+```bash
+acc = 0.8823529411764706
+acc_and_f1 = 0.901702786377709
+eval_loss = 0.3418912578906332
+f1 = 0.9210526315789473
+global_step = 174
+loss = 0.07231863956341798
+```
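+
+The same launcher can also span several machines. Below is a sketch adapted from the documentation this README replaces
+(see `docs/source/examples.rst` in this patch): one copy of the command is run on each node, `$THIS_MACHINE_INDEX` is
+the rank (0, 1, ...) of the node it runs on, and the master address/port are placeholders for the rank-0 machine that
+you should adapt to your own cluster:
+
+```bash
+python -m torch.distributed.launch \
+    --nproc_per_node=8 \
+    --nnodes=2 \
+    --node_rank=$THIS_MACHINE_INDEX \
+    --master_addr="192.168.1.1" \
+    --master_port=1234 \
+    run_bert_classifier.py \
+    --bert_model bert-large-uncased-whole-word-masking \
+    --task_name MRPC \
+    --do_train \
+    --do_eval \
+    --do_lower_case \
+    --data_dir $GLUE_DIR/MRPC/ \
+    --max_seq_length 128 \
+    --train_batch_size 8 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir /tmp/mrpc_output/
+```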
+
+### MNLI
+
+The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python -m torch.distributed.launch \
+    --nproc_per_node 8 run_bert_classifier.py \
+    --bert_model bert-large-uncased-whole-word-masking \
+    --task_name mnli \
+    --do_train \
+    --do_eval \
+    --do_lower_case \
+    --data_dir $GLUE_DIR/MNLI/ \
+    --max_seq_length 128 \
+    --train_batch_size 8 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir output_dir
+```
+
+The results are the following:
+
+```bash
+***** Eval results *****
+  acc = 0.8679706601466992
+  eval_loss = 0.4911287787382479
+  global_step = 18408
+  loss = 0.04755385363816904
+
+***** Eval results *****
+  acc = 0.8747965825874695
+  eval_loss = 0.45516540421714036
+  global_step = 18408
+  loss = 0.04755385363816904
+```
+
+## SQuAD
+
+### Fine-tuning on SQuAD
+
+This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
+on a single Tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
+`$SQUAD_DIR` directory.
+
+* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
+* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
+* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
+
+```bash
+export SQUAD_DIR=/path/to/SQUAD
+
+python run_bert_squad.py \
+  --bert_model bert-base-uncased \
+  --do_train \
+  --do_predict \
+  --do_lower_case \
+  --train_file $SQUAD_DIR/train-v1.1.json \
+  --predict_file $SQUAD_DIR/dev-v1.1.json \
+  --train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir /tmp/debug_squad/
+```
+
+Training with the previously defined hyper-parameters yields the following results:
+
+```bash
+f1 = 88.52
+exact_match = 81.22
+```
+
+### Distributed training
+
+Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model to reach
+an F1 > 93 on SQuAD:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 \
+    run_bert_squad.py \
+    --bert_model bert-large-uncased-whole-word-masking \
+    --do_train \
+    --do_predict \
+    --do_lower_case \
+    --train_file $SQUAD_DIR/train-v1.1.json \
+    --predict_file $SQUAD_DIR/dev-v1.1.json \
+    --learning_rate 3e-5 \
+    --num_train_epochs 2 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ../models/wwm_uncased_finetuned_squad/ \
+    --train_batch_size 24 \
+    --gradient_accumulation_steps 12
+```
+
+Training with the previously defined hyper-parameters yields the following results:
+
+```bash
+f1 = 93.15
+exact_match = 86.91
+```
+
+This fine-tuned model is available as a checkpoint under the reference
+`bert-large-uncased-whole-word-masking-finetuned-squad`.
+
diff --git a/examples/run_lm_finetuning.py b/examples/run_lm_finetuning.py
index d37f7a443..a1995ae22 100644
--- a/examples/run_lm_finetuning.py
+++ b/examples/run_lm_finetuning.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Fine-tuning the library models for language modeling on WikiText-2 (GPT, GPT-2, BERT, RoBERTa).
+Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
 GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are
 fine-tuned using a masked language modeling (MLM) loss.
 """