From 64fd98637629f371e3692a08017eb79458755cae Mon Sep 17 00:00:00 2001
From: LysandreJik
Date: Fri, 5 Jul 2019 17:44:59 -0400
Subject: [PATCH] Tokenizers and Config classes are referenced.

---
 docs/source/model_doc/bert.rst          | 54 +++++--------------------
 docs/source/model_doc/gpt.rst           | 36 +++++-------------
 docs/source/model_doc/gpt2.rst          | 29 ++++----------
 docs/source/model_doc/transformerxl.rst | 14 ++++---
 docs/source/model_doc/xlm.rst           |  3 ++
 docs/source/model_doc/xlnet.rst         |  2 +
 6 files changed, 43 insertions(+), 95 deletions(-)

diff --git a/docs/source/model_doc/bert.rst b/docs/source/model_doc/bert.rst
index 018f3e396..7dc669af7 100644
--- a/docs/source/model_doc/bert.rst
+++ b/docs/source/model_doc/bert.rst
@@ -1,57 +1,25 @@
 BERT
 ----------------------------------------------------
+``BertConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_pretrained_bert.BertConfig
+    :members:
+
+
 ``BertTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~
-``BertTokenizer`` perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
-
-This class has five arguments:
-
-
-* ``vocab_file``\ : path to a vocabulary file.
-* ``do_lower_case``\ : convert text to lower-case while tokenizing. **Default = True**.
-* ``max_len``\ : max length to filter the input of the Transformer. Default to pre-trained value for the model if ``None``. **Default = None**
-* ``do_basic_tokenize``\ : Do basic tokenization before wordpice tokenization. Set to false if text is pre-tokenized. **Default = True**.
-* ``never_split``\ : a list of tokens that should not be splitted during tokenization. **Default = ``["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]``\ **
-
-and three methods:
-
-
-* ``tokenize(text)``\ : convert a ``str`` in a list of ``str`` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
-* ``convert_tokens_to_ids(tokens)``\ : convert a list of ``str`` tokens in a list of ``int`` indices in the vocabulary.
-* ``convert_ids_to_tokens(tokens)``\ : convert a list of ``int`` indices in a list of ``str`` tokens in the vocabulary.
-* `save_vocabulary(directory_path)`: save the vocabulary file to `directory_path`. Return the path to the saved vocabulary file: ``vocab_file_path``. The vocabulary can be reloaded with ``BertTokenizer.from_pretrained('vocab_file_path')`` or ``BertTokenizer.from_pretrained('directory_path')``.
-
-Please refer to the doc strings and code in `\ ``tokenization.py`` <./pytorch_pretrained_bert/tokenization.py>`_ for the details of the ``BasicTokenizer`` and ``WordpieceTokenizer`` classes. In general it is recommended to use ``BertTokenizer`` unless you know what you are doing.
+.. autoclass:: pytorch_pretrained_bert.BertTokenizer
+    :members:

 ``BertAdam``
 ~~~~~~~~~~~~~~~~
-``BertAdam`` is a ``torch.optimizer`` adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with PyTorch Adam optimizer are the following:
-
-* BertAdam implements weight decay fix,
-* BertAdam doesn't compensate for bias as in the regular Adam optimizer.
-
-The optimizer accepts the following arguments:
-
-
-* ``lr`` : learning rate
-* ``warmup`` : portion of ``t_total`` for the warmup, ``-1`` means no warmup. Default : ``-1``
-* ``t_total`` : total number of training steps for the learning
-  rate schedule, ``-1`` means constant learning rate. Default : ``-1``
-* ``schedule`` : schedule to use for the warmup (see above).
-  Can be ``'warmup_linear'``\ , ``'warmup_constant'``\ , ``'warmup_cosine'``\ , ``'none'``\ , ``None`` or a ``_LRSchedule`` object (see below).
-  If ``None`` or ``'none'``\ , learning rate is always kept constant.
-  Default : ``'warmup_linear'``
-* ``b1`` : Adams b1. Default : ``0.9``
-* ``b2`` : Adams b2. Default : ``0.999``
-* ``e`` : Adams epsilon. Default : ``1e-6``
-* ``weight_decay:`` Weight decay. Default : ``0.01``
-* ``max_grad_norm`` : Maximum norm for the gradients (\ ``-1`` means no clipping). Default : ``1.0``
-
+.. autoclass:: pytorch_pretrained_bert.BertAdam
+    :members:

 1. ``BertModel``
 ~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/source/model_doc/gpt.rst b/docs/source/model_doc/gpt.rst
index 59e84a342..3db40719b 100644
--- a/docs/source/model_doc/gpt.rst
+++ b/docs/source/model_doc/gpt.rst
@@ -1,41 +1,25 @@
 OpenAI GPT
 ----------------------------------------------------
+``OpenAIGPTConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_pretrained_bert.OpenAIGPTConfig
+    :members:
+
 ``OpenAIGPTTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-``OpenAIGPTTokenizer`` perform Byte-Pair-Encoding (BPE) tokenization.
-
-This class has four arguments:
-
-
-* ``vocab_file``\ : path to a vocabulary file.
-* ``merges_file``\ : path to a file containing the BPE merges.
-* ``max_len``\ : max length to filter the input of the Transformer. Default to pre-trained value for the model if ``None``. **Default = None**
-* ``special_tokens``\ : a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's ``BasicTokenizer`` is used as the pre-BPE tokenizer, these tokens are not split. **Default= None**
-
-and five methods:
-
-
-* ``tokenize(text)``\ : convert a ``str`` in a list of ``str`` tokens by performing BPE tokenization.
-* ``convert_tokens_to_ids(tokens)``\ : convert a list of ``str`` tokens in a list of ``int`` indices in the vocabulary.
-* ``convert_ids_to_tokens(tokens)``\ : convert a list of ``int`` indices in a list of ``str`` tokens in the vocabulary.
-* ``set_special_tokens(self, special_tokens)``\ : update the list of special tokens (see above arguments)
-* ``encode(text)``\ : convert a ``str`` in a list of ``int`` tokens by performing BPE encoding.
-* `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices in a string and do some post-processing if needed: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
-* `save_vocabulary(directory_path)`: save the vocabulary, merge and special tokens files to `directory_path`. Return the path to the three files: ``vocab_file_path``\ , ``merge_file_path``\ , ``special_tokens_file_path``. The vocabulary can be reloaded with ``OpenAIGPTTokenizer.from_pretrained('directory_path')``.
-
-Please refer to the doc strings and code in `\ ``tokenization_openai.py`` <./pytorch_pretrained_bert/tokenization_openai.py>`_ for the details of the ``OpenAIGPTTokenizer``.
+.. autoclass:: pytorch_pretrained_bert.OpenAIGPTTokenizer
+    :members:

 ``OpenAIAdam``
 ~~~~~~~~~~~~~~~~~~
-``OpenAIAdam`` is similar to ``BertAdam``.
-The differences with ``BertAdam`` is that ``OpenAIAdam`` compensate for bias as in the regular Adam optimizer.
-
-``OpenAIAdam`` accepts the same arguments as ``BertAdam``.
+.. autoclass:: pytorch_pretrained_bert.OpenAIAdam
+    :members:

 9.
 ``OpenAIGPTModel``
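
The argument lists removed above still describe how these classes are driven.
As a minimal sketch of that usage (untested here; the ``bert-base-uncased``
shortcut name and the sample sentence are illustrative, and the weights must be
downloadable or already cached):

.. code-block:: python

    from pytorch_pretrained_bert import BertAdam, BertModel, BertTokenizer

    # End-to-end tokenization: basic tokenization followed by WordPiece.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    tokens = tokenizer.tokenize("Hello, how are you?")   # list of str
    ids = tokenizer.convert_tokens_to_ids(tokens)        # list of int
    assert tokenizer.convert_ids_to_tokens(ids) == tokens

    # BertAdam: Adam with the weight-decay fix and, unlike regular Adam,
    # no bias compensation. 'warmup_linear' is the default schedule.
    model = BertModel.from_pretrained('bert-base-uncased')
    optimizer = BertAdam(model.parameters(),
                         lr=2e-5,
                         warmup=0.1,   # warm up over the first 10% of t_total
                         t_total=1000)

``OpenAIAdam`` takes the same arguments; the only difference is that it keeps
the regular Adam bias compensation.
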
diff --git a/docs/source/model_doc/gpt2.rst b/docs/source/model_doc/gpt2.rst
index bfcf26acb..ca232ca87 100644
--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -1,31 +1,18 @@
 OpenAI GPT2
 ----------------------------------------------------
+``GPT2Config``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_pretrained_bert.GPT2Config
+    :members:
+
 ``GPT2Tokenizer``
 ~~~~~~~~~~~~~~~~~~~~~
-``GPT2Tokenizer`` perform byte-level Byte-Pair-Encoding (BPE) tokenization.
-
-This class has three arguments:
-
-
-* ``vocab_file``\ : path to a vocabulary file.
-* ``merges_file``\ : path to a file containing the BPE merges.
-* ``errors``\ : How to handle unicode decoding errors. **Default = ``replace``\ **
-
-and two methods:
-
-
-* ``tokenize(text)``\ : convert a ``str`` in a list of ``str`` tokens by performing byte-level BPE.
-* ``convert_tokens_to_ids(tokens)``\ : convert a list of ``str`` tokens in a list of ``int`` indices in the vocabulary.
-* ``convert_ids_to_tokens(tokens)``\ : convert a list of ``int`` indices in a list of ``str`` tokens in the vocabulary.
-* ``set_special_tokens(self, special_tokens)``\ : update the list of special tokens (see above arguments)
-* ``encode(text)``\ : convert a ``str`` in a list of ``int`` tokens by performing byte-level BPE.
-* ``decode(tokens)``\ : convert back a list of ``int`` tokens in a ``str``.
-* `save_vocabulary(directory_path)`: save the vocabulary, merge and special tokens files to `directory_path`. Return the path to the three files: ``vocab_file_path``\ , ``merge_file_path``\ , ``special_tokens_file_path``. The vocabulary can be reloaded with ``OpenAIGPTTokenizer.from_pretrained('directory_path')``.
-
-Please refer to `\ ``tokenization_gpt2.py`` <./pytorch_pretrained_bert/tokenization_gpt2.py>`_ for more details on the ``GPT2Tokenizer``.
+.. autoclass:: pytorch_pretrained_bert.GPT2Tokenizer
+    :members:

 14. ``GPT2Model``
diff --git a/docs/source/model_doc/transformerxl.rst b/docs/source/model_doc/transformerxl.rst
index c84693b38..2d2c38b25 100644
--- a/docs/source/model_doc/transformerxl.rst
+++ b/docs/source/model_doc/transformerxl.rst
@@ -2,14 +2,18 @@
 Transformer XL
 ----------------------------------------------------
+``TransfoXLConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_pretrained_bert.TransfoXLConfig
+    :members:
+
+
 ``TransfoXLTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-``TransfoXLTokenizer`` perform word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by toekn frequency (for adaptive softmax). See the adaptive softmax paper (\ `Efficient softmax approximation for GPUs <https://arxiv.org/abs/1609.04309>`_\ ) for more details.
-
-The API is similar to the API of ``BertTokenizer`` (see above).
-
-Please refer to the doc strings and code in `\ ``tokenization_transfo_xl.py`` <./pytorch_pretrained_bert/tokenization_transfo_xl.py>`_ for the details of these additional methods in ``TransfoXLTokenizer``.
+.. autoclass:: pytorch_pretrained_bert.TransfoXLTokenizer
+    :members:

 12.
 ``TransfoXLModel``
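
Unlike the WordPiece and word-level tokenizers above, ``GPT2Tokenizer`` works
at the byte level, so its ``encode``/``decode`` pair round-trips arbitrary
text. A short sketch (the ``gpt2`` shortcut name is illustrative, and the
vocabulary and merges files must be downloadable or already cached):

.. code-block:: python

    from pytorch_pretrained_bert import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # encode: str -> list of int, via byte-level BPE.
    ids = tokenizer.encode("Byte-level BPE needs no <unk> token.")

    # decode: list of int -> str.
    assert tokenizer.decode(ids) == "Byte-level BPE needs no <unk> token."

The ``errors`` argument (default ``replace``) only comes into play when a
decoded id sequence splits a multi-byte character, for example after
truncation.
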
diff --git a/docs/source/model_doc/xlm.rst b/docs/source/model_doc/xlm.rst
index 70b5fa3b4..086bf8782 100644
--- a/docs/source/model_doc/xlm.rst
+++ b/docs/source/model_doc/xlm.rst
@@ -1,2 +1,5 @@
 XLM
 ----------------------------------------------------
+
+
+I don't really know what to put here, I'll leave it up to you to decide @Thom
\ No newline at end of file
diff --git a/docs/source/model_doc/xlnet.rst b/docs/source/model_doc/xlnet.rst
index d2fd996cb..8138d1bcd 100644
--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@@ -1,2 +1,4 @@
 XLNet
 ----------------------------------------------------
+
+I don't really know what to put here, I'll leave it up to you to decide @Thom
\ No newline at end of file
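
Per the ``save_vocabulary`` descriptions removed from ``bert.rst`` and
``gpt.rst``, a fine-tuned vocabulary can also be written to disk and reloaded
through ``from_pretrained``. A sketch (the ``./my_vocab`` directory is
hypothetical):

.. code-block:: python

    import os

    from pytorch_pretrained_bert import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # For BERT this writes a single vocab file and returns its path; the
    # GPT/GPT-2 tokenizers return three paths (vocab, merges, special tokens).
    save_dir = './my_vocab'
    os.makedirs(save_dir, exist_ok=True)
    vocab_file_path = tokenizer.save_vocabulary(save_dir)

    # Reload from the directory or from the saved file itself.
    reloaded = BertTokenizer.from_pretrained(save_dir)
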