Tokenizers and Config classes are referenced.
commit 64fd986376, parent df759114c9
6 changed files with 43 additions and 95 deletions

@@ -1,57 +1,25 @@

BERT
----------------------------------------------------

``BertConfig``
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_pretrained_bert.BertConfig
   :members:

``BertTokenizer``
~~~~~~~~~~~~~~~~~~~~~

``BertTokenizer`` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

This class has five arguments:

* ``vocab_file``\ : path to a vocabulary file.
* ``do_lower_case``\ : convert text to lower-case while tokenizing. **Default = True**.
* ``max_len``\ : max length to filter the input of the Transformer. Defaults to the pre-trained value for the model if ``None``. **Default = None**
* ``do_basic_tokenize``\ : do basic tokenization before WordPiece tokenization. Set to ``False`` if the text is pre-tokenized. **Default = True**.
* ``never_split``\ : a list of tokens that should not be split during tokenization. **Default = ``["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]``\ **

and four methods:

* ``tokenize(text)``\ : convert a ``str`` into a list of ``str`` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
* ``convert_tokens_to_ids(tokens)``\ : convert a list of ``str`` tokens into a list of ``int`` indices in the vocabulary.
* ``convert_ids_to_tokens(ids)``\ : convert a list of ``int`` indices into a list of ``str`` tokens from the vocabulary.
* ``save_vocabulary(directory_path)``\ : save the vocabulary file to ``directory_path``. Returns the path to the saved vocabulary file: ``vocab_file_path``. The vocabulary can be reloaded with ``BertTokenizer.from_pretrained('vocab_file_path')`` or ``BertTokenizer.from_pretrained('directory_path')``.

Please refer to the doc strings and code in `\ ``tokenization.py`` <./pytorch_pretrained_bert/tokenization.py>`_ for the details of the ``BasicTokenizer`` and ``WordpieceTokenizer`` classes. In general it is recommended to use ``BertTokenizer`` unless you know what you are doing.

.. autoclass:: pytorch_pretrained_bert.BertTokenizer
   :members:
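
A minimal usage sketch (the ``bert-base-uncased`` shortcut name and the example sentence are illustrative; ``from_pretrained`` downloads and caches the vocabulary on first use):

.. code-block:: python

    from pytorch_pretrained_bert import BertTokenizer

    # 'bert-base-uncased' is an illustrative shortcut name; any BERT vocabulary works
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

    # Basic tokenization followed by WordPiece, then mapping to vocabulary indices
    tokens = tokenizer.tokenize("Who was Jim Henson?")
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # The indices map back to the same WordPiece tokens
    assert tokenizer.convert_ids_to_tokens(ids) == tokens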

``BertAdam``
~~~~~~~~~~~~~~~~

``BertAdam`` is a ``torch.optim`` optimizer adapted to be closer to the optimizer used in the TensorFlow implementation of BERT. The differences with the regular PyTorch Adam optimizer are the following:

* BertAdam implements a weight decay fix,
* BertAdam doesn't compensate for bias as the regular Adam optimizer does.

The optimizer accepts the following arguments (a short usage sketch follows below):

* ``lr`` : learning rate
* ``warmup`` : portion of ``t_total`` for the warmup, ``-1`` means no warmup. Default : ``-1``
* ``t_total`` : total number of training steps for the learning rate schedule, ``-1`` means constant learning rate. Default : ``-1``
* ``schedule`` : schedule to use for the warmup (see above). Can be ``'warmup_linear'``\ , ``'warmup_constant'``\ , ``'warmup_cosine'``\ , ``'none'``\ , ``None`` or a ``_LRSchedule`` object (see below). If ``None`` or ``'none'``\ , the learning rate is always kept constant. Default : ``'warmup_linear'``
* ``b1`` : Adam's b1. Default : ``0.9``
* ``b2`` : Adam's b2. Default : ``0.999``
* ``e`` : Adam's epsilon. Default : ``1e-6``
* ``weight_decay`` : weight decay. Default : ``0.01``
* ``max_grad_norm`` : maximum norm for the gradients (\ ``-1`` means no clipping). Default : ``1.0``

.. autoclass:: pytorch_pretrained_bert.BertAdam
   :members:
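
A minimal usage sketch (the checkpoint name, learning rate, warmup portion and ``t_total`` value are illustrative only; excluding biases and LayerNorm weights from weight decay mirrors the library's example scripts):

.. code-block:: python

    from pytorch_pretrained_bert import BertModel, BertAdam

    # 'bert-base-uncased' is an illustrative shortcut name; any BERT checkpoint works
    model = BertModel.from_pretrained('bert-base-uncased')

    # Apply weight decay to all parameters except biases and LayerNorm weights
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
    ]

    # Linear warmup over the first 10% of 1000 total training steps (illustrative values)
    optimizer = BertAdam(grouped_parameters, lr=3e-5, warmup=0.1, t_total=1000)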

1. ``BertModel``
~~~~~~~~~~~~~~~~~~~~

@@ -1,41 +1,25 @@

OpenAI GPT
----------------------------------------------------

``OpenAIGPTConfig``
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_pretrained_bert.OpenAIGPTConfig
   :members:

``OpenAIGPTTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~

``OpenAIGPTTokenizer`` performs Byte-Pair-Encoding (BPE) tokenization.

This class has four arguments:

* ``vocab_file``\ : path to a vocabulary file.
* ``merges_file``\ : path to a file containing the BPE merges.
* ``max_len``\ : max length to filter the input of the Transformer. Defaults to the pre-trained value for the model if ``None``. **Default = None**
* ``special_tokens``\ : a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's ``BasicTokenizer`` is used as the pre-BPE tokenizer, these tokens are not split. **Default = None**

and seven methods:

* ``tokenize(text)``\ : convert a ``str`` into a list of ``str`` tokens by performing BPE tokenization.
* ``convert_tokens_to_ids(tokens)``\ : convert a list of ``str`` tokens into a list of ``int`` indices in the vocabulary.
* ``convert_ids_to_tokens(ids)``\ : convert a list of ``int`` indices into a list of ``str`` tokens from the vocabulary.
* ``set_special_tokens(special_tokens)``\ : update the list of special tokens (see the arguments above).
* ``encode(text)``\ : convert a ``str`` into a list of ``int`` indices by performing BPE tokenization.
* ``decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)``\ : decode a list of ``int`` indices into a string, with optional post-processing: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
* ``save_vocabulary(directory_path)``\ : save the vocabulary, merges and special tokens files to ``directory_path``. Returns the paths to the three files: ``vocab_file_path``\ , ``merge_file_path``\ , ``special_tokens_file_path``. The vocabulary can be reloaded with ``OpenAIGPTTokenizer.from_pretrained('directory_path')``.

Please refer to the doc strings and code in `\ ``tokenization_openai.py`` <./pytorch_pretrained_bert/tokenization_openai.py>`_ for the details of the ``OpenAIGPTTokenizer``.

.. autoclass:: pytorch_pretrained_bert.OpenAIGPTTokenizer
   :members:
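
A minimal usage sketch (the ``openai-gpt`` shortcut name and the example sentence are illustrative; ``encode`` is tokenization followed by the vocabulary lookup):

.. code-block:: python

    from pytorch_pretrained_bert import OpenAIGPTTokenizer

    # 'openai-gpt' is an illustrative shortcut name for the pre-trained BPE vocabulary and merges
    tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

    # encode combines tokenize and convert_tokens_to_ids
    ids = tokenizer.encode("Hello world")
    assert ids == tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))

    # decode maps the indices back to a string and optionally cleans up tokenization spaces
    text = tokenizer.decode(ids, clean_up_tokenization_spaces=True)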

``OpenAIAdam``
~~~~~~~~~~~~~~~~~~

``OpenAIAdam`` is similar to ``BertAdam``.
The difference with ``BertAdam`` is that ``OpenAIAdam`` compensates for bias as the regular Adam optimizer does.

``OpenAIAdam`` accepts the same arguments as ``BertAdam``.

.. autoclass:: pytorch_pretrained_bert.OpenAIAdam
   :members:

9. ``OpenAIGPTModel``

@@ -1,31 +1,18 @@

OpenAI GPT2
----------------------------------------------------

``GPT2Config``
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_pretrained_bert.GPT2Config
   :members:

``GPT2Tokenizer``
~~~~~~~~~~~~~~~~~~~~~

``GPT2Tokenizer`` performs byte-level Byte-Pair-Encoding (BPE) tokenization.

This class has three arguments:

* ``vocab_file``\ : path to a vocabulary file.
* ``merges_file``\ : path to a file containing the BPE merges.
* ``errors``\ : how to handle Unicode decoding errors. **Default = ``replace``\ **

and seven methods:

* ``tokenize(text)``\ : convert a ``str`` into a list of ``str`` tokens by performing byte-level BPE.
* ``convert_tokens_to_ids(tokens)``\ : convert a list of ``str`` tokens into a list of ``int`` indices in the vocabulary.
* ``convert_ids_to_tokens(ids)``\ : convert a list of ``int`` indices into a list of ``str`` tokens from the vocabulary.
* ``set_special_tokens(special_tokens)``\ : update the list of special tokens (see the arguments above).
* ``encode(text)``\ : convert a ``str`` into a list of ``int`` indices by performing byte-level BPE.
* ``decode(ids)``\ : convert a list of ``int`` indices back into a ``str``.
* ``save_vocabulary(directory_path)``\ : save the vocabulary, merges and special tokens files to ``directory_path``. Returns the paths to the three files: ``vocab_file_path``\ , ``merge_file_path``\ , ``special_tokens_file_path``. The vocabulary can be reloaded with ``GPT2Tokenizer.from_pretrained('directory_path')``.

Please refer to `\ ``tokenization_gpt2.py`` <./pytorch_pretrained_bert/tokenization_gpt2.py>`_ for more details on the ``GPT2Tokenizer``.

.. autoclass:: pytorch_pretrained_bert.GPT2Tokenizer
   :members:
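
A minimal usage sketch (the ``gpt2`` shortcut name and the example string are illustrative; because the BPE operates on bytes, decoding the indices should recover the original string exactly):

.. code-block:: python

    from pytorch_pretrained_bert import GPT2Tokenizer

    # 'gpt2' is an illustrative shortcut name for the pre-trained byte-level BPE vocabulary and merges
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # encode combines tokenize and convert_tokens_to_ids
    ids = tokenizer.encode("Hello world")

    # Byte-level BPE is lossless, so decoding recovers the original text
    assert tokenizer.decode(ids) == "Hello world"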

14. ``GPT2Model``

@@ -2,14 +2,18 @@ Transformer XL
----------------------------------------------------

``TransfoXLConfig``
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_pretrained_bert.TransfoXLConfig
   :members:

``TransfoXLTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~

``TransfoXLTokenizer`` performs word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by token frequency, as required for adaptive softmax. See the adaptive softmax paper (\ `Efficient softmax approximation for GPUs <http://arxiv.org/abs/1609.04309>`_\ ) for more details.

The API is similar to the API of ``BertTokenizer`` (see above).

Please refer to the doc strings and code in `\ ``tokenization_transfo_xl.py`` <./pytorch_pretrained_bert/tokenization_transfo_xl.py>`_ for the details of these additional methods in ``TransfoXLTokenizer``.

.. autoclass:: pytorch_pretrained_bert.TransfoXLTokenizer
   :members:
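
A minimal usage sketch (the ``transfo-xl-wt103`` shortcut name and the example sentence are illustrative; words missing from the vocabulary are mapped to the unknown-token index):

.. code-block:: python

    from pytorch_pretrained_bert import TransfoXLTokenizer

    # 'transfo-xl-wt103' is an illustrative shortcut name for the pre-trained word-level vocabulary
    tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')

    # Same basic API as BertTokenizer: word tokenization, then vocabulary lookup
    tokens = tokenizer.tokenize("The cat sat on the mat .")
    ids = tokenizer.convert_tokens_to_ids(tokens)   # out-of-vocabulary words map to the unknown token
    words = tokenizer.convert_ids_to_tokens(ids)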

12. ``TransfoXLModel``

@@ -1,2 +1,5 @@

XLM
----------------------------------------------------

I don't really know what to put here, I'll leave it up to you to decide @Thom

@@ -1,2 +1,4 @@

XLNet
----------------------------------------------------

I don't really know what to put here, I'll leave it up to you to decide @Thom