diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
index 406d9f30e..9529e1848 100644
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -218,6 +218,52 @@
 positional embeddings. Absolute positional embeddings are selected in the range ``[0,
 config.max_position_embeddings - 1]``. Some models use other types of positional embeddings, such as sinusoidal
 position embeddings or relative position embeddings.
+
+.. _labels:
+
+Labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The labels are an optional argument that can be passed so that the model computes the loss itself. The labels should
+be the expected predictions of the model: it will use a standard loss function to compute the loss between its
+predictions and the expected values (the labels).
+
+The expected labels differ according to the model head, for example:
+
+- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects
+  a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
+  entire sequence.
+- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a
+  tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
+  individual token.
+- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of
+  dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
+  token: the labels are the token IDs of the masked tokens, with a value to be ignored everywhere else (usually
+  -100).
+- For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`,
+  :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension
+  :obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequence associated with each
+  input sequence. During training, both `BART` and `T5` will create the appropriate :obj:`decoder_input_ids` and
+  decoder attention masks internally, so they usually do not need to be supplied. This does not apply to models
+  leveraging the Encoder-Decoder framework.
+
+See the documentation of each model for more information on each specific model's labels.
+
+The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as they are the base transformer
+models, simply outputting features.
+
+.. _decoder-input-ids:
+
+Decoder input IDs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This input is specific to encoder-decoder models and contains the input IDs that will be fed to the decoder. These
+inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in
+a way specific to each model.
+
+Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
+such models, passing the :obj:`labels` is the preferred way to handle training.
+
+Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
+
 .. _feed-forward-chunking:
 
 Feed Forward Chunking
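The ``-100`` ignore-value convention that the labels section above mentions can be illustrated with a small
self-contained sketch (assuming PyTorch; the shapes and values are invented for illustration). It mirrors what a
masked language modeling head computes internally when labels are passed: a cross-entropy loss with
``ignore_index=-100``, which is PyTorch's default:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy logits for a batch of 1 sequence of 4 tokens over a vocabulary of 5.
logits = torch.randn(1, 4, 5)

# Masked-LM-style labels: only position 2 was masked, so only it carries a
# real token ID; every other position is set to -100 and ignored by the loss.
labels = torch.tensor([[-100, -100, 3, -100]])

# Cross-entropy with ignore_index=-100 (PyTorch's default) is what a model
# head computes internally when labels are supplied alongside the inputs.
loss = F.cross_entropy(logits.view(-1, 5), labels.view(-1))

# The loss reduces to the cross-entropy at the single labeled position.
reference = F.cross_entropy(logits[0, 2].unsqueeze(0), torch.tensor([3]))
assert torch.isclose(loss, reference)
```

The same convention lets you mask out padding tokens in token classification or sequence-to-sequence labels so they
do not contribute to the loss.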
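The relationship between labels and decoder input IDs described above can be sketched as a "shift right": the
decoder inputs are the labels shifted one position, prefixed with a decoder start token, and with any ``-100``
positions replaced by the pad token. The helper below is hypothetical, mirroring what models like T5 and BART do
internally; the exact details vary per model:

```python
import torch


def shift_tokens_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    """Shift labels one position to the right to build decoder inputs."""
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # Replace any ignored label positions (-100) with the pad token, since
    # -100 is a loss-masking value, not a real vocabulary ID.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids


labels = torch.tensor([[42, 17, 5, -100]])
print(shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=2).tolist())
# [[2, 42, 17, 5]]
```

Because the model can derive the decoder inputs this way, passing only the labels is usually enough for training
these models.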