diff --git a/docs/source/en/tasks/language_modeling.mdx b/docs/source/en/tasks/language_modeling.mdx
index f410bd5a5..82708f2f8 100644
--- a/docs/source/en/tasks/language_modeling.mdx
+++ b/docs/source/en/tasks/language_modeling.mdx
@@ -245,20 +245,18 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
->>> tf_train_set = lm_dataset["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_train_set = model.prepare_tf_dataset(
+...     lm_dataset["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_test_set = lm_dataset["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_test_set = model.prepare_tf_dataset(
+...     lm_dataset["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -352,20 +350,18 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
->>> tf_train_set = lm_dataset["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_train_set = model.prepare_tf_dataset(
+...     lm_dataset["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_test_set = lm_dataset["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_test_set = model.prepare_tf_dataset(
+...     lm_dataset["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
diff --git a/docs/source/en/tasks/multiple_choice.mdx b/docs/source/en/tasks/multiple_choice.mdx
index b8eb52849..6ee0d7137 100644
--- a/docs/source/en/tasks/multiple_choice.mdx
+++ b/docs/source/en/tasks/multiple_choice.mdx
@@ -224,21 +224,19 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
 >>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
->>> tf_train_set = tokenized_swag["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids"],
-...     label_cols=["labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_swag["train"],
 ...     shuffle=True,
 ...     batch_size=batch_size,
 ...     collate_fn=data_collator,
 ... )

->>> tf_validation_set = tokenized_swag["validation"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids"],
-...     label_cols=["labels"],
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_swag["validation"],
 ...     shuffle=False,
 ...     batch_size=batch_size,
 ...     collate_fn=data_collator,
@@ -273,10 +271,7 @@ Load BERT with [`TFAutoModelForMultipleChoice`]:
 Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

 ```py
->>> model.compile(
-...     optimizer=optimizer,
-...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
-... )
+>>> model.compile(optimizer=optimizer)
 ```

 Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
diff --git a/docs/source/en/tasks/question_answering.mdx b/docs/source/en/tasks/question_answering.mdx
index 2cb54760e..218fa7bb5 100644
--- a/docs/source/en/tasks/question_answering.mdx
+++ b/docs/source/en/tasks/question_answering.mdx
@@ -199,20 +199,18 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
->>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
-...     dummy_labels=True,
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_squad["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
-...     dummy_labels=True,
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_squad["validation"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
diff --git a/docs/source/en/tasks/sequence_classification.mdx b/docs/source/en/tasks/sequence_classification.mdx
index 44729dc28..2ef8a9619 100644
--- a/docs/source/en/tasks/sequence_classification.mdx
+++ b/docs/source/en/tasks/sequence_classification.mdx
@@ -144,18 +144,19 @@ At this point, only three steps remain:
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
+

 ```py
->>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "label"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_imdb["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "label"],
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_imdb["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
diff --git a/docs/source/en/tasks/summarization.mdx b/docs/source/en/tasks/summarization.mdx
index f636141a1..1b2eafcb5 100644
--- a/docs/source/en/tasks/summarization.mdx
+++ b/docs/source/en/tasks/summarization.mdx
@@ -159,18 +159,18 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
->>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_billsum["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_test_set = tokenized_billsum["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_test_set = model.prepare_tf_dataset(
+...     tokenized_billsum["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
diff --git a/docs/source/en/tasks/token_classification.mdx b/docs/source/en/tasks/token_classification.mdx
index aa5739534..3d2a3ccb0 100644
--- a/docs/source/en/tasks/token_classification.mdx
+++ b/docs/source/en/tasks/token_classification.mdx
@@ -199,18 +199,18 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
->>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_wnut["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_wnut["validation"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
diff --git a/docs/source/en/tasks/translation.mdx b/docs/source/en/tasks/translation.mdx
index d17b87041..7439bc7b6 100644
--- a/docs/source/en/tasks/translation.mdx
+++ b/docs/source/en/tasks/translation.mdx
@@ -175,18 +175,18 @@ At this point, only three steps remain:
 ```
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].

 ```py
->>> tf_train_set = tokenized_books["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_books["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )

->>> tf_test_set = tokenized_books["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_test_set = model.prepare_tf_dataset(
+...     tokenized_books["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -216,7 +216,7 @@ Configure the model for training with [`compile`](https://keras.io/api/models/mo
 Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

 ```py
->>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
+>>> model.fit(tf_train_set, validation_data=tf_test_set, epochs=3)
 ```
diff --git a/docs/source/en/training.mdx b/docs/source/en/training.mdx
index 9222d27ac..89f5c3148 100644
--- a/docs/source/en/training.mdx
+++ b/docs/source/en/training.mdx
@@ -65,10 +65,16 @@ If you like, you can create a smaller subset of the full dataset to fine-tune on

 ## Train

+At this point, you should follow the section corresponding to the framework you want to use. You can use the links
+in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework,
+just use the button at the top-right of that framework's block!
+
+## Train with PyTorch Trainer
+
 🤗 Transformers provides a [`Trainer`] class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [`Trainer`] API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

 Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels:
@@ -151,66 +157,113 @@ Then fine-tune your model by calling [`~transformers.Trainer.train`]:
-🤗 Transformers models also supports training in TensorFlow with the Keras API.
+## Train a TensorFlow model with Keras

-### Convert dataset to TensorFlow format
+You can also train 🤗 Transformers models in TensorFlow with the Keras API!

-The [`DefaultDataCollator`] assembles tensors into a batch for the model to train on. Make sure you specify `return_tensors` to return TensorFlow tensors:
+### Loading data for Keras
+
+When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that
+Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras.
+Let's try that first before we do anything more complicated.
+
+First, load a dataset. We'll use the CoLA dataset from the [GLUE benchmark](https://huggingface.co/datasets/glue),
+since it's a simple binary text classification task, and just take the training split for now.

 ```py
->>> from transformers import DefaultDataCollator
+from datasets import load_dataset

->>> data_collator = DefaultDataCollator(return_tensors="tf")
+dataset = load_dataset("glue", "cola")
+dataset = dataset["train"]  # Just take the training split for now
+```
+
+Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0 and 1s,
+so we can just convert that directly to a NumPy array without tokenization!
+
+```py
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+tokenized_data = tokenizer(dataset["text"], return_tensors="np", padding=True)
+
+labels = np.array(dataset["label"])  # Label is already an array of 0 and 1
+```
+
+Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model:
+
+```py
+from transformers import TFAutoModelForSequenceClassification
+from tensorflow.keras.optimizers import Adam
+
+# Load and compile our model
+model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
+# Lower learning rates are often better for fine-tuning transformers
+model.compile(optimizer=Adam(3e-5))
+
+model.fit(tokenized_data, labels)
 ```

-[`Trainer`] uses [`DataCollatorWithPadding`] by default so you don't need to explicitly specify a data collator.
+You don't have to pass a loss argument to your models when you `compile()` them! Hugging Face models automatically
+choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always
+override this by specifying a loss yourself if you want to!

-Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:
+This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why?
+Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle
+“jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole
+dataset. That’s going to make your array even bigger, and all those padding tokens will slow down training too!
+
+### Loading data as a tf.data.Dataset
+
+If you want to avoid slowing down training, you can load your data as a `tf.data.Dataset` instead. Although you can write your own
+`tf.data` pipeline if you want, we have two convenience methods for doing this:
+
+- [`~TFPreTrainedModel.prepare_tf_dataset`]: This is the method we recommend in most cases. Because it is a method
+on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and
+discard the others to make a simpler, more performant dataset.
+- [`~datasets.Dataset.to_tf_dataset`]: This method is more low-level, and is useful when you want to exactly control how
+your dataset is created, by specifying exactly which `columns` and `label_cols` to include.
+
+Before you can use [`~TFPreTrainedModel.prepare_tf_dataset`], you will need to add the tokenizer outputs to your dataset as columns, as shown in
+the following code sample:

 ```py
->>> tf_train_dataset = small_train_dataset.to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "token_type_ids"],
-...     label_cols=["labels"],
-...     shuffle=True,
-...     collate_fn=data_collator,
-...     batch_size=8,
-... )
+def tokenize_dataset(data):
+    # Keys of the returned dictionary will be added to the dataset as columns
+    return tokenizer(data["text"])

->>> tf_validation_dataset = small_eval_dataset.to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "token_type_ids"],
-...     label_cols=["labels"],
-...     shuffle=False,
-...     collate_fn=data_collator,
-...     batch_size=8,
-... )
+
+dataset = dataset.map(tokenize_dataset)
 ```

-### Compile and fit
+Remember that Hugging Face datasets are stored on disk by default, so this will not inflate your memory usage! Once the
+columns have been added, you can stream batches from the dataset and add padding to each batch, which greatly
+reduces the number of padding tokens compared to padding the entire dataset.

-Load a TensorFlow model with the expected number of labels:

 ```py
->>> import tensorflow as tf
->>> from transformers import TFAutoModelForSequenceClassification
-
->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
+>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
 ```

-Then compile and fine-tune your model with [`fit`](https://keras.io/api/models/model_training_apis/) as you would with any other Keras model:
+Note that in the code sample above, you need to pass the tokenizer to `prepare_tf_dataset` so it can correctly pad batches as they're loaded.
+If all the samples in your dataset are the same length and no padding is necessary, you can skip this argument.
+If you need to do something more complex than just padding samples (e.g. corrupting tokens for masked language
+modelling), you can use the `collate_fn` argument instead to pass a function that will be called to transform the
+list of samples into a batch and apply any preprocessing you want. See our
+[examples](https://github.com/huggingface/transformers/tree/main/examples) or
+[notebooks](https://huggingface.co/docs/transformers/notebooks) to see this approach in action.
+
+Once you've created a `tf.data.Dataset`, you can compile and fit the model as before:

 ```py
->>> model.compile(
-...     optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
-...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
-...     metrics=tf.metrics.SparseCategoricalAccuracy(),
-... )
+model.compile(optimizer=Adam(3e-5))

->>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
+model.fit(tf_dataset)
 ```
+
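Note on the padding claim in the `training.mdx` hunk: the new text says that padding each batch separately greatly reduces the number of padding tokens compared to padding the entire dataset to its longest sample. The sketch below illustrates that arithmetic in plain Python. It does not call 🤗 Transformers or TensorFlow; the helper names (`padding_whole_dataset`, `padding_per_batch`) and the sample lengths are hypothetical, chosen only for illustration, and real savings depend on the length distribution and on shuffling.

```python
# Illustrative only: counts padding tokens under two padding strategies.
# The lengths below are made-up stand-ins for tokenized sample lengths.


def padding_whole_dataset(lengths):
    """Padding tokens needed when every sample is padded to the global maximum."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


def padding_per_batch(lengths, batch_size):
    """Padding tokens needed when each batch is padded to its own maximum."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


sample_lengths = [5, 7, 6, 8, 40, 38, 6, 5]  # hypothetical token counts
whole = padding_whole_dataset(sample_lengths)
batched = padding_per_batch(sample_lengths, batch_size=2)

print(f"whole-dataset padding: {whole} tokens")
print(f"per-batch padding:     {batched} tokens")
```

Because a batch's longest sample can never exceed the dataset's longest sample, per-batch padding is never worse than whole-dataset padding in this model; how much better it is depends on how the long samples are grouped.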