<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Preprocess

[[open-in-colab]]

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Image inputs, use an [ImageProcessor](./main_classes/image) to convert images into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.

<Tip>

`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.

</Tip>

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

```bash
pip install datasets
```

## Natural Language Processing

<Youtube id="Yffk5aydLzg"/>

The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and that it uses the same token-to-index mapping (usually referred to as the *vocab*) as during pretraining.

</Tip>
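
To make the *vocab* idea concrete, here is a minimal sketch of a whitespace tokenizer with a tiny made-up vocabulary. The names `toy_vocab` and `toy_encode` are hypothetical and only for illustration; real tokenizers use subword algorithms and much larger vocabularies.

```py
# Toy illustration only: a made-up vocab and a whitespace tokenizer.
toy_vocab = {"[UNK]": 0, "what": 1, "about": 2, "second": 3, "breakfast": 4}


def toy_encode(text, vocab):
    """Lowercase the text, split it on whitespace, and map each token to its index."""
    tokens = text.lower().split()
    # Tokens missing from the vocab fall back to the [UNK] index.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


ids = toy_encode("What about second breakfast", toy_vocab)
print(ids)  # [1, 2, 3, 4]
```

Encoding the same text with a different vocab would produce different indices, which is why the tokenizer has to match the pretrained model.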

Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Then pass your text to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.

Return your input by decoding the `input_ids`:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
special tokens, but if they do, the tokenizer automatically adds them for you.
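
The three items are easier to picture for a pair of sequences. The sketch below assembles BERT-style inputs by hand, assuming the `[CLS]` and `[SEP]` ids (`101` and `102`) seen in the output above. `build_pair_inputs` is a hypothetical helper, not part of the library.

```py
CLS_ID, SEP_ID = 101, 102  # special token ids taken from the outputs above


def build_pair_inputs(ids_a, ids_b):
    """Assemble BERT-style inputs for a pair of already-tokenized sequences."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], the first sequence, and its [SEP];
    # segment 1 covers the second sequence and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding yet, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = build_pair_inputs([1252, 1184], [1327, 1164, 136])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In practice you don't build these by hand: passing two texts to the tokenizer, as in `tokenizer(text_a, text_b)`, returns them for you.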

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

The first and third sentences are now padded with `0`s because they are shorter.
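
As a rough sketch of what padding does, the hypothetical `pad_batch` helper below right-pads each sequence to the longest one and builds the matching attention mask. It assumes a padding id of `0` and padding on the right, as in the output above.

```py
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"][1])       # [101, 1327, 1164, 5450, 23434, 136, 102, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1, 0]
```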

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

<Tip>

Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about the different padding and truncation arguments.

</Tip>
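
None of the three sentences above exceeds the model's maximum length (512 tokens for `bert-base-cased`), so `truncation=True` leaves them unchanged. To see the effect, here is a minimal sketch with a much smaller limit. `truncate_with_special_tokens` is a hypothetical helper that assumes BERT-style special tokens, and it is not how the library implements truncation.

```py
CLS_ID, SEP_ID = 101, 102  # special token ids taken from the outputs above


def truncate_with_special_tokens(token_ids, max_length):
    """Keep at most `max_length` ids in total, including [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Reserve two positions for the special tokens, then cut from the end.
    kept = token_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]


# Ids of the second sentence above, without its special tokens.
ids = [1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119]
short = truncate_with_special_tokens(ids, max_length=8)
print(short)  # [101, 1790, 112, 189, 1341, 1119, 3520, 102]
```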

### Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>, 
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

## Audio

For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data. 

1. Use 🤗 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```
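
For intuition about why the array changes, here is a naive resampling sketch based on linear interpolation. `resample_linear` is a hypothetical helper for illustration only. Real resamplers also apply anti-aliasing filters, so this is not what 🤗 Datasets does internally.

```py
def resample_linear(signal, orig_rate, target_rate):
    """Naively resample a 1D signal with linear interpolation (illustration only)."""
    n_out = int(len(signal) * target_rate / orig_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i on the original sample grid.
        pos = i * orig_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(signal) - 1)
        frac = pos - left
        out.append(signal[left] * (1 - frac) + signal[right] * frac)
    return out


one_ms_at_8k = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6]  # 8 samples = 1 ms at 8kHz
upsampled = resample_linear(one_ms_at_8k, 8_000, 16_000)
print(len(one_ms_at_8k), len(upsampled))  # 8 16
```

Doubling the sampling rate doubles the number of samples that describe the same stretch of audio.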

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```
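
The values in `input_values` differ from the raw `array` because the feature extractor normalizes the signal. As a sketch, and assuming zero-mean, unit-variance normalization (check the feature extractor's configuration for the model you use), the operation looks roughly like this. `zero_mean_unit_variance` is a hypothetical helper.

```py
import math


def zero_mean_unit_variance(signal, eps=1e-7):
    """Shift a 1D signal to mean 0 and scale it to (approximately) unit variance."""
    mean = sum(signal) / len(signal)
    variance = sum((x - mean) ** 2 for x in signal) / len(signal)
    # eps guards against division by zero for a constant (for example, silent) signal.
    return [(x - mean) / math.sqrt(variance + eps) for x in signal]


normalized = zero_mean_unit_variance([0.1, 0.3, 0.5, 0.7])
new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(abs(new_mean) < 1e-9, abs(new_variance - 1.0) < 1e-4)
```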

Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same lengths. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)
```
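
Both clips above are longer than `max_length`, so both are truncated. The general idea of forcing a clip to a fixed length can be sketched as follows. `pad_or_truncate` is a hypothetical helper, and the clip lengths are tiny only to keep the example readable.

```py
def pad_or_truncate(signal, max_length, pad_value=0.0):
    """Return a copy of `signal` with exactly `max_length` samples."""
    if len(signal) >= max_length:
        # Too long: drop the samples past max_length.
        return signal[:max_length]
    # Too short: append zeros, which are interpreted as silence.
    return signal + [pad_value] * (max_length - len(signal))


long_clip = [0.1] * 12
short_clip = [0.2] * 5
fixed = [pad_or_truncate(clip, max_length=8) for clip in (long_clip, short_clip)]
print([len(clip) for clip in fixed])  # [8, 8]
```

Note that `padding=True` pads only to the longest clip in the batch, so a batch of clips that are all shorter than `max_length` would need `padding="max_length"` to reach that exact length.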

## Computer vision

For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model.
Image preprocessing consists of several steps that convert images into the input expected by the model. These steps
include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors.
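
As an illustration of the normalizing step, the sketch below rescales one RGB pixel from the 0-255 range to [0, 1] and then standardizes each channel. The mean and standard deviation values are assumptions chosen for readability, and `normalize_pixel` is a hypothetical helper. For a real model, read the statistics from the image processor rather than hard-coding them.

```py
def normalize_pixel(rgb, mean, std):
    """Rescale 0-255 channel values to [0, 1], then standardize each channel."""
    return [((value / 255.0) - m) / s for value, m, s in zip(rgb, mean, std)]


# Assumed statistics for illustration only.
mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
print(normalize_pixel([255, 0, 255], mean, std))  # [1.0, -1.0, 1.0]
```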

<Tip>

Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation
transform image data, but they serve different purposes:

* Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations.
* Image preprocessing guarantees that the images match the model’s expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained.

You can use any library you like for image augmentation. For image preprocessing, use the `ImageProcessor` associated with the model.

</Tip>

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

<Tip>

Use the 🤗 Datasets `split` parameter to load only a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the image processor with [`AutoImageProcessor.from_pretrained`]:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```

First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Here we use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain together a couple of
transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html).
Note that for resizing, we can get the image size requirements from the `image_processor`. For some models, an exact height and
width are expected, for others only the `shortest_edge` is defined.

```py
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose

>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )

>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
```

2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)
as its input. `ImageProcessor` can take care of normalizing the images, and generating appropriate tensors.
Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`:

```py
>>> def transforms(examples):
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples
```

<Tip>

In the example above we set `do_resize=False` because we have already resized the images in the image augmentation transformation,
and leveraged the `size` attribute from the appropriate `image_processor`. If you do not resize images during image augmentation,
leave this parameter out. By default, `ImageProcessor` will handle the resizing.

If you wish to normalize images as a part of the augmentation transformation, use the `image_processor.image_mean`,
and `image_processor.image_std` values.
</Tip>

3. Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly:

```py
>>> dataset.set_transform(transforms)
```

4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!

```py
>>> dataset[0].keys()
```

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>
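
The `permute(1, 2, 0)` call in the plotting snippet is needed because `pixel_values` are stored channels-first, as (channels, height, width), while `imshow` expects channels-last, as (height, width, channels). Here is a minimal sketch of that reordering on nested lists. `channels_first_to_last` is a hypothetical helper for illustration.

```py
def channels_first_to_last(chw):
    """Convert a nested list shaped (C, H, W) into one shaped (H, W, C)."""
    n_channels, height, width = len(chw), len(chw[0]), len(chw[0][0])
    return [
        [[chw[c][y][x] for c in range(n_channels)] for x in range(width)]
        for y in range(height)
    ]


# A 3-channel image that is 1 pixel high and 2 pixels wide.
chw = [[[10, 11]], [[20, 21]], [[30, 31]]]
print(channels_first_to_last(chw))  # [[[10, 20, 30], [11, 21, 31]]]
```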

<Tip>

For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, `ImageProcessor`
offers post processing methods. These methods convert the model's raw outputs into meaningful predictions such as bounding boxes,
or segmentation maps.

</Tip>

### Pad

In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training
time. This may cause images to be different sizes in a batch. You can use [`DetrImageProcessor.pad_and_create_pixel_mask`]
from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images together.

```py
>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad_and_create_pixel_mask(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch
```
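
To picture what padding images and creating a pixel mask involves, here is a sketch on single-channel images stored as nested lists. It assumes images are padded with zeros at the bottom and right up to the largest height and width in the batch. `pad_images_with_mask` is a hypothetical helper, not the library's implementation.

```py
def pad_images_with_mask(images, pad_value=0):
    """Pad 2D (H, W) images at the bottom and right to a common size and build pixel masks."""
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        rows = [row + [pad_value] * (max_w - w) for row in img]
        rows += [[pad_value] * max_w for _ in range(max_h - h)]
        # 1 marks a real pixel, 0 marks padding.
        mask = [[1] * w + [0] * (max_w - w) for _ in range(h)]
        mask += [[0] * max_w for _ in range(max_h - h)]
        padded.append(rows)
        masks.append(mask)
    return padded, masks


small = [[5, 5]]  # 1 x 2 image
large = [[7, 7, 7], [7, 7, 7]]  # 2 x 3 image
padded, masks = pad_images_with_mask([small, large])
print(padded[0])  # [[5, 5, 0], [0, 0, 0]]
print(masks[0])   # [[1, 1, 0], [0, 0, 0]]
```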

## Multimodal

For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as a tokenizer and feature extractor.

Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

```py
>>> from datasets import load_dataset, Audio

>>> lj_speech = load_dataset("lj_speech", split="train")
```

For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]
...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!
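
Conceptually, a processor routes each modality to the right component. The toy sketch below shows that pattern with stand-in callables. `ToyProcessor`, the peak-scaling lambda, and the character-level `char_vocab` are all hypothetical and do not reflect how the real feature extractor or tokenizer work.

```py
class ToyProcessor:
    """Hypothetical wrapper that pairs an audio callable with a text callable."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, audio, text):
        # Audio goes through the feature extractor, text through the tokenizer.
        return {
            "input_values": self.feature_extractor(audio),
            "labels": self.tokenizer(text),
        }


# Stand-ins for the real components: peak-scale the audio, index characters.
char_vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
toy = ToyProcessor(
    feature_extractor=lambda audio: [x / max(abs(v) for v in audio) for x in audio],
    tokenizer=lambda text: [char_vocab[ch] for ch in text.lower() if ch in char_vocab],
)
out = toy(audio=[0.5, -1.0, 0.25], text="Hi")
print(out)  # {'input_values': [0.5, -1.0, 0.25], 'labels': [7, 8]}
```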
-												Updates to computer vision section of the Preprocess doc (#21181)

* Extended the CV preprocessing section with more details and refactored the example

* added padding to the CV section, though it is a special case

* Added a tip about post processing methods

* make style

* link update

* Apply suggestions from review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* review feedback

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
											
										
										
											2023-01-19 13:43:36 +00:00
+								<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-												Convert tutorials (#14665)

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links
											
										
										
											2021-12-08 18:19:46 +00:00
 								Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 								the License. You may obtain a copy of the License at
 								http://www.apache.org/licenses/LICENSE-2.0
 								Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 								an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 								specific language governing permissions and limitations under the License.
 								-->
-												Update tutorial docs (#15165)

* first draft of pipeline, autoclass, preprocess tutorials

* apply review feedback

* 🖍 apply feedback from patrick/niels

* 📝add output image to preprocessed image

* 🖍 apply feedback from patrick
											
										
										
											2022-02-02 00:31:35 +00:00
+								# Preprocess
-												Convert tutorials (#14665)

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links
											
										
										
											2021-12-08 18:19:46 +00:00
-												Put back open in colab markers (#14684)


											
										
										
											2021-12-09 17:00:06 +00:00
+								[[open-in-colab]]
-												Focus doc around preprocessing classes (#18768)

* 📝 reframe docs around preprocessing classes

* small edits

* edits and review

* fix typo

* apply review

* clarify processor
											
										
										
											2022-09-29 00:09:44 +00:00
+								Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, they need to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:
-												Convert tutorials (#14665)

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links
											
										
										
											2021-12-08 18:19:46 +00:00
-												Focus doc around preprocessing classes (#18768)

* 📝 reframe docs around preprocessing classes

* small edits

* edits and review

* fix typo

* apply review

* clarify processor
											
										
										
											2022-09-29 00:09:44 +00:00
+								* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
-												AutoImageProcessor (#20111)

* AutoImageProcessor skeleton

* Update references

* Add mapping in init

* Add model image processors to __init__ for importing

* Add AutoImageProcessor tests

* Fix up

* Image Processor documentation

* Remove pdb

* Update docs/source/en/model_doc/mobilevit.mdx

* Update docs

* Don't add whitespace on json files

* Remove fixtures

* Move checking model config down

* Fix up

* Add check for image processor

* Remove FeatureExtractorMixin in docstrings

* Rename model_tmpfile to config_tmpfile

* Don't make None if not in image processor map
											
										
										
											2022-11-08 19:54:41 +00:00
+								* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
-												Updates to computer vision section of the Preprocess doc (#21181)

* Extended the CV preprocessing section with more details and refactored the example

* added padding to the CV section, though it is a special case

* Added a tip about post processing methods

* make style

* link update

* Apply suggestions from review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* review feedback

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
											
										
										
											2023-01-19 13:43:36 +00:00
+								* Image inputs use a [ImageProcessor](./main_classes/image) to convert images into tensors.
-												Update doc examples feature extractor -> image processor (#20501)

* Update doc example feature extractor -> image processor

* Apply suggestions from code review
											
										
										
											2022-11-30 14:50:55 +00:00
+								* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.
-												Update tutorial docs (#15165)

* first draft of pipeline, autoclass, preprocess tutorials

* apply review feedback

* 🖍 apply feedback from patrick/niels

* 📝add output image to preprocessed image

* 🖍 apply feedback from patrick
											
										
										
											2022-02-02 00:31:35 +00:00
-												Focus doc around preprocessing classes (#18768)

* 📝 reframe docs around preprocessing classes

* small edits

* edits and review

* fix typo

* apply review

* clarify processor
											
										
										
											2022-09-29 00:09:44 +00:00
+								<Tip>
-												Update doc examples feature extractor -> image processor (#20501)

* Update doc example feature extractor -> image processor

* Apply suggestions from code review
											
										
										
											2022-11-30 14:50:55 +00:00
+								`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.
-												Focus doc around preprocessing classes (#18768)

* 📝 reframe docs around preprocessing classes

* small edits

* edits and review

* fix typo

* apply review

* clarify processor
											
										
										
											2022-09-29 00:09:44 +00:00
 								</Tip>
 								Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:
 								```bash
 								pip install datasets
 								```
 								## Natural Language Processing
-												Update tutorial docs (#15165)

* first draft of pipeline, autoclass, preprocess tutorials

* apply review feedback

* 🖍 apply feedback from patrick/niels

* 📝add output image to preprocessed image

* 🖍 apply feedback from patrick
											
										
										
											2022-02-02 00:31:35 +00:00
 								<Youtube id="Yffk5aydLzg"/>
-												Focus doc around preprocessing classes (#18768)

* 📝 reframe docs around preprocessing classes

* small edits

* edits and review

* fix typo

* apply review

* clarify processor
											
										
										
											2022-09-29 00:09:44 +00:00
+								The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.
 								<Tip>
If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus and is mapped with the same token-to-index lookup (usually referred to as the *vocab*) that was used during pretraining.
 								</Tip>
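To make the role of the *vocab* concrete, here is a minimal sketch of a token-to-index lookup. The whitespace splitting rule, the two tiny vocabularies, and the `encode` helper are all hypothetical and exist only for illustration; real pretrained tokenizers use subword algorithms and much larger vocabularies.

```python
# Toy illustration only: a whitespace "tokenizer" with a hand-written vocab.
# Real pretrained tokenizers use subword rules and far larger vocabularies.


def encode(text, vocab, unk_token="[UNK]"):
    """Split text on whitespace and map each token to its index in vocab."""
    tokens = text.lower().split()
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


# Hypothetical vocab the "model" was pretrained with.
pretraining_vocab = {"[UNK]": 0, "second": 1, "breakfast": 2, "what": 3, "about": 4}
# A mismatched vocab: same tokens, different indices.
other_vocab = {"[UNK]": 0, "about": 1, "what": 2, "breakfast": 3, "second": 4}

text = "What about second breakfast"
matching_ids = encode(text, pretraining_vocab)
mismatched_ids = encode(text, other_vocab)

print(matching_ids)    # [3, 4, 1, 2]
print(mismatched_ids)  # [2, 1, 4, 3]
```

The same text yields different ids under the two vocabularies, which is why a model should only receive ids produced by the tokenizer it was pretrained with.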
+								Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:
+								```py
 								>>> from transformers import AutoTokenizer
+								>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
 								```
+								Then pass your text to the tokenizer:
 								```py
+								>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
+								>>> print(encoded_input)
+								{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 								 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+								```
+								The tokenizer returns a dictionary with three important items:
 								* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
 								* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
 								* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.
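The three items can be sketched for a pair of sequences as follows. This is a simplified illustration under stated assumptions: `build_inputs` is a hypothetical helper (not a 🤗 Transformers function), it assumes BERT-style `[CLS]`/`[SEP]` placement, and it works on token strings instead of ids to keep the output readable.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Assemble BERT-style inputs from one or two lists of tokens."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first sequence -> 0
    if tokens_b is not None:
        tokens = tokens + tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second sequence -> 1
    attention_mask = [1] * len(tokens)  # every real token is attended to
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = build_inputs(["what", "about", "breakfast", "?"], ["second", "breakfast", "!"])
for key, value in pair.items():
    print(key, value)
```

With a single sequence, `token_type_ids` is all zeros, which matches the output shown above.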
Recover your original input by decoding the `input_ids`:
 								```py
 								>>> tokenizer.decode(encoded_input["input_ids"])
+								'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
+								```
+								As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
+								special tokens, but if they do, the tokenizer automatically adds them for you.
+								If there are several sentences you want to preprocess, pass them as a list to the tokenizer:
 								```py
+								>>> batch_sentences = [
 								...     "But what about second breakfast?",
 								...     "Don't think he knows about second breakfast, Pip.",
 								...     "What about elevensies?",
 								... ]
+								>>> encoded_inputs = tokenizer(batch_sentences)
 								>>> print(encoded_inputs)
+								{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
 								               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
 								               [101, 1327, 1164, 5450, 23434, 136, 102]],
 								 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
 								                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                    [0, 0, 0, 0, 0, 0, 0]],
 								 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
 								                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 								                    [1, 1, 1, 1, 1, 1, 1]]}
+								```
+								### Pad
Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.
+								Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:
+								```py
 								>>> batch_sentences = [
 								...     "But what about second breakfast?",
 								...     "Don't think he knows about second breakfast, Pip.",
 								...     "What about elevensies?",
 								... ]
 								>>> encoded_input = tokenizer(batch_sentences, padding=True)
 								>>> print(encoded_input)
 								{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
 								               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
 								               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 								 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 								 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
 								                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 								                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
 								```
The first and third sentences are now padded with `0`s because they are shorter than the second sentence.
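The padding logic can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, and the padding id `0` is assumed to match the output above.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_length = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_length - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}


# The unpadded ids from the batch example earlier in this section.
batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])  # [15, 15, 15]
print(padded["attention_mask"][2])
```

The attention mask is what lets the model tell real tokens apart from padding once all rows have the same length.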
 								### Truncation
+								On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.
 								Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:
 								```py
+								>>> batch_sentences = [
 								...     "But what about second breakfast?",
 								...     "Don't think he knows about second breakfast, Pip.",
 								...     "What about elevensies?",
 								... ]
 								>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
 								>>> print(encoded_input)
 								{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
 								               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
 								               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 								 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 								 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
 								                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 								                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
 								```
+								<Tip>
Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about the different padding and truncation arguments.
 								</Tip>
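In the truncation example above, no sentence exceeds the model's maximum length, so the output is identical to the padded output. To make the effect visible, here is a minimal sketch with an artificially small limit. `truncate` is a hypothetical helper, and `max_length=10` is chosen only for illustration; it assumes the sequence starts with `[CLS]` (`101`) and ends with `[SEP]` (`102`) and keeps the final `[SEP]` so the truncated sequence is still well-formed.

```python
def truncate(input_ids, max_length):
    """Cut a [CLS] ... [SEP] sequence down to max_length, keeping the final [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(input_ids) <= max_length:
        return list(input_ids)
    # Keep the first max_length - 1 ids, then re-append the closing [SEP].
    return input_ids[: max_length - 1] + [input_ids[-1]]


long_sentence = [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102]
short = truncate(long_sentence, max_length=10)
print(short)  # [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 102]
```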
+								### Build tensors
+								Finally, you want the tokenizer to return the actual tensors that get fed to the model.
 								Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:
+								<frameworkcontent>
 								<pt>
+								```py
 								>>> batch_sentences = [
 								...     "But what about second breakfast?",
 								...     "Don't think he knows about second breakfast, Pip.",
 								...     "What about elevensies?",
 								... ]
+								>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
+								>>> print(encoded_input)
+								{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
 								                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
 								                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 								 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 								 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
 								                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 								                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
+								```
 								</pt>
 								<tf>
 								```py
+								>>> batch_sentences = [
 								...     "But what about second breakfast?",
 								...     "Don't think he knows about second breakfast, Pip.",
 								...     "What about elevensies?",
 								... ]
+								>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
+								>>> print(encoded_input)
 								{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
+								array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
 								       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
 								       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
+								      dtype=int32)>,
 								 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
+								array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 								       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 								 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
+								array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
 								       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 								       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
+								```
+								</tf>
 								</frameworkcontent>
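Tensors must be rectangular, which is why padding (and, if needed, truncation) comes before this step. The sketch below shows the kind of shape check a tensor library effectively performs. `batch_shape` is a hypothetical plain-Python helper and isn't part of 🤗 Transformers, PyTorch, or TensorFlow.

```python
def batch_shape(batch):
    """Return (batch_size, sequence_length), or raise if the rows are ragged."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


ragged = [[101, 1252, 102], [101, 102]]          # unpadded: rows differ in length
rectangular = [[101, 1252, 102], [101, 102, 0]]  # padded: rows share one length

print(batch_shape(rectangular))  # (2, 3)
try:
    batch_shape(ragged)
except ValueError as err:
    print(err)  # ragged batch, row lengths: [2, 3]
```

Applied to the padded batch in this section, the same check gives `(3, 15)`: three sentences of fifteen ids each.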
+								## Audio
+								For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.
+								Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:
+								```py
 								>>> from datasets import load_dataset, Audio
+								>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
+								```
+								Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:
+								```py
+								>>> dataset[0]["audio"]
 								{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
 								         0.        ,  0.        ], dtype=float32),
 								 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 								 'sampling_rate': 8000}
-												Update tutorial docs (#15165)

* first draft of pipeline, autoclass, preprocess tutorials

* apply review feedback

* 🖍 apply feedback from patrick/niels

* 📝add output image to preprocessed image

* 🖍 apply feedback from patrick
											
										
										
											2022-02-02 00:31:35 +00:00
+								```
-												Convert tutorials (#14665)

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data.

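The link between sampling rate, array length, and clip duration can be sketched in plain Python. This is only an illustration: the two helper functions and the 8kHz sample count are assumptions made up for the example, not part of 🤗 Datasets or 🤗 Transformers, and a real resampler may differ from the estimate by a sample or so.

```py
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Clip length in seconds: number of samples divided by samples per second."""
    return num_samples / sampling_rate


def estimated_resampled_length(num_samples: int, source_rate: int, target_rate: int) -> int:
    """Estimate how many samples a clip has after resampling (the duration stays the same)."""
    return round(num_samples * target_rate / source_rate)


# Hypothetical clip with 80,000 samples recorded at 8kHz (10 seconds of audio).
num_samples_8k = 80_000
num_samples_16k = estimated_resampled_length(num_samples_8k, 8_000, 16_000)

print(duration_seconds(num_samples_8k, 8_000))    # 10.0
print(num_samples_16k)                            # 160000
print(duration_seconds(num_samples_16k, 16_000))  # 10.0
```

Upsampling doubles the number of data points but leaves the duration unchanged, which is why the `array` gets longer after resampling.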

1. Use 🤗 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         .8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

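As a rough illustration of what padding means for audio, the sketch below right-pads plain Python lists with zeros until every waveform is as long as the longest one. `pad_with_silence` is a hypothetical helper written for this example; the feature extractor performs padding (and normalization) for you, so you would not normally write this yourself.

```py
def pad_with_silence(batch, padding_value=0.0):
    """Right-pad every waveform in the batch with zeros up to the longest one."""
    longest = max(len(waveform) for waveform in batch)
    return [list(waveform) + [padding_value] * (longest - len(waveform)) for waveform in batch]


# Two toy waveforms of different lengths.
batch = [[0.1, -0.2, 0.3], [0.5]]
padded = pad_with_silence(batch)

print(padded)  # [[0.1, -0.2, 0.3], [0.5, 0.0, 0.0]]
```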

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        .6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```

Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)
>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

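The interplay of `padding=True`, `truncation=True`, and `max_length` in the function above can be modeled roughly as follows: sequences longer than `max_length` are cut first, then shorter ones are zero-padded to the longest remaining sequence. `truncate_then_pad` is a hypothetical helper that reflects our understanding of this behavior on plain Python lists; it is a simplified sketch, not the library's implementation.

```py
def truncate_then_pad(batch, max_length, padding_value=0.0):
    """Cut each waveform to max_length, then zero-pad to the longest one in the batch."""
    truncated = [list(waveform[:max_length]) for waveform in batch]
    longest = max(len(waveform) for waveform in truncated)
    return [waveform + [padding_value] * (longest - len(waveform)) for waveform in truncated]


# Toy waveforms of lengths 6 and 3, with a maximum length of 4.
batch = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
fixed = truncate_then_pad(batch, max_length=4)

print([len(waveform) for waveform in fixed])  # [4, 4]
print(fixed[1])                               # [0.7, 0.8, 0.9, 0.0]
```

In this sketch the output only reaches `max_length` when at least one waveform in the batch is that long, which is the case for the two audio samples inspected earlier.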

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)
>>> processed_dataset["input_values"][1].shape
(100000,)
```

## Computer vision

For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model.
Image preprocessing consists of several steps that convert images into the input expected by the model. These steps
include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors.

<Tip>

Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation
transform image data, but they serve different purposes:

* Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations.
* Image preprocessing guarantees that the images match the model’s expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained.

You can use any library you like for image augmentation. For image preprocessing, use the `ImageProcessor` associated with the model.

</Tip>

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

<Tip>

Use 🤗 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the image processor with [`AutoImageProcessor.from_pretrained`]:

```py
>>> from transformers import AutoImageProcessor
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```

First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Here we use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain together a couple of
transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html).
Note that for resizing, we can get the image size requirements from the `image_processor`. For some models, an exact height and
width are expected, for others only the `shortest_edge` is defined.

```py
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
```

* Extended the CV preprocessing section with more details and refactored the example

* added padding to the CV section, though it is a special case

* Added a tip about post processing methods

* make style

* link update

* Apply suggestions from review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* review feedback

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
											
										
										
											2023-01-19 13:43:36 +00:00
The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)
as its input. `ImageProcessor` can take care of normalizing the images and generating the appropriate tensors.
Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`:
```py
>>> def transforms(examples):
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples
```
<Tip>

In the example above, we set `do_resize=False` because the images were already resized in the image augmentation transformation,
which used the `size` attribute from the appropriate `image_processor`. If you do not resize images during image augmentation,
leave this parameter out. By default, `ImageProcessor` handles the resizing.

If you wish to normalize images as part of the augmentation transformation, use the `image_processor.image_mean`
and `image_processor.image_std` values.

</Tip>
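To make the normalization step more concrete, here is a minimal sketch in plain Python of per-channel normalization. The `normalize` helper and the mean and standard deviation values are illustrative assumptions, not the values of any particular checkpoint. In practice, you would read them from `image_processor.image_mean` and `image_processor.image_std`.

```python
def normalize(image, mean, std):
    """Normalize a channels-first image given as nested lists [C][H][W].

    Hypothetical helper for illustration only. Pixel values are assumed to be
    already rescaled to the [0, 1] range.
    """
    return [
        [[(pixel - m) / s for pixel in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]


# Illustrative statistics (assumed, not taken from a real image processor).
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

# A tiny image with 3 channels, 1 row, and 2 columns.
image = [[[0.0, 1.0]], [[0.5, 0.5]], [[1.0, 0.0]]]

normalized = normalize(image, mean, std)
print(normalized)  # [[[-1.0, 1.0]], [[0.0, 0.0]], [[1.0, -1.0]]]
```

Each channel is shifted by its own mean and divided by its own standard deviation, which is why the statistics are given as one value per channel.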
Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly:
```py
>>> dataset.set_transform(transforms)
```
Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!
```py
>>> dataset[0].keys()
```
Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.
```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```
<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>
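The `permute(1, 2, 0)` call in the plotting code is needed because `pixel_values` is stored channels-first, as (channels, height, width), while `plt.imshow` expects channels-last, as (height, width, channels). The sketch below shows the same reordering on nested Python lists. The `channels_first_to_last` helper is a hypothetical name used only for illustration.

```python
def channels_first_to_last(image):
    """Convert a nested-list image from [C][H][W] to [H][W][C].

    Hypothetical helper that mirrors what `permute(1, 2, 0)` does on a tensor.
    """
    num_channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    return [
        [[image[c][h][w] for c in range(num_channels)] for w in range(width)]
        for h in range(height)
    ]


# 3 channels, 1 row, 2 columns.
chw = [[[1, 2]], [[3, 4]], [[5, 6]]]

hwc = channels_first_to_last(chw)
print(hwc)  # [[[1, 3, 5], [2, 4, 6]]]
```

After the reordering, each pixel holds its three channel values together, which is the layout a plotting library needs to draw an RGB image.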
<Tip>

For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, `ImageProcessor`
offers post-processing methods. These methods convert a model's raw outputs into meaningful predictions such as bounding boxes
or segmentation maps.

</Tip>

### Pad

In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training
time. This may cause images in a batch to have different sizes. You can use [`DetrImageProcessor.pad_and_create_pixel_mask`]
from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images together.

```py
>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad_and_create_pixel_mask(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch
```
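To illustrate the idea behind padding with a pixel mask, here is a minimal sketch in plain Python. It is not the library's implementation: the `pad_and_mask` helper is hypothetical, works on single-channel nested lists, and assumes zero padding on the bottom and right with a mask of 1 for real pixels and 0 for padding.

```python
def pad_and_mask(images):
    """Pad single-channel images ([H][W] nested lists) to a common size.

    Hypothetical helper for illustration. Pads with zeros on the bottom and
    right, and returns a mask with 1 for real pixels and 0 for padding.
    """
    max_height = max(len(img) for img in images)
    max_width = max(len(img[0]) for img in images)

    padded, masks = [], []
    for img in images:
        height, width = len(img), len(img[0])
        # Extend every row to the right, then append all-zero rows at the bottom.
        padded_img = [row + [0] * (max_width - width) for row in img]
        padded_img += [[0] * max_width for _ in range(max_height - height)]
        mask = [[1] * width + [0] * (max_width - width) for _ in range(height)]
        mask += [[0] * max_width for _ in range(max_height - height)]
        padded.append(padded_img)
        masks.append(mask)
    return padded, masks


small = [[1, 2]]  # 1 x 2 image
large = [[3, 4, 5], [6, 7, 8]]  # 2 x 3 image

pixel_values, pixel_mask = pad_and_mask([small, large])
print(pixel_values[0])  # [[1, 2, 0], [0, 0, 0]]
print(pixel_mask[0])  # [[1, 1, 0], [0, 0, 0]]
```

The mask lets the model tell real pixels apart from padding, so images of different sizes can share one batch tensor.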
## Multimodal
For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects, such as a tokenizer and a feature extractor.
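As a rough mental model of this coupling, the sketch below wires two toy components together. All three classes (`ToyFeatureExtractor`, `ToyTokenizer`, `ToyProcessor`) are hypothetical stand-ins written for this illustration and are not part of 🤗 Transformers. The sketch only assumes that audio is routed to the feature extractor and text to the tokenizer.

```python
class ToyFeatureExtractor:
    """Hypothetical stand-in: shifts raw audio samples to zero mean."""

    def __call__(self, audio, sampling_rate=None):
        mean = sum(audio) / len(audio)
        return {"input_values": [sample - mean for sample in audio]}


class ToyTokenizer:
    """Hypothetical stand-in: maps each character to an integer id."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return {"input_ids": [self.vocab[char] for char in text]}


class ToyProcessor:
    """Hypothetical processor that wraps both objects behind a single call."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, audio=None, text=None, sampling_rate=None):
        outputs = {}
        if audio is not None:
            # Audio goes to the feature extractor.
            outputs.update(self.feature_extractor(audio, sampling_rate=sampling_rate))
        if text is not None:
            # Text goes to the tokenizer, and the token ids become the labels.
            outputs["labels"] = self.tokenizer(text)["input_ids"]
        return outputs


toy_processor = ToyProcessor(ToyFeatureExtractor(), ToyTokenizer({"H": 0, "I": 1}))
batch = toy_processor(audio=[1.0, 3.0], text="HI", sampling_rate=16000)
print(batch)  # {'input_values': [-1.0, 1.0], 'labels': [0, 1]}
```

A single call prepares both modalities, so the rest of the pipeline only has to deal with one object.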
Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):
```py
>>> from datasets import load_dataset
>>> lj_speech = load_dataset("lj_speech", split="train")
```
For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:
```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         .3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}
>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```
Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!
```py
>>> from datasets import Audio
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```
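To build intuition for what resampling does to the waveform, here is a naive sketch that uses linear interpolation in plain Python. It is not how 🤗 Datasets resamples audio: the `resample_linear` helper is hypothetical, and real resamplers also apply a low-pass filter to avoid aliasing.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Naively resample a waveform with linear interpolation.

    Hypothetical helper for illustration only, with no anti-aliasing filter.
    """
    # The clip duration stays the same, so the sample count scales with the rate.
    num_out = len(samples) * target_rate // orig_rate
    out = []
    for i in range(num_out):
        # Position of output sample i on the original sample grid.
        pos = i * orig_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of silence at 22,050 Hz becomes 16,000 samples at 16 kHz.
one_second = [0.0] * 22050
resampled = resample_linear(one_second, 22050, 16000)
print(len(resampled))  # 16000

# Halving the rate of a short ramp keeps every other sample.
print(resample_linear([0.0, 1.0, 2.0, 3.0], 4, 2))  # [0.0, 2.0]
```

The number of samples per second changes while the duration of the clip stays the same, which is why the model sees the audio at the rate it was pretrained on.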
Load a processor with [`AutoProcessor.from_pretrained`]:
```py
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```
Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:
```py
>>> def prepare_dataset(example):
...     audio = example["audio"]
...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
...     return example
```

Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```
The processor has now added `input_values` and `labels`, and the audio has been downsampled to 16 kHz. You can pass your processed dataset to the model now!