mirror of
https://github.com/saymrwulf/transformers.git
synced 2026-05-14 20:58:08 +00:00
SqueezeBERT architecture (#7083)
SqueezeBERT architecture (#7083)

* configuration_squeezebert.py: thin wrapper around the BERT tokenizer; fix typos
* WIP modeling_squeezebert.py; get the multi-layer-output interface working; set up SqueezeBERT to use BertModelOutput when returning results
* SqueezeBERT documentation and formatting; allow a head mask that is an array of [None, ..., None]
* docs: path to vocab, pointers to cloud files, line length and indentation
* SqueezeBERT model cards; formatting of model cards; untrack modeling_squeezebert_scratchpad.py
* update AWS paths to vocab and config files
* get rid of stub of NSP code, and advise users to pretrain with MLM only
* fix rebase issues; redo rebase of modeling_auto.py; code-format auto-fixes
* move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert
* tests for SqueezeBERT modeling and tokenization; fix typos
* disable test_head_masking, since SqueezeBERT doesn't yet implement head masking
* fix issues exposed by test_modeling_squeezebert.py and test_tokenization_squeezebert.py
* fix an issue inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() called self.cls(), but there is no self.cls; the goal was to call self.lm_head()
* update copyright; resolve failing test_hidden_states_output; remove unused encoder_hidden_states and encoder_attention_mask
* add integration test; rename squeezebert-mnli --> squeezebert/squeezebert-mnli
* integrate feedback from patrickvonplaten and sgugger on programming style and documentation strings
* tiny change to order of imports
This commit is contained in:
parent
e2c935f561
commit
02ef825be2
18 changed files with 1950 additions and 14 deletions
README.md (15)

@@ -25,9 +25,9 @@
<p>State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
</h3>

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc. in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, then share them with the community on our [model hub](https://huggingface.co/models). At the same time, each Python module defining an architecture can be used as a standalone and modified to enable quick research experiments.

🤗 Transformers is backed by the two most popular deep learning libraries, [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/), with a seamless integration between them, allowing you to train your models with one then load them for inference with the other.
@@ -38,7 +38,7 @@
You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer an [inference API](https://huggingface.co/pricing) to use those models.

Here are a few examples:

- [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
- [Named Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
- [Text generation with GPT-2](https://huggingface.co/gpt2?text=A+long+time+ago%2C+)
@@ -51,7 +51,7 @@ Here are a few examples:
## Quick tour

To immediately use a model on a given text, we provide the `pipeline` API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:

```python
>>> from transformers import pipeline
```

@@ -62,7 +62,7 @@ To immediately use a model on a given text, we provide the `pipeline` API. Pipel

```python
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]
```

The second line of code downloads and caches the pretrained model used by the pipeline, the third line evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%.

Here is another example, using a pipeline to extract the answer to a question from some context:
@@ -127,7 +127,7 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta
1. Easily customize a model or an example to your needs:

- Examples for each architecture to reproduce the results published by the official authors of said architecture.
- Expose the models' internals as consistently as possible.
- Model files can be used independently of the library for quick experiments.

## Why shouldn't I use transformers?
@@ -179,12 +179,13 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LXMERT](https://github.com/airsplay/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://github.com/pytorch/fairseq/tree/master/examples/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook) released together with the paper [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
1. **[Pegasus](https://github.com/google-research/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook) released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
…multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
@@ -148,7 +148,10 @@ conversion utilities for the following models:
29. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized
    Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang
    Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
30. SqueezeBERT (from UC Berkeley) released with the paper
    `SqueezeBERT: What can computer vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`_
    by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
31. `Other community models <https://huggingface.co/models>`_, contributed by the `community
    <https://huggingface.co/users>`_.

.. toctree::
@@ -241,6 +244,7 @@ conversion utilities for the following models:
   model_doc/reformer
   model_doc/retribert
   model_doc/roberta
   model_doc/squeezebert
   model_doc/t5
   model_doc/transformerxl
   model_doc/xlm
docs/source/model_doc/squeezebert.rst (new file, 103 lines)

@@ -0,0 +1,103 @@
SqueezeBERT
----------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~

The SqueezeBERT model was proposed in
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__
by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer.
It's a bidirectional transformer similar to the BERT model.
The key difference between the BERT architecture and the SqueezeBERT architecture
is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.
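To make the parameter saving concrete, here is a minimal plain-Python sketch of why a grouped pointwise convolution is cheaper than a fully-connected layer. The hidden size of 768 matches the SqueezeBERT checkpoints documented below; the group count of 4 is an illustrative choice, not necessarily what every layer in the model uses.

```python
def dense_params(c_in, c_out):
    """Weights in a fully-connected (1x1 conv, groups=1) layer, bias ignored."""
    return c_in * c_out

def grouped_conv_params(c_in, c_out, groups, kernel=1):
    """Each of the `groups` groups connects c_in/groups inputs to
    c_out/groups outputs, so the weight count shrinks by a factor of `groups`."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * kernel * groups

hidden = 768
print(dense_params(hidden, hidden))             # 589824
print(grouped_conv_params(hidden, hidden, 4))   # 147456, i.e. 4x fewer weights
```

The grouped layer trades some cross-channel mixing for this reduction, which is where the speedup on mobile hardware comes from.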
The abstract from the paper is the following:

*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.*
Tips:

- SqueezeBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on
  the right rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
  `squeezebert/squeezebert-mnli-headless` checkpoint.
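The MLM objective mentioned in the tips can be sketched as follows. The 15% selection rate and the 80/10/10 corruption split are the standard BERT recipe, assumed here for illustration rather than taken from the SqueezeBERT docs:

```python
import random

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for MLM training: pick ~15% of positions;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)   # None = position not scored by the loss
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok         # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(tokens)  # random replacement
            # else: keep the original token (it is still scored)
    return corrupted, labels

corrupted, labels = mask_for_mlm(["the", "cat", "sat", "on", "the", "mat"] * 20)
```

At training time the model only receives `corrupted` and is scored on the non-`None` positions of `labels`, which is why an MLM-pretrained model sees bidirectional context and is weak at left-to-right generation.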
SqueezeBertConfig
~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertConfig
    :members:


SqueezeBertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
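The two tokenizer methods listed above follow the BERT input scheme that SqueezeBertTokenizer inherits. A plain-Python sketch of that scheme (the token ids 101 for `[CLS]` and 102 for `[SEP]` are BERT's conventional vocabulary values, assumed here for illustration):

```python
def build_inputs_with_special_tokens(ids_a, ids_b=None, cls=101, sep=102):
    """Single sequence: [CLS] A [SEP]; sequence pair: [CLS] A [SEP] B [SEP]."""
    if ids_b is None:
        return [cls] + ids_a + [sep]
    return [cls] + ids_a + [sep] + ids_b + [sep]

def create_token_type_ids_from_sequences(ids_a, ids_b=None):
    """Segment ids: 0 for [CLS] + A + its [SEP], 1 for B + its [SEP]."""
    if ids_b is None:
        return [0] * (len(ids_a) + 2)
    return [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)

print(build_inputs_with_special_tokens([7, 8]))           # [101, 7, 8, 102]
print(create_token_type_ids_from_sequences([7], [9, 9]))  # [0, 0, 0, 1, 1, 1]
```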
SqueezeBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertTokenizerFast
    :members:


SqueezeBertModel
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertModel
    :members:


SqueezeBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForMaskedLM
    :members:


SqueezeBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForSequenceClassification
    :members:


SqueezeBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForMultipleChoice
    :members:


SqueezeBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForTokenClassification
    :members:


SqueezeBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForQuestionAnswering
    :members:
@@ -426,4 +426,13 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | |
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| SqueezeBERT | ``squeezebert/squeezebert-uncased`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``squeezebert/squeezebert-mnli`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``squeezebert/squeezebert-mnli-headless`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. |
| | | | The final classification layer is removed, so when you finetune, the final layer will be reinitialized. |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
model_cards/squeezebert/squeezebert-mnli-headless/README.md (new file, 67 lines)

@@ -0,0 +1,67 @@
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---

# SqueezeBERT pretrained model

This model, `squeezebert-mnli-headless`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset. This is a "headless" model with the final classification layer removed, which allows Transformers to automatically reinitialize the final classification layer before you begin finetuning on your data.

SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).

The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining

### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)

### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
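The SOP task mentioned above can be sketched as building pairs of consecutive text segments with an order label. The ALBERT-style 50/50 swap of consecutive segments is an assumed recipe here, not a detail quoted from the SqueezeBERT paper:

```python
import random

def make_sop_example(seg_a, seg_b, rng):
    """Build one Sentence Order Prediction example from two consecutive
    segments: keep them in order (label 0) or swap them (label 1)."""
    if rng.random() < 0.5:
        return (seg_a, seg_b), 0   # correct order
    return (seg_b, seg_a), 1       # swapped order

rng = random.Random(0)
example, label = make_sop_example("He opened the door.", "The room was dark.", rng)
```

The model then classifies, from the `[CLS]` representation of the pair, whether the two segments appear in their original order.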
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
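The quoted hyperparameters imply a warmup of 0.28 × 56k ≈ 15.7k steps. A sketch of such a schedule, with linear warmup to the peak rate followed by linear decay to zero; only the peak LR, step count, and warmup proportion come from the quote, and the decay shape is an assumption:

```python
def lamb_style_lr(step, total_steps=56_000, peak_lr=2.5e-3, warmup_prop=0.28):
    """Learning rate at a given step: linear warmup over warmup_prop of
    training up to peak_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_prop)   # 15,680 steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lamb_style_lr(0))        # 0.0
print(lamb_style_lr(15_680))   # 0.0025 (the peak rate)
print(lamb_style_lr(56_000))   # 0.0
```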
## Finetuning

The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.

A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.

This model, `squeezebert/squeezebert-mnli-headless`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model. In this particular model, we have removed the final classification layer -- in other words, it is "headless." We recommend using this model if you intend to finetune the model on your own data. Using this model means that your final layer will automatically be reinitialized when you start finetuning on your data.

### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py

python examples/text-classification/run_glue.py \
    --model_name_or_path squeezebert/squeezebert-mnli-headless \
    --task_name mrpc \
    --data_dir ./glue_data/MRPC \
    --output_dir ./models/squeezebert_mrpc \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --learning_rate 3e-05 \
    --per_device_train_batch_size 16 \
    --save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
    author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
    title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
    journal = {arXiv:2006.11316},
    year = {2020}
}
```
model_cards/squeezebert/squeezebert-mnli/README.md (new file, 67 lines)

@@ -0,0 +1,67 @@
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---

# SqueezeBERT pretrained model

This model, `squeezebert-mnli`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset.

SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).

The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining

### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)

### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)

From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning

The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.

A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.

This model, `squeezebert/squeezebert-mnli`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model.

### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py

python examples/text-classification/run_glue.py \
    --model_name_or_path squeezebert/squeezebert-mnli \
    --task_name mrpc \
    --data_dir ./glue_data/MRPC \
    --output_dir ./models/squeezebert_mrpc \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --learning_rate 3e-05 \
    --per_device_train_batch_size 16 \
    --save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
    author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
    title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
    journal = {arXiv:2006.11316},
    year = {2020}
}
```
model_cards/squeezebert/squeezebert-uncased/README.md (new file, 67 lines)

@@ -0,0 +1,67 @@
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---

# SqueezeBERT pretrained model

This model, `squeezebert-uncased`, is a pretrained model for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective.

SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).

The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining

### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)

### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)

From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning

The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.

A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.

This model, `squeezebert/squeezebert-uncased`, has been pretrained but not finetuned. For most text classification tasks, we recommend using `squeezebert/squeezebert-mnli-headless` as a starting point.

### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py

python examples/text-classification/run_glue.py \
    --model_name_or_path squeezebert/squeezebert-mnli-headless \
    --task_name mrpc \
    --data_dir ./glue_data/MRPC \
    --output_dir ./models/squeezebert_mrpc \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --learning_rate 3e-05 \
    --per_device_train_batch_size 16 \
    --save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
    author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
    title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
    journal = {arXiv:2006.11316},
    year = {2020}
}
```
|
|
@@ -57,6 +57,7 @@ from .configuration_rag import RagConfig
 from .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig
 from .configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig
 from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
+from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig
 from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
 from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
 from .configuration_utils import PretrainedConfig

@@ -184,6 +185,7 @@ from .tokenization_rag import RagTokenizer
 from .tokenization_reformer import ReformerTokenizer
 from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
 from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
+from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
 from .tokenization_t5 import T5Tokenizer
 from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast
 from .tokenization_utils import PreTrainedTokenizer

@@ -453,6 +455,17 @@ if is_torch_available():
         RobertaForTokenClassification,
         RobertaModel,
     )
+    from .modeling_squeezebert import (
+        SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
+        SqueezeBertForMaskedLM,
+        SqueezeBertForMultipleChoice,
+        SqueezeBertForQuestionAnswering,
+        SqueezeBertForSequenceClassification,
+        SqueezeBertForTokenClassification,
+        SqueezeBertModel,
+        SqueezeBertModule,
+        SqueezeBertPreTrainedModel,
+    )
     from .modeling_t5 import (
         T5_PRETRAINED_MODEL_ARCHIVE_LIST,
         T5ForConditionalGeneration,
@@ -44,6 +44,7 @@ from .configuration_rag import RagConfig
 from .configuration_reformer import ReformerConfig
 from .configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig
 from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
+from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig
 from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
 from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
 from .configuration_utils import PretrainedConfig

@@ -80,6 +81,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
         LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
         DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
         DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
     ]
     for key, value, in pretrained_map.items()
 )

@@ -105,6 +107,7 @@ CONFIG_MAPPING = OrderedDict(
         ("deberta", DebertaConfig),
         ("flaubert", FlaubertConfig),
         ("fsmt", FSMTConfig),
+        ("squeezebert", SqueezeBertConfig),
         ("bert", BertConfig),
         ("openai-gpt", OpenAIGPTConfig),
         ("gpt2", GPT2Config),

@@ -141,6 +144,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
         ("roberta", "RoBERTa"),
         ("flaubert", "FlauBERT"),
         ("fsmt", "FairSeq Machine-Translation"),
+        ("squeezebert", "SqueezeBERT"),
         ("bert", "BERT"),
         ("openai-gpt", "OpenAI GPT"),
         ("gpt2", "OpenAI GPT-2"),

153  src/transformers/configuration_squeezebert.py  Normal file
@@ -0,0 +1,153 @@
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SqueezeBERT model configuration """

from .configuration_utils import PretrainedConfig
from .utils import logging


logger = logging.get_logger(__name__)

SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "squeezebert/squeezebert-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-uncased/config.json",
    "squeezebert/squeezebert-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli/config.json",
    "squeezebert/squeezebert-mnli-headless": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli-headless/config.json",
}


class SqueezeBertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.SqueezeBertModel`. It is
    used to instantiate a SqueezeBERT model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, optional, defaults to 30522):
            Vocabulary size of the SqueezeBERT model. Defines the different tokens that can be represented by the
            :obj:`inputs_ids` passed to the forward method of :class:`~transformers.SqueezeBertModel`.
        hidden_size (:obj:`int`, optional, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, optional, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, optional, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, optional, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"`, :obj:`"gelu_new"`, :obj:`"gelu_fast"` and :obj:`"mish"`
            are supported.
        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, optional, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, optional, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed into :class:`~transformers.SqueezeBertModel`.
        initializer_range (:obj:`float`, optional, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        pad_token_id (:obj:`int`, optional, defaults to 0):
            The ID of the token in the word embedding to use as padding.
        embedding_size (:obj:`int`, optional, defaults to 768):
            The dimension of the word embedding vectors.
        q_groups (:obj:`int`, optional, defaults to 4):
            The number of groups in the Q layer.
        k_groups (:obj:`int`, optional, defaults to 4):
            The number of groups in the K layer.
        v_groups (:obj:`int`, optional, defaults to 4):
            The number of groups in the V layer.
        post_attention_groups (:obj:`int`, optional, defaults to 1):
            The number of groups in the first feed forward network layer.
        intermediate_groups (:obj:`int`, optional, defaults to 4):
            The number of groups in the second feed forward network layer.
        output_groups (:obj:`int`, optional, defaults to 4):
            The number of groups in the third feed forward network layer.

    Example:

        >>> from transformers import SqueezeBertModel, SqueezeBertConfig

        >>> # Initializing a SqueezeBERT configuration
        >>> configuration = SqueezeBertConfig()

        >>> # Initializing a model from the configuration above
        >>> model = SqueezeBertModel(configuration)

        >>> # Accessing the model configuration
        >>> configuration = model.config

    Attributes:
        pretrained_config_archive_map (Dict[str, str]):
            A dictionary containing all the available pre-trained checkpoints.
    """

    pretrained_config_archive_map = SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
    model_type = "squeezebert"

    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        embedding_size=768,
        q_groups=4,
        k_groups=4,
        v_groups=4,
        post_attention_groups=1,
        intermediate_groups=4,
        output_groups=4,
        **kwargs
    ):
        super().__init__(pad_token_id=pad_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.embedding_size = embedding_size
        self.q_groups = q_groups
        self.k_groups = k_groups
        self.v_groups = v_groups
        self.post_attention_groups = post_attention_groups
        self.intermediate_groups = intermediate_groups
        self.output_groups = output_groups
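The `q_groups`, `k_groups`, `v_groups`, and related options above set the group counts of SqueezeBERT's grouped 1D convolutions, which stand in for BERT's fully-connected layers. As a back-of-the-envelope illustration (standard grouped-convolution parameter arithmetic, not code from this file), the weight count shrinks roughly by the group factor:

```python
def conv1d_params(c_in, c_out, kernel=1, groups=1, bias=True):
    # Each of the `groups` groups maps c_in/groups input channels to
    # c_out/groups output channels, so the weight count shrinks by `groups`.
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * c_out * kernel
    return weights + (c_out if bias else 0)

dense = conv1d_params(768, 768, groups=1)    # ungrouped, like a plain FC layer
grouped = conv1d_params(768, 768, groups=4)  # e.g. q_groups=4
print(dense, grouped)  # 590592 148224
```

With `groups=1` the layer is equivalent in cost to a fully-connected layer; `post_attention_groups=1` keeps that layer ungrouped.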
@@ -45,6 +45,7 @@ from .configuration_auto import (
     ReformerConfig,
     RetriBertConfig,
     RobertaConfig,
+    SqueezeBertConfig,
     T5Config,
     TransfoXLConfig,
     XLMConfig,

@@ -173,6 +174,14 @@ from .modeling_roberta import (
     RobertaForTokenClassification,
     RobertaModel,
 )
+from .modeling_squeezebert import (
+    SqueezeBertForMaskedLM,
+    SqueezeBertForMultipleChoice,
+    SqueezeBertForQuestionAnswering,
+    SqueezeBertForSequenceClassification,
+    SqueezeBertForTokenClassification,
+    SqueezeBertModel,
+)
 from .modeling_t5 import T5ForConditionalGeneration, T5Model
 from .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel
 from .modeling_xlm import (

@@ -218,6 +227,7 @@ MODEL_MAPPING = OrderedDict(
         (LongformerConfig, LongformerModel),
         (RobertaConfig, RobertaModel),
         (LayoutLMConfig, LayoutLMModel),
+        (SqueezeBertConfig, SqueezeBertModel),
         (BertConfig, BertModel),
         (OpenAIGPTConfig, OpenAIGPTModel),
         (GPT2Config, GPT2Model),

@@ -251,6 +261,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
         (FSMTConfig, FSMTForConditionalGeneration),
         (LongformerConfig, LongformerForMaskedLM),
         (RobertaConfig, RobertaForMaskedLM),
+        (SqueezeBertConfig, SqueezeBertForMaskedLM),
         (BertConfig, BertForPreTraining),
         (OpenAIGPTConfig, OpenAIGPTLMHeadModel),
         (GPT2Config, GPT2LMHeadModel),

@@ -278,6 +289,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
         (BartConfig, BartForConditionalGeneration),
         (LongformerConfig, LongformerForMaskedLM),
         (RobertaConfig, RobertaForMaskedLM),
+        (SqueezeBertConfig, SqueezeBertForMaskedLM),
         (BertConfig, BertForMaskedLM),
         (OpenAIGPTConfig, OpenAIGPTLMHeadModel),
         (GPT2Config, GPT2LMHeadModel),

@@ -324,6 +336,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
         (XLMRobertaConfig, XLMRobertaForMaskedLM),
         (LongformerConfig, LongformerForMaskedLM),
         (RobertaConfig, RobertaForMaskedLM),
+        (SqueezeBertConfig, SqueezeBertForMaskedLM),
         (BertConfig, BertForMaskedLM),
         (MobileBertConfig, MobileBertForMaskedLM),
         (FlaubertConfig, FlaubertWithLMHeadModel),

@@ -355,6 +368,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
         (BartConfig, BartForSequenceClassification),
         (LongformerConfig, LongformerForSequenceClassification),
         (RobertaConfig, RobertaForSequenceClassification),
+        (SqueezeBertConfig, SqueezeBertForSequenceClassification),
         (BertConfig, BertForSequenceClassification),
         (XLNetConfig, XLNetForSequenceClassification),
         (MobileBertConfig, MobileBertForSequenceClassification),

@@ -375,6 +389,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
         (LongformerConfig, LongformerForQuestionAnswering),
         (XLMRobertaConfig, XLMRobertaForQuestionAnswering),
         (RobertaConfig, RobertaForQuestionAnswering),
+        (SqueezeBertConfig, SqueezeBertForQuestionAnswering),
         (BertConfig, BertForQuestionAnswering),
         (XLNetConfig, XLNetForQuestionAnsweringSimple),
         (FlaubertConfig, FlaubertForQuestionAnsweringSimple),

@@ -396,6 +411,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
         (XLMRobertaConfig, XLMRobertaForTokenClassification),
         (LongformerConfig, LongformerForTokenClassification),
         (RobertaConfig, RobertaForTokenClassification),
+        (SqueezeBertConfig, SqueezeBertForTokenClassification),
         (BertConfig, BertForTokenClassification),
         (MobileBertConfig, MobileBertForTokenClassification),
         (XLNetConfig, XLNetForTokenClassification),

@@ -413,6 +429,7 @@ MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
         (XLMRobertaConfig, XLMRobertaForMultipleChoice),
         (LongformerConfig, LongformerForMultipleChoice),
         (RobertaConfig, RobertaForMultipleChoice),
+        (SqueezeBertConfig, SqueezeBertForMultipleChoice),
         (BertConfig, BertForMultipleChoice),
         (DistilBertConfig, DistilBertForMultipleChoice),
         (MobileBertConfig, MobileBertForMultipleChoice),

1023  src/transformers/modeling_squeezebert.py  Normal file
File diff suppressed because it is too large.
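Per the commit notes, the SqueezeBert entries are placed before the Bert entries because the auto classes walk these OrderedDicts and take the first matching entry; when matching is done by `isinstance`, a subclass listed after its base class would never be selected. A toy sketch with hypothetical stand-in classes (the real dispatch in `modeling_auto.py` is more involved):

```python
from collections import OrderedDict

# Hypothetical stand-ins; in this commit the tokenizer class, for example,
# inherits from its Bert counterpart.
class BertConfig:
    pass

class SqueezeBertConfig(BertConfig):
    pass

def lookup(mapping, config):
    # Return the model name of the first entry whose config class matches.
    # Subclasses must therefore be registered before their base classes.
    for config_class, model_name in mapping.items():
        if isinstance(config, config_class):
            return model_name
    raise ValueError("unsupported config")

wrong = OrderedDict([(BertConfig, "BertModel"), (SqueezeBertConfig, "SqueezeBertModel")])
right = OrderedDict([(SqueezeBertConfig, "SqueezeBertModel"), (BertConfig, "BertModel")])

print(lookup(wrong, SqueezeBertConfig()))  # BertModel -- the wrong model
print(lookup(right, SqueezeBertConfig()))  # SqueezeBertModel
```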
@@ -46,6 +46,7 @@ from .configuration_auto import (
     ReformerConfig,
     RetriBertConfig,
     RobertaConfig,
+    SqueezeBertConfig,
     T5Config,
     TransfoXLConfig,
     XLMConfig,

@@ -83,6 +84,7 @@ from .tokenization_rag import RagTokenizer
 from .tokenization_reformer import ReformerTokenizer
 from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
 from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
+from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
 from .tokenization_t5 import T5Tokenizer
 from .tokenization_transfo_xl import TransfoXLTokenizer, TransfoXLTokenizerFast
 from .tokenization_xlm import XLMTokenizer

@@ -117,6 +119,7 @@ TOKENIZER_MAPPING = OrderedDict(
         (LxmertConfig, (LxmertTokenizer, LxmertTokenizerFast)),
         (LayoutLMConfig, (LayoutLMTokenizer, LayoutLMTokenizerFast)),
         (DPRConfig, (DPRQuestionEncoderTokenizer, DPRQuestionEncoderTokenizerFast)),
+        (SqueezeBertConfig, (SqueezeBertTokenizer, SqueezeBertTokenizerFast)),
         (BertConfig, (BertTokenizer, BertTokenizerFast)),
         (OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),
         (GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),
78  src/transformers/tokenization_squeezebert.py  Normal file

@@ -0,0 +1,78 @@
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for SqueezeBERT."""

from .tokenization_bert import BertTokenizer, BertTokenizerFast
from .utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "squeezebert/squeezebert-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-uncased/vocab.txt",
        "squeezebert/squeezebert-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli/vocab.txt",
        "squeezebert/squeezebert-mnli-headless": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli-headless/vocab.txt",
    }
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "squeezebert/squeezebert-uncased": 512,
    "squeezebert/squeezebert-mnli": 512,
    "squeezebert/squeezebert-mnli-headless": 512,
}


PRETRAINED_INIT_CONFIGURATION = {
    "squeezebert/squeezebert-uncased": {"do_lower_case": True},
    "squeezebert/squeezebert-mnli": {"do_lower_case": True},
    "squeezebert/squeezebert-mnli-headless": {"do_lower_case": True},
}


class SqueezeBertTokenizer(BertTokenizer):
    r"""
    Constructs a SqueezeBert tokenizer.

    :class:`~transformers.SqueezeBertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs
    end-to-end tokenization: punctuation splitting + wordpiece.

    Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
    parameters.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION


class SqueezeBertTokenizerFast(BertTokenizerFast):
    r"""
    Constructs a "Fast" SqueezeBert tokenizer (backed by HuggingFace's `tokenizers` library).

    :class:`~transformers.SqueezeBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and
    runs end-to-end tokenization: punctuation splitting + wordpiece.

    Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
    parameters.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
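As the docstrings above note, both tokenizer classes reuse BERT's end-to-end tokenization: punctuation splitting followed by WordPiece. For intuition only, here is a toy greedy longest-match-first WordPiece segmenter (the vocabulary below is made up; the real tokenizer loads it from `vocab.txt`):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first segmentation, as in BERT's WordPiece.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no segmentation found: emit the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"squeeze", "##bert", "##ing"}
print(wordpiece("squeezebert", vocab))  # ['squeeze', '##bert']
```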
@@ -21,7 +21,7 @@ For a quick overview of the general philosphy of the library and its organizatio

 # Typical workflow for including a model

-Here an overview of the general workflow:
+Here an overview of the general workflow:

 - [ ] Add model/configuration/tokenization classes.
 - [ ] Add conversion scripts.

@@ -69,7 +69,7 @@ Here is the workflow for documentation:
 - [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.

 Make sure to check you have no sphinx warnings when building the documentation locally and follow our
-[documentaiton guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
+[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).

 ## Final steps
@@ -19,7 +19,6 @@
 ####################################################


-import logging
 import os

 import torch

@@ -37,9 +36,10 @@ from .modeling_outputs import (
     TokenClassifierOutput,
 )
 from .modeling_utils import PreTrainedModel
+from .utils import logging


-logger = logging.getLogger(__name__)
+logger = logging.get_logger(__name__)

 _CONFIG_FOR_DOC = "XXXConfig"
 _TOKENIZER_FOR_DOC = "XXXTokenizer"

@@ -433,7 +433,7 @@ class XxxForMaskedLM(XxxPreTrainedModel):
         )

         sequence_output = outputs[0]
-        prediction_scores = self.cls(sequence_output)
+        prediction_scores = self.lm_head(sequence_output)

         masked_lm_loss = None
         if labels is not None:
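The last hunk above fixes the template to call `self.lm_head` instead of the nonexistent `self.cls`. Conceptually, a masked-LM head is an affine projection from hidden states back to vocabulary logits. A tiny stdlib sketch with made-up dimensions (seq_len=2, hidden=2, vocab=3), purely to show the shape transformation:

```python
def lm_head(sequence_output, weight, bias):
    # Project hidden states (seq_len x hidden) to vocab logits (seq_len x vocab):
    # logits[t][v] = sum_h sequence_output[t][h] * weight[v][h] + bias[v]
    return [
        [sum(h * w for h, w in zip(hidden, row)) + b
         for row, b in zip(weight, bias)]
        for hidden in sequence_output
    ]

hidden_states = [[1.0, 0.0], [0.0, 2.0]]       # seq_len=2, hidden=2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab=3 rows of hidden=2
bias = [0.0, 0.0, 0.0]
print(lm_head(hidden_states, weight, bias))  # [[1.0, 0.0, 1.0], [0.0, 2.0, 2.0]]
```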
284  tests/test_modeling_squeezebert.py  Normal file

@@ -0,0 +1,284 @@
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import unittest

from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device

from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask


if is_torch_available():
    import torch

    from transformers import (
        SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
        SqueezeBertConfig,
        SqueezeBertForMaskedLM,
        SqueezeBertForMultipleChoice,
        SqueezeBertForQuestionAnswering,
        SqueezeBertForSequenceClassification,
        SqueezeBertForTokenClassification,
        SqueezeBertModel,
    )


class SqueezeBertModelTester(object):
    def __init__(
        self,
        parent,
        batch_size=13,
        seq_length=7,
        is_training=True,
        use_input_mask=True,
        use_token_type_ids=False,
        use_labels=True,
        vocab_size=99,
        hidden_size=32,
        num_hidden_layers=5,
        num_attention_heads=4,
        intermediate_size=64,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=16,
        type_sequence_label_size=2,
        initializer_range=0.02,
        num_labels=3,
        num_choices=4,
        scope=None,
        q_groups=2,
        k_groups=2,
        v_groups=2,
        post_attention_groups=2,
        intermediate_groups=4,
        output_groups=1,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.is_training = is_training
        self.use_input_mask = use_input_mask
        self.use_token_type_ids = use_token_type_ids
        self.use_labels = use_labels
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.type_sequence_label_size = type_sequence_label_size
        self.initializer_range = initializer_range
        self.num_labels = num_labels
        self.num_choices = num_choices
        self.scope = scope
        self.q_groups = q_groups
        self.k_groups = k_groups
        self.v_groups = v_groups
        self.post_attention_groups = post_attention_groups
        self.intermediate_groups = intermediate_groups
        self.output_groups = output_groups

    def prepare_config_and_inputs(self):
        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

        input_mask = None
        if self.use_input_mask:
            input_mask = random_attention_mask([self.batch_size, self.seq_length])

        sequence_labels = None
        token_labels = None
        choice_labels = None
        if self.use_labels:
            sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
            token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
            choice_labels = ids_tensor([self.batch_size], self.num_choices)

        config = SqueezeBertConfig(
            embedding_size=self.hidden_size,
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            intermediate_size=self.intermediate_size,
            hidden_act=self.hidden_act,
            hidden_dropout_prob=self.hidden_dropout_prob,
            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
            max_position_embeddings=self.max_position_embeddings,
            initializer_range=self.initializer_range,
            q_groups=self.q_groups,
            k_groups=self.k_groups,
            v_groups=self.v_groups,
            post_attention_groups=self.post_attention_groups,
            intermediate_groups=self.intermediate_groups,
            output_groups=self.output_groups,
            return_dict=True,
        )

        return config, input_ids, input_mask, sequence_labels, token_labels, choice_labels

    def create_and_check_squeezebert_model(
        self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        model = SqueezeBertModel(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, input_mask)
        result = model(input_ids)
        self.parent.assertEqual(
            result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)
        )

    def create_and_check_squeezebert_for_masked_lm(
        self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        model = SqueezeBertForMaskedLM(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask, labels=token_labels)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))

    def create_and_check_squeezebert_for_question_answering(
        self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        model = SqueezeBertForQuestionAnswering(config=config)
        model.to(torch_device)
        model.eval()
        result = model(
            input_ids, attention_mask=input_mask, start_positions=sequence_labels, end_positions=sequence_labels
        )
        self.parent.assertEqual(result.start_logits.shape, (self.batch_size, self.seq_length))
        self.parent.assertEqual(result.end_logits.shape, (self.batch_size, self.seq_length))

    def create_and_check_squeezebert_for_sequence_classification(
        self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        config.num_labels = self.num_labels
        model = SqueezeBertForSequenceClassification(config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask, labels=sequence_labels)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))

    def create_and_check_squeezebert_for_token_classification(
        self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        config.num_labels = self.num_labels
        model = SqueezeBertForTokenClassification(config=config)
        model.to(torch_device)
        model.eval()

        result = model(input_ids, attention_mask=input_mask, labels=token_labels)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.num_labels))

    def create_and_check_squeezebert_for_multiple_choice(
        self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        config.num_choices = self.num_choices
        model = SqueezeBertForMultipleChoice(config=config)
        model.to(torch_device)
        model.eval()
        multiple_choice_inputs_ids = input_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
        multiple_choice_input_mask = input_mask.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
        result = model(
            multiple_choice_inputs_ids,
            attention_mask=multiple_choice_input_mask,
            labels=choice_labels,
        )
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_choices))

    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        (config, input_ids, input_mask, sequence_labels, token_labels, choice_labels) = config_and_inputs
        inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
        return config, inputs_dict


@require_torch
class SqueezeBertModelTest(ModelTesterMixin, unittest.TestCase):

    all_model_classes = (
        (
            SqueezeBertModel,
            SqueezeBertForMaskedLM,
            SqueezeBertForMultipleChoice,
            SqueezeBertForQuestionAnswering,
            SqueezeBertForSequenceClassification,
            SqueezeBertForTokenClassification,
        )
        if is_torch_available()
        else None
    )
    test_pruning = False
    test_torchscript = True
    test_resize_embeddings = True
    test_head_masking = False

    def setUp(self):
        self.model_tester = SqueezeBertModelTester(self)
        self.config_tester = ConfigTester(self, config_class=SqueezeBertConfig, dim=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_squeezebert_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_squeezebert_model(*config_and_inputs)

    def test_for_masked_lm(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_squeezebert_for_masked_lm(*config_and_inputs)

    def test_for_question_answering(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_squeezebert_for_question_answering(*config_and_inputs)

    def test_for_sequence_classification(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_squeezebert_for_sequence_classification(*config_and_inputs)

    def test_for_token_classification(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_squeezebert_for_token_classification(*config_and_inputs)

    def test_for_multiple_choice(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_squeezebert_for_multiple_choice(*config_and_inputs)

    @slow
    def test_model_from_pretrained(self):
        for model_name in SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
            model = SqueezeBertModel.from_pretrained(model_name)
            self.assertIsNotNone(model)


class SqueezeBertModelIntegrationTest(unittest.TestCase):
    @slow
    def test_inference_classification_head(self):
        model = SqueezeBertForSequenceClassification.from_pretrained("squeezebert/squeezebert-mnli")

        input_ids = torch.tensor([[0, 29414, 232, 328, 740, 1140, 12695, 69, 13, 1588, 2]])
|
||||
output = model(input_ids)[0]
|
||||
expected_shape = torch.Size((1, 3))
|
||||
self.assertEqual(output.shape, expected_shape)
|
||||
expected_tensor = torch.tensor([[0.5075, 0.0682, -0.5881]])
|
||||
self.assertTrue(torch.allclose(output, expected_tensor, atol=1e-4))
|
||||
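As an aside on the multiple-choice helper above: `input_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()` inserts a choice axis and replicates each example across it. A NumPy analogue (illustrative only, not part of the diff; the shapes here are made-up toy values) is:

```python
import numpy as np

batch_size, num_choices, seq_length = 2, 3, 4
input_ids = np.arange(batch_size * seq_length).reshape(batch_size, seq_length)

# Analogue of PyTorch's input_ids.unsqueeze(1).expand(-1, num_choices, -1):
# add a choice axis, then broadcast each example across it without copying.
expanded = np.broadcast_to(input_ids[:, None, :], (batch_size, num_choices, seq_length))

# The .contiguous() call in the test corresponds to materializing the view:
expanded = np.ascontiguousarray(expanded)

assert expanded.shape == (batch_size, num_choices, seq_length)
# every choice slot sees the same underlying example row
assert (expanded[:, 0, :] == expanded[:, 1, :]).all()
```

This is why the test can feed the same `input_ids` to all choices: the model only needs the `(batch, num_choices, seq_length)` shape to exercise the multiple-choice head.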
43  tests/test_tokenization_squeezebert.py  Normal file

@@ -0,0 +1,43 @@
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from transformers.testing_utils import slow
from transformers.tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast

from .test_tokenization_bert import BertTokenizationTest


class SqueezeBertTokenizationTest(BertTokenizationTest):

    tokenizer_class = SqueezeBertTokenizer

    def get_rust_tokenizer(self, **kwargs):
        return SqueezeBertTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)

    @slow
    def test_sequence_builders(self):
        tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")

        text = tokenizer.encode("sequence builders", add_special_tokens=False)
        text_2 = tokenizer.encode("multi-sequence build", add_special_tokens=False)

        encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
        encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)

        assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
        assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + text_2 + [
            tokenizer.sep_token_id
        ]
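For reference, the two assertions in `test_sequence_builders` pin down the BERT-style special-token layout that `build_inputs_with_special_tokens` produces. A standalone sketch with made-up token ids (101/102 are the conventional BERT `[CLS]`/`[SEP]` ids, assumed here for illustration) makes the expected layout explicit:

```python
# Hypothetical stand-in for tokenizer.build_inputs_with_special_tokens,
# using BERT-convention ids CLS=101, SEP=102 (illustrative, not the real class).
CLS, SEP = 101, 102

def build_inputs_with_special_tokens(token_ids, token_ids_pair=None):
    """Single sequence: [CLS] A [SEP]; pair: [CLS] A [SEP] B [SEP]."""
    if token_ids_pair is None:
        return [CLS] + token_ids + [SEP]
    return [CLS] + token_ids + [SEP] + token_ids_pair + [SEP]

single = build_inputs_with_special_tokens([7, 8, 9])
pair = build_inputs_with_special_tokens([7, 8], [30, 31])

assert single == [101, 7, 8, 9, 102]
assert pair == [101, 7, 8, 102, 30, 31, 102]
```

Since `SqueezeBertTokenizer` inherits from the BERT tokenizer, the real test only needs to confirm that this inherited layout still holds for the SqueezeBERT checkpoints.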