diff --git a/docs/source/imgs/ppl_chunked.gif b/docs/source/imgs/ppl_chunked.gif
new file mode 100644
index 000000000..2e3373693
Binary files /dev/null and b/docs/source/imgs/ppl_chunked.gif differ
diff --git a/docs/source/imgs/ppl_full.gif b/docs/source/imgs/ppl_full.gif
new file mode 100644
index 000000000..2869208fa
Binary files /dev/null and b/docs/source/imgs/ppl_full.gif differ
diff --git a/docs/source/imgs/ppl_sliding.gif b/docs/source/imgs/ppl_sliding.gif
new file mode 100644
index 000000000..d2dc26f55
Binary files /dev/null and b/docs/source/imgs/ppl_sliding.gif differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index accba53ee..a84ccd0a4 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -165,6 +165,7 @@ conversion utilities for the following models:
     :caption: Research
 
     bertology
+    perplexity
     benchmarks
 
 .. toctree::
diff --git a/docs/source/perplexity.rst b/docs/source/perplexity.rst
new file mode 100644
index 000000000..c89d849cc
--- /dev/null
+++ b/docs/source/perplexity.rst
@@ -0,0 +1,151 @@
+Perplexity of fixed-length models
+=================================
+
+Perplexity (PPL) is one of the most common metrics for evaluating language
+models. Before diving in, we should note that the metric applies specifically
+to classical language models (sometimes called autoregressive or causal
+language models) and is not well defined for masked language models like BERT
+(see :doc:`summary of the models <model_summary>`).
+
+Perplexity is defined as the exponentiated average negative log-likelihood of
+a sequence. If we have a tokenized sequence :math:`X = (x_0, x_1, \dots, x_t)`,
+then the perplexity of :math:`X` is,
+
+.. math::
+
+    \text{PPL}(X)
+    = \exp \left\{ -\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) \right\}
+
+where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token
+conditioned on the preceding tokens :math:`x_{<i}` according to our model. This
+is also equivalent to the exponentiation of the cross-entropy between the data
+and the model's predictions. For more intuition about perplexity and its
+relationship to Bits Per Character (BPC) and data compression, check out this
+`fascinating blog post on The Gradient
+<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
+
+Calculating PPL with fixed-length models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If we weren't limited by a model's context size, we would evaluate the
+model's perplexity by autoregressively factorizing a sequence and
+conditioning on the entire preceding subsequence at each step, as shown
+below.
+
+.. image:: imgs/ppl_full.gif
+   :width: 600
+   :alt: Full decomposition of a sequence with unlimited context length
+
+When working with approximate models, however, we typically have a constraint
+on the number of tokens the model can process. The largest version of
+:doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens,
+so we cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is
+greater than 1024.
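To make the definition in the new ``perplexity.rst`` concrete, here is a minimal sketch of computing PPL for one short sequence as the exponentiated average negative log-likelihood under a causal LM. It is not part of the patch; the model checkpoint and example text are arbitrary illustrative choices, and it relies on the standard 🤗 Transformers behavior that passing ``labels`` returns the mean negative log-likelihood as the loss.

.. code-block:: python

    # Hedged sketch: PPL(X) = exp(mean negative log-likelihood of the tokens of X).
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    # Any short piece of text works for illustration.
    encodings = tokenizer("Perplexity measures how well a model predicts text.", return_tensors="pt").to(device)

    with torch.no_grad():
        # With labels == input_ids, the returned loss is the average negative
        # log-likelihood over the predicted tokens, i.e. the exponent in PPL(X).
        outputs = model(encodings.input_ids, labels=encodings.input_ids)

    ppl = torch.exp(outputs.loss)
    print(f"Perplexity: {ppl.item():.2f}")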
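For sequences longer than the model's fixed length, a common workaround is to score the text in overlapping windows so each token is still conditioned on up to ``max_length - 1`` preceding tokens. The sketch below illustrates that idea only; the stride value, the dummy document, and the per-window bookkeeping are assumptions for illustration, not the example the documentation page itself provides.

.. code-block:: python

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    # A deliberately long dummy document; in practice this would be a test corpus.
    encodings = tokenizer(" ".join(["An example sentence."] * 2000), return_tensors="pt")

    max_length = model.config.n_positions  # 1024 for GPT-2
    stride = 512                           # how far each window advances

    seq_len = encodings.input_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0

    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens scored in this window; the rest are context
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # positions labeled -100 are ignored by the loss

        with torch.no_grad():
            # loss is (approximately) the mean negative log-likelihood over the
            # trg_len scored tokens in this window
            loss = model(input_ids, labels=target_ids).loss

        nll_sum += loss.item() * trg_len
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break

    ppl = torch.exp(torch.tensor(nll_sum / n_scored))
    print(f"Sliding-window perplexity: {ppl.item():.2f}")

A larger ``stride`` makes evaluation cheaper (fewer forward passes) but gives tokens near the start of each window less context, so the reported perplexity is a trade-off between cost and fidelity to the full factorization shown in ``ppl_full.gif``.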