# Tiktoken and interaction with Transformers
Support for tiktoken model files is seamlessly integrated into 🤗 Transformers when loading models with `from_pretrained` from a `tokenizer.model` tiktoken file on the Hub, which is automatically converted into our fast tokenizer.
Known models released with a `tiktoken.model` file:

- gpt2
- llama3
## Example usage
To load tiktoken files in `transformers`, ensure that the `tokenizer.model` file is a tiktoken file; it will then be loaded automatically by `from_pretrained`. Here is how one would load a tokenizer and a model, which can be loaded from the exact same file:
```py
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
```
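Once loaded, the converted tokenizer behaves like any other fast tokenizer. Below is a minimal sketch of encoding and decoding a string (the sample text is only an illustration, not part of the original example):

```py
# Encode a sample string into token ids and decode it back.
text = "Hello, tiktoken!"
ids = tokenizer(text)["input_ids"]
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))
```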
## Create tiktoken tokenizer
The `tokenizer.model` file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to `tokenizer.json`, the appropriate format for [`PreTrainedTokenizerFast`].

Generate the encoding with `tiktoken.get_encoding` and then convert it to `tokenizer.json` with [`convert_tiktoken_to_fast`].
```py
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")
```
The resulting `tokenizer.json` file is saved to the specified directory and can be loaded with [`PreTrainedTokenizerFast`].

```py
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
```
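To check that the conversion preserved the vocabulary, a minimal sketch (assuming the `encoding` and `tokenizer` objects from the snippets above) compares the token ids produced by tiktoken with those from the converted fast tokenizer; the two lists should match if the merges and vocabulary were carried over:

```py
text = "Hello, tiktoken!"

# Token ids from the original tiktoken encoding
print(encoding.encode(text))

# Token ids from the converted fast tokenizer; add_special_tokens=False
# keeps the comparison fair, since tiktoken does not add special tokens.
print(tokenizer(text, add_special_tokens=False)["input_ids"])
```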