transformers/tests/tokenization/test_tokenization_fast.py

# coding=utf-8
# Copyright 2019 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This will reduce "Already borrowed error": (#12550) * This will reduce "Already borrowed error": Original issue https://github.com/huggingface/tokenizers/issues/537 The original issue is caused by transformers calling many times mutable functions on the rust tokenizers. Rust needs to guarantee that only 1 agent has a mutable reference to memory at a given time (for many reasons which don't need explaining here). Usually, the rust compiler can guarantee that this property is true at compile time. Unfortunately, this is impossible for Python to do that, so PyO3, the bridge between rust and python used by `tokenizers`, will change the compile guarantee for a dynamic guarantee, so if multiple agents try to have multiple mutable borrows at the same time, then the runtime will yell with "Already borrowed". The proposed fix here in transformers, is simply to reduce the actual number of calls that really need mutable borrows. By reducing them, we reduce the risk of running into "Already borrowed" error. The caveat is now we add a call to read the current configuration of the `_tokenizer`, so worst case we have 2 calls instead of 1, and best case we simply have 1 + a Python comparison of a dict (should be negligible). * Adding a test. * trivial error :(. * Update tests/test_tokenization_fast.py Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * Adding reference to original issues in the tests. * Update the tests with fast tokenizer. Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2021-07-09 07:36:05 +00:00
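The commit message above describes reading the current configuration first and calling the mutating function only when something actually changes. A minimal sketch of that idea follows. `FakeBackend` and `set_truncation` are hypothetical names made up for illustration; they are not the `tokenizers` or `transformers` API, and the real fix may differ in detail.

```python
class FakeBackend:
    """Hypothetical stand-in for the Rust tokenizer; it only counts mutating calls."""

    def __init__(self):
        self._truncation = None
        self.mutable_calls = 0

    @property
    def truncation(self):
        # Read-only access: in the real bridge this would not need a mutable borrow.
        return self._truncation

    def enable_truncation(self, max_length, stride=0, strategy="longest_first"):
        # Mutating access: this is the kind of call the commit tries to avoid repeating.
        self.mutable_calls += 1
        self._truncation = {"max_length": max_length, "stride": stride, "strategy": strategy}


def set_truncation(backend, max_length, stride=0, strategy="longest_first"):
    """Mutate the backend only if the requested configuration differs from the current one."""
    target = {"max_length": max_length, "stride": stride, "strategy": strategy}
    if backend.truncation != target:  # cheap dict comparison on the Python side
        backend.enable_truncation(**target)


backend = FakeBackend()
for _ in range(5):
    set_truncation(backend, max_length=8)
print(backend.mutable_calls)  # 1: only the first call had to mutate
```

The worst case is one read plus one mutation, and the common case (unchanged settings) is a read plus a dict comparison, which matches the trade-off stated in the commit message.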
import concurrent.futures
import json
import os
import shutil
import tempfile
import unittest

from transformers import AutoTokenizer, PreTrainedTokenizerFast
from transformers.testing_utils import require_tokenizers

from ..test_tokenization_common import TokenizerTesterMixin


@require_tokenizers
class PreTrainedTokenizationFastTest(TokenizerTesterMixin, unittest.TestCase):
    rust_tokenizer_class = PreTrainedTokenizerFast
    test_slow_tokenizer = False
    test_rust_tokenizer = True
    from_pretrained_vocab_key = "tokenizer_file"

    def setUp(self):
        self.test_rust_tokenizer = False  # because we don't have pretrained_vocab_files_map
        super().setUp()
        self.test_rust_tokenizer = True

        model_paths = ["robot-test/dummy-tokenizer-fast", "robot-test/dummy-tokenizer-wordlevel"]
        self.bytelevel_bpe_model_name = "SaulLu/dummy-tokenizer-bytelevel-bpe"

        # Inclusion of 2 tokenizers to test different types of models (Unigram and WordLevel for the moment)
        self.tokenizers_list = [(PreTrainedTokenizerFast, model_path, {}) for model_path in model_paths]

        tokenizer = PreTrainedTokenizerFast.from_pretrained(model_paths[0])
        tokenizer.save_pretrained(self.tmpdirname)

    def test_tokenizer_mismatch_warning(self):
        # We disable this test for PreTrainedTokenizerFast because it is the only tokenizer that is not linked to any
        # model
        pass

    @unittest.skip(
        "We disable this test for PreTrainedTokenizerFast because it is the only tokenizer that is not linked to any model"
    )
    def test_encode_decode_with_spaces(self):
        pass

    @unittest.skip(
        "We disable this test for PreTrainedTokenizerFast because it is the only tokenizer that is not linked to any model"
    )
    def test_added_tokens_serialization(self):
        pass

    @unittest.skip(
        "We disable this test for PreTrainedTokenizerFast because it is the only tokenizer that is not linked to any model"
    )
    def test_additional_special_tokens_serialization(self):
        pass

    def test_prepare_for_model(self):
        # We disable this test for PreTrainedTokenizerFast because it is the only tokenizer that is not linked to any
        # model
        pass

    def test_rust_tokenizer_signature(self):
        # PreTrainedTokenizerFast doesn't have tokenizer_file in its signature
        pass

    def test_training_new_tokenizer(self):
        tmpdirname_orig = self.tmpdirname
        # Here we want to test the 2 available tokenizers that use 2 different types of models: Unigram and WordLevel.
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
                try:
                    self.tmpdirname = tempfile.mkdtemp()
                    tokenizer = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                    tokenizer.save_pretrained(self.tmpdirname)
                    super().test_training_new_tokenizer()
                finally:
                    # Even if the test fails, we must be sure that the folder is deleted and that the default tokenizer
                    # is restored
                    shutil.rmtree(self.tmpdirname)
                    self.tmpdirname = tmpdirname_orig

    def test_training_new_tokenizer_with_special_tokens_change(self):
        tmpdirname_orig = self.tmpdirname
        # Here we want to test the 2 available tokenizers that use 2 different types of models: Unigram and WordLevel.
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
                try:
                    self.tmpdirname = tempfile.mkdtemp()
                    tokenizer = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                    tokenizer.save_pretrained(self.tmpdirname)
                    super().test_training_new_tokenizer_with_special_tokens_change()
                finally:
                    # Even if the test fails, we must be sure that the folder is deleted and that the default tokenizer
                    # is restored
                    shutil.rmtree(self.tmpdirname)
                    self.tmpdirname = tmpdirname_orig

    def test_training_new_tokenizer_with_bytelevel(self):
        tokenizer = self.rust_tokenizer_class.from_pretrained(self.bytelevel_bpe_model_name)

        toy_text_iterator = ("a" for _ in range(1000))
        new_tokenizer = tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)

        encoding_ids = new_tokenizer.encode("a🤗")
        self.assertEqual(encoding_ids, [64, 172, 253, 97, 245])

    def test_init_from_tokenizers_model(self):
        from tokenizers import Tokenizer

        sentences = ["Hello, y'all!", "How are you 😁 ? There should not be any issue right?"]

        tokenizer = Tokenizer.from_pretrained("google-t5/t5-base")
        # Enable padding
        tokenizer.enable_padding(pad_id=0, pad_token="<pad>", length=512, pad_to_multiple_of=8)
        self.assertEqual(
            tokenizer.padding,
            {
                "length": 512,
                "pad_to_multiple_of": 8,
                "pad_id": 0,
                "pad_token": "<pad>",
                "pad_type_id": 0,
                "direction": "right",
            },
        )
        fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
        tmpdirname = tempfile.mkdtemp()
        fast_tokenizer.save_pretrained(tmpdirname)
        fast_from_saved = PreTrainedTokenizerFast.from_pretrained(tmpdirname)
        for tok in [fast_tokenizer, fast_from_saved]:
            self.assertEqual(tok.pad_token_id, 0)
            self.assertEqual(tok.padding_side, "right")
            self.assertEqual(tok.pad_token, "<pad>")
            self.assertEqual(tok.init_kwargs["max_length"], 512)
            self.assertEqual(tok.init_kwargs["pad_to_multiple_of"], 8)
            self.assertEqual(tok(sentences, padding = True), {'input_ids': [[8774, 6, 3, 63, 31, 1748, 55, 1, 0, 0, 0, 0,0, 0, 0, 0],[ 571, 33, 25, 3, 2, 3, 58, 290, 225, 59, 36, 136, 962, 269, 58, 1]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]})  # fmt: skip

        tokenizer.enable_truncation(8, stride=0, strategy="longest_first", direction="right")
        self.assertEqual(
            tokenizer.truncation, {"max_length": 8, "stride": 0, "strategy": "longest_first", "direction": "right"}
        )
        fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
        tmpdirname = tempfile.mkdtemp()
        fast_tokenizer.save_pretrained(tmpdirname)
        fast_from_saved = PreTrainedTokenizerFast.from_pretrained(tmpdirname)
        for tok in [fast_tokenizer, fast_from_saved]:
            self.assertEqual(tok.truncation_side, "right")
            self.assertEqual(tok.init_kwargs["truncation_strategy"], "longest_first")
            self.assertEqual(tok.init_kwargs["max_length"], 8)
            self.assertEqual(tok.init_kwargs["stride"], 0)
            # NOTE even if the model has a default max_length, it is not used...
            # thus tok(sentences, truncation = True) does nothing and does not warn either
            self.assertEqual(tok(sentences, truncation = True, max_length = 8), {'input_ids': [[8774, 6, 3, 63, 31, 1748, 55, 1],[ 571, 33, 25, 3, 2, 3, 58, 1]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],[0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],[1, 1, 1, 1, 1, 1, 1, 1]]})  # fmt: skip


@require_tokenizers
class TokenizerVersioningTest(unittest.TestCase):
    def test_local_versioning(self):
        tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
        json_tokenizer = json.loads(tokenizer._tokenizer.to_str())
        json_tokenizer["model"]["vocab"]["huggingface"] = len(tokenizer)

        with tempfile.TemporaryDirectory() as tmp_dir:
            # Hack to save this in the tokenizer_config.json
            tokenizer.init_kwargs["fast_tokenizer_files"] = ["tokenizer.4.0.0.json"]
            tokenizer.save_pretrained(tmp_dir)
            # Use a context manager so the file is flushed and closed before it is read back
            with open(os.path.join(tmp_dir, "tokenizer.4.0.0.json"), "w") as f:
                json.dump(json_tokenizer, f)

            # This should pick the new tokenizer file as the version of Transformers is > 4.0.0
            new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
            self.assertEqual(len(new_tokenizer), len(tokenizer) + 1)
            json_tokenizer = json.loads(new_tokenizer._tokenizer.to_str())
            self.assertIn("huggingface", json_tokenizer["model"]["vocab"])

            # Will need to be adjusted if we reach v42 and this test is still here.
            # Should pick the old tokenizer file as the version of Transformers is < 42.0.0
            shutil.move(os.path.join(tmp_dir, "tokenizer.4.0.0.json"), os.path.join(tmp_dir, "tokenizer.42.0.0.json"))
            tokenizer.init_kwargs["fast_tokenizer_files"] = ["tokenizer.42.0.0.json"]
            tokenizer.save_pretrained(tmp_dir)
            new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
            self.assertEqual(len(new_tokenizer), len(tokenizer))
            json_tokenizer = json.loads(new_tokenizer._tokenizer.to_str())
            self.assertNotIn("huggingface", json_tokenizer["model"]["vocab"])

    def test_repo_versioning(self):
        # This repo has two tokenizer files, one for v4.0.0 and above with an added token, one for versions lower.
        repo = "hf-internal-testing/test-two-tokenizers"

        # This should pick the new tokenizer file as the version of Transformers is > 4.0.0
        tokenizer = AutoTokenizer.from_pretrained(repo)
        self.assertEqual(len(tokenizer), 28997)
        json_tokenizer = json.loads(tokenizer._tokenizer.to_str())
        self.assertIn("huggingface", json_tokenizer["model"]["vocab"])

        # Testing an older version by monkey-patching the version in the module where it's used.
        import transformers as old_transformers

        old_transformers.tokenization_utils_base.__version__ = "3.0.0"
        old_tokenizer = old_transformers.models.auto.AutoTokenizer.from_pretrained(repo)
        self.assertEqual(len(old_tokenizer), 28996)
        json_tokenizer = json.loads(old_tokenizer._tokenizer.to_str())
        self.assertNotIn("huggingface", json_tokenizer["model"]["vocab"])


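The versioning tests above depend on choosing between `tokenizer.json` and versioned files such as `tokenizer.4.0.0.json` according to the installed library version. The selection rule they exercise can be sketched roughly as below. `pick_tokenizer_file` and `parse_version` are hypothetical helpers written for illustration, not the library's own code, and the sketch assumes purely numeric dotted versions (no `dev` or `rc` suffixes).

```python
import re

# Matches file names such as "tokenizer.4.0.0.json" and captures the version part.
_VERSIONED = re.compile(r"^tokenizer\.(\d+(?:\.\d+)*)\.json$")


def parse_version(text):
    """Turn "4.30.0" into (4, 30, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def pick_tokenizer_file(candidates, current_version, default="tokenizer.json"):
    """Return the versioned file with the highest version <= current_version, else the default."""
    current = parse_version(current_version)
    versioned = []
    for name in candidates:
        match = _VERSIONED.match(name)
        if match:
            versioned.append((parse_version(match.group(1)), name))

    chosen = default
    for file_version, name in sorted(versioned):
        if file_version <= current:
            chosen = name
        else:
            break  # versions are sorted, so every later file is too new as well
    return chosen


files = ["tokenizer.json", "tokenizer.4.0.0.json"]
print(pick_tokenizer_file(files, "4.30.0"))  # tokenizer.4.0.0.json
print(pick_tokenizer_file(files, "3.0.0"))   # tokenizer.json
```

With a file named `tokenizer.42.0.0.json` and a 4.x library, the same rule falls back to `tokenizer.json`, which is the situation the second half of `test_local_versioning` checks.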
@require_tokenizers
class ReduceMutableBorrowTests(unittest.TestCase):
    def test_async_share_tokenizer(self):
        # See https://github.com/huggingface/transformers/pull/12550
        # and https://github.com/huggingface/tokenizers/issues/537
        tokenizer = PreTrainedTokenizerFast.from_pretrained("robot-test/dummy-tokenizer-wordlevel")
        text = "The Matrix is a 1999 science fiction action film."

        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(self.fetch, tokenizer, text) for _ in range(10)]
            return_value = [future.result() for future in futures]
            self.assertEqual(return_value, [[1, 10, 0, 8, 0, 18, 0, 0, 0, 2] for _ in range(10)])

    def fetch(self, tokenizer, text):
        return tokenizer.encode(text, truncation="longest_first", padding="longest")