mirror of
https://github.com/saymrwulf/transformers.git
synced 2026-05-14 20:58:08 +00:00
[Marian] documentation and AutoModel support (#4152)
- MarianSentencepieceTokenizer -> MarianTokenizer
- Start using unk token.
- Add docs page.
- Add better generation params to MarianConfig.
- More conversion utilities.
This commit is contained in:
parent
9d2f467bfb
commit
3487be75ef
14 changed files with 355 additions and 102 deletions
README.md

@@ -164,8 +164,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training

17. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
20. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
21. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.
20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.

These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
docs/source/index.rst

@@ -108,3 +108,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train

   model_doc/electra
   model_doc/dialogpt
   model_doc/reformer
   model_doc/marian
docs/source/model_doc/bart.rst

@@ -1,6 +1,6 @@

Bart
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer
docs/source/model_doc/marian.rst (new file, 43 lines)

@@ -0,0 +1,43 @@

MarianMTModel
----------------------------------------------------

**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer

These models are for machine translation. The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.

Opus Project
~~~~~~~~~~~~
The 1,000+ models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.

Implementation Notes
~~~~~~~~~~~~~~~~~~~~
- Each model is about 298 MB on disk; there are 1,000+ models.
- Models are named with the pattern ``Helsinki-NLP/opus-mt-{src_langs}-{targ_langs}``. If there are multiple source or target languages, they are joined by a ``+`` symbol.
- The 80 opus models that require BPE preprocessing are not supported.
- There is an outstanding issue w.r.t. multilingual models and language codes.
- The modeling code is the same as ``BartModel`` with a few minor modifications:

  - static (sinusoid) positional embeddings (``MarianConfig.static_position_embeddings=True``)
  - a new final_logits_bias (``MarianConfig.add_bias_logits=True``)
  - no layernorm_embedding (``MarianConfig.normalize_embedding=False``)
  - the model starts generating with ``pad_token_id`` (which has a 0 token_embedding) as the prefix (Bart uses ``</s>``)

- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``


MarianMTModel
~~~~~~~~~~~~~

PyTorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
Model API is identical to ``BartForConditionalGeneration``.
Available models are listed at the `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__.
This class inherits all functionality from ``BartForConditionalGeneration``; see that page for method signatures.

.. autoclass:: transformers.MarianMTModel
    :members:


MarianTokenizer
~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianTokenizer
    :members: prepare_translation_batch
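The ``Helsinki-NLP/opus-mt-{src_langs}-{targ_langs}`` naming convention described above can be parsed mechanically. A small sketch (the helper name is ours, not part of the library; it assumes the language pair is the final hyphen-separated segment):

```python
def parse_marian_name(model_id: str):
    """Split 'Helsinki-NLP/opus-mt-{src_langs}-{targ_langs}' into language lists.

    Multiple source or target languages are joined by '+', so each side
    is returned as a list of codes.
    """
    prefix = "Helsinki-NLP/opus-mt-"
    if not model_id.startswith(prefix):
        raise ValueError(f"unexpected model id: {model_id!r}")
    # Split on the last '-' so '+'-joined groups on either side stay intact.
    src, tgt = model_id[len(prefix):].rsplit("-", 1)
    return src.split("+"), tgt.split("+")
```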
docs/source/pretrained_models.rst

@@ -275,7 +275,7 @@ For a list that includes community-uploaded models, refer to `https://huggingfac

| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters |
| Bart | ``bart-large`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bart-large-mnli`` | | Adds a 2 layer classification head with 1 million parameters |

@@ -299,3 +299,6 @@ For a list that includes community-uploaded models, refer to `https://huggingfac

| Reformer | ``reformer-crime-and-punishment`` | | 6-layer, 256-hidden, 2-heads, 3M parameters |
| | | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| MarianMT | ``Helsinki-NLP/opus-mt-{src}-{tgt}`` | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size. |
| | | | (see `model list <https://huggingface.co/Helsinki-NLP>`_) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
src/transformers/__init__.py

@@ -248,7 +248,7 @@ if is_torch_available():

        BART_PRETRAINED_MODEL_ARCHIVE_MAP,
    )
    from .modeling_marian import MarianMTModel
    from .tokenization_marian import MarianSentencePieceTokenizer
    from .tokenization_marian import MarianTokenizer
    from .modeling_roberta import (
        RobertaForMaskedLM,
        RobertaModel,
src/transformers/configuration_auto.py

@@ -28,6 +28,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr

from .configuration_encoder_decoder import EncoderDecoderConfig
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
from .configuration_marian import MarianConfig
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
from .configuration_reformer import ReformerConfig
from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig

@@ -73,6 +74,7 @@ CONFIG_MAPPING = OrderedDict(

        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
        ("marian", MarianConfig,),
        ("bart", BartConfig,),
        ("reformer", ReformerConfig,),
        ("roberta", RobertaConfig,),
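`CONFIG_MAPPING` is an `OrderedDict` because, in this era of the library, auto-class dispatch matched each key as a substring of the model identifier, so more specific keys had to come first (e.g. "roberta" before "bert", since "roberta" contains "bert" as a substring). A minimal sketch of that lookup pattern, with illustrative class-name strings rather than the actual transformers internals:

```python
from collections import OrderedDict

# Order matters: the first key found in the model id wins, so keys that
# contain other keys as substrings must be listed earlier.
CONFIG_MAPPING = OrderedDict(
    [
        ("marian", "MarianConfig"),
        ("bart", "BartConfig"),
        ("roberta", "RobertaConfig"),
        ("bert", "BertConfig"),
    ]
)


def config_class_for(model_id: str) -> str:
    """Return the config class name whose key appears in model_id."""
    for pattern, config_cls in CONFIG_MAPPING.items():
        if pattern in model_id:
            return config_cls
    raise ValueError(f"no config found for {model_id!r}")
```

If "bert" preceded "roberta", `"roberta-base"` would wrongly dispatch to `BertConfig`; the insertion order prevents that.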
src/transformers/configuration_marian.py

@@ -23,4 +23,5 @@ PRETRAINED_CONFIG_ARCHIVE_MAP = {


class MarianConfig(BartConfig):
    model_type = "marian"
    pretrained_config_archive_map = PRETRAINED_CONFIG_ARCHIVE_MAP
src/transformers/convert_marian_to_pytorch.py

@@ -11,7 +11,8 @@ import numpy as np

import torch
from tqdm import tqdm

from transformers import MarianConfig, MarianMTModel, MarianSentencePieceTokenizer
from transformers import MarianConfig, MarianMTModel, MarianTokenizer
from transformers.hf_api import HfApi


def remove_prefix(text: str, prefix: str):
@@ -38,6 +39,19 @@ def load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is

    layer.load_state_dict(sd, strict=True)


def find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:
    """Find models that can accept src_lang as input and return tgt_lang as output."""
    prefix = "Helsinki-NLP/opus-mt-"
    api = HfApi()
    model_list = api.model_list()
    model_ids = [x.modelId for x in model_list if x.modelId.startswith("Helsinki-NLP")]
    src_and_targ = [
        remove_prefix(m, prefix).lower().split("-") for m in model_ids if "+" not in m
    ]  # ids containing '+' can't be loaded
    matching = [f"{prefix}{a}-{b}" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]
    return matching


def add_emb_entries(wemb, final_bias, n_special_tokens=1):
    vsize, d_model = wemb.shape
    embs_to_add = np.zeros((n_special_tokens, d_model))
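The Hub call aside, `find_pretrained_model` is a pure filtering step, so its core logic can be exercised offline on a hardcoded id list. The sketch below mirrors (but is not) the function in the diff:

```python
from typing import List


def remove_prefix(text: str, prefix: str) -> str:
    """Strip prefix if present (same helper the conversion script defines)."""
    return text[len(prefix):] if text.startswith(prefix) else text


def match_models(model_ids: List[str], src_lang: str, tgt_lang: str) -> List[str]:
    """Return ids that accept src_lang as input and emit tgt_lang as output."""
    prefix = "Helsinki-NLP/opus-mt-"
    src_and_targ = [
        remove_prefix(m, prefix).lower().split("-") for m in model_ids if "+" not in m
    ]  # ids containing '+' are skipped, mirroring the original comment
    return [f"{prefix}{a}-{b}" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]
```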
@@ -81,7 +95,12 @@ def find_model_file(dest_dir): # this one better

    return model_file


def parse_readmes(repo_path):
def make_registry(repo_path="Opus-MT-train/models"):
    if not (Path(repo_path) / "fr-en" / "README.md").exists():
        raise ValueError(
            f"repo_path:{repo_path} does not exist: "
            "You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling."
        )
    results = {}
    for p in Path(repo_path).ls():
        n_dash = p.name.count("-")
@@ -90,22 +109,53 @@ parse_readmes(repo_path)

    else:
        lns = list(open(p / "README.md").readlines())
        results[p.name] = _parse_readme(lns)
    return results
    return [(k, v["pre-processing"], v["download"]) for k, v in results.items()]


def download_all_sentencepiece_models(repo_path="Opus-MT-train/models"):
CH_GROUP = "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh"


def convert_all_sentencepiece_models(model_list=None, repo_path=None):
    """Requires 300GB"""
    save_dir = Path("marian_ckpt")
    if not Path(repo_path).exists():
        raise ValueError("You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git")
    results: dict = parse_readmes(repo_path)
    for k, v in tqdm(list(results.items())):
        if os.path.exists(save_dir / k):
            print(f"already have path {k}")
    dest_dir = Path("marian_converted")
    dest_dir.mkdir(exist_ok=True)
    if model_list is None:
        model_list: list = make_registry(repo_path=repo_path)
    for k, prepro, download in tqdm(model_list):
        if "SentencePiece" not in prepro:  # don't convert BPE models
            continue
        if "SentencePiece" not in v["pre-processing"]:
        if not os.path.exists(save_dir / k / "pytorch_model.bin"):
            download_and_unzip(download, save_dir / k)
        pair_name = k.replace(CH_GROUP, "ch_group")
        convert(save_dir / k, dest_dir / f"opus-mt-{pair_name}")


def lmap(f, x) -> List:
    return list(map(f, x))


def fetch_test_set(readmes_raw, pair):
    import wget

    download_url = readmes_raw[pair]["download"]
    test_set_url = download_url[:-4] + ".test.txt"
    fname = wget.download(test_set_url, f"opus_test_{pair}.txt")
    lns = Path(fname).open().readlines()
    src = lmap(str.strip, lns[::4])
    gold = lmap(str.strip, lns[1::4])
    mar_model = lmap(str.strip, lns[2::4])
    assert len(gold) == len(mar_model) == len(src)
    os.remove(fname)
    return src, mar_model, gold


def convert_whole_dir(path=Path("marian_ckpt/")):
    for subdir in tqdm(list(path.ls())):
        dest_dir = f"marian_converted/{subdir.name}"
        if (dest_dir / "pytorch_model.bin").exists():
            continue
        download_and_unzip(v["download"], save_dir / k)
        convert(source_dir, dest_dir)


def _parse_readme(lns):
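`fetch_test_set` relies on the OPUS test-file layout: records of four lines (source, gold reference, Marian output, blank), which it separates by stride-4 slicing. The slicing can be checked on a synthetic file body with no download; the helper name below is ours:

```python
from typing import List, Tuple


def lmap(f, x) -> List:
    return list(map(f, x))


def split_opus_test(lns: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Slice the 4-line-per-record OPUS test layout into (src, marian_output, gold)."""
    src = lmap(str.strip, lns[::4])        # line 0 of each record
    gold = lmap(str.strip, lns[1::4])      # line 1: reference translation
    mar_model = lmap(str.strip, lns[2::4])  # line 2: Marian C++ output
    assert len(gold) == len(mar_model) == len(src)
    return src, mar_model, gold
```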
@@ -131,7 +181,7 @@ def _parse_readme(lns):

    return subres


def write_metadata(dest_dir: Path):
def save_tokenizer_config(dest_dir: Path):
    dname = dest_dir.name.split("-")
    dct = dict(target_lang=dname[-1], source_lang="-".join(dname[:-1]))
    save_json(dct, dest_dir / "tokenizer_config.json")
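`save_tokenizer_config` derives both languages from the directory name: everything after the last `-` is the target, everything before it the source, so hyphenated multi-part source codes survive. The split logic alone (without the JSON write, and assuming a directory named like `en-de`):

```python
def tokenizer_config_for(dir_name: str) -> dict:
    """Derive source/target languages from a 'src-tgt' directory name."""
    parts = dir_name.split("-")
    # Last segment is the target; rejoin the rest as the (possibly compound) source.
    return {"target_lang": parts[-1], "source_lang": "-".join(parts[:-1])}
```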
@@ -148,13 +198,17 @@ def add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):

    return added


def find_vocab_file(model_dir):
    return list(model_dir.glob("*vocab.yml"))[0]


def add_special_tokens_to_vocab(model_dir: Path) -> None:
    vocab = load_yaml(model_dir / "opus.spm32k-spm32k.vocab.yml")
    vocab = load_yaml(find_vocab_file(model_dir))
    vocab = {k: int(v) for k, v in vocab.items()}
    num_added = add_to_vocab_(vocab, ["<pad>"])
    print(f"added {num_added} tokens to vocab")
    save_json(vocab, model_dir / "vocab.json")
    write_metadata(model_dir)
    save_tokenizer_config(model_dir)


def save_tokenizer(self, save_directory):
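The call site above shows the contract of `add_to_vocab_`: mutate the vocab in place, giving any missing special tokens fresh ids, and report how many were added. A sketch of that behavior under the assumption that new ids are assigned after the current maximum (the body is ours, inferred from the call site, not copied from the script):

```python
from typing import Dict, List


def add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]) -> int:
    """Add missing tokens at consecutive ids after the current max; return count added."""
    start = max(vocab.values()) + 1 if vocab else 0
    added = 0
    for tok in special_tokens:
        if tok not in vocab:
            vocab[tok] = start + added
            added += 1
    return added
```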
@@ -251,7 +305,6 @@ class OpusState:

        # Process decoder.yml
        decoder_yml = cast_marian_config(load_yaml(source_dir / "decoder.yml"))
        # TODO: what are normalize and word-penalty?
        check_marian_cfg_assumptions(cfg)
        self.hf_config = MarianConfig(
            vocab_size=cfg["vocab_size"],

@@ -273,6 +326,9 @@ class OpusState:

            dropout=0.1,  # see opus-mt-train repo/transformer-dropout param.
            # default: add_final_layer_norm=False,
            num_beams=decoder_yml["beam-size"],
            decoder_start_token_id=self.pad_token_id,
            bad_words_ids=[[self.pad_token_id]],
            max_length=512,
        )

    def _check_layer_entries(self):
@@ -349,12 +405,12 @@ def download_and_unzip(url, dest_dir):

    os.remove(filename)


def main(source_dir, dest_dir):
def convert(source_dir: Path, dest_dir):
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(exist_ok=True)

    add_special_tokens_to_vocab(source_dir)
    tokenizer = MarianSentencePieceTokenizer.from_pretrained(str(source_dir))
    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))
    save_tokenizer(tokenizer, dest_dir)

    opus_state = OpusState(source_dir)

@@ -377,7 +433,7 @@ if __name__ == "__main__":

    source_dir = Path(args.src)
    assert source_dir.exists()
    dest_dir = f"converted-{source_dir.name}" if args.dest is None else args.dest
    main(source_dir, dest_dir)
    convert(source_dir, dest_dir)


def load_yaml(path):
src/transformers/modeling_auto.py

@@ -39,6 +39,7 @@ from .configuration_auto import (

    XLMRobertaConfig,
    XLNetConfig,
)
from .configuration_marian import MarianConfig
from .configuration_utils import PretrainedConfig
from .modeling_albert import (
    ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,

@@ -98,6 +99,7 @@ from .modeling_flaubert import (

    FlaubertWithLMHeadModel,
)
from .modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2LMHeadModel, GPT2Model
from .modeling_marian import MarianMTModel
from .modeling_openai import OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OpenAIGPTLMHeadModel, OpenAIGPTModel
from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
from .modeling_roberta import (

@@ -214,6 +216,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(

        (AlbertConfig, AlbertForMaskedLM),
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
        (MarianConfig, MarianMTModel),
        (BartConfig, BartForConditionalGeneration),
        (RobertaConfig, RobertaForMaskedLM),
        (BertConfig, BertForMaskedLM),
src/transformers/modeling_marian.py

@@ -18,16 +18,30 @@

from transformers.modeling_bart import BartForConditionalGeneration


PRETRAINED_MODEL_ARCHIVE_MAP = {
    "opus-mt-en-de": "https://cdn.huggingface.co/Helsinki-NLP/opus-mt-en-de/pytorch_model.bin",
}


class MarianMTModel(BartForConditionalGeneration):
    """Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
    Model API is identical to BartForConditionalGeneration"""
    r"""
    Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
    Model API is identical to BartForConditionalGeneration.
    Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__

    pretrained_model_archive_map = PRETRAINED_MODEL_ARCHIVE_MAP
    Examples::

        from transformers import MarianTokenizer, MarianMTModel
        from typing import List
        src = 'fr'  # source language
        trg = 'en'  # target language
        sample_text = "où est l'arrêt de bus ?"
        mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'  # `Model List`__

        model = MarianMTModel.from_pretrained(mname)
        tok = MarianTokenizer.from_pretrained(mname)
        batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
        gen = model.generate(**batch)  # for forward pass: model(**batch)
        words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the bus stop ?"

    """

    pretrained_model_archive_map = {}  # see https://huggingface.co/models?search=Helsinki-NLP

    def prepare_scores_for_generation(self, scores, cur_len, max_length):
        if cur_len == max_length - 1 and self.config.eos_token_id is not None:
src/transformers/tokenization_auto.py

@@ -38,6 +38,7 @@ from .configuration_auto import (

    XLMRobertaConfig,
    XLNetConfig,
)
from .configuration_marian import MarianConfig
from .configuration_utils import PretrainedConfig
from .tokenization_albert import AlbertTokenizer
from .tokenization_bart import BartTokenizer

@@ -49,6 +50,7 @@ from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFas

from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
from .tokenization_marian import MarianTokenizer
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
from .tokenization_reformer import ReformerTokenizer
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast

@@ -69,6 +71,7 @@ TOKENIZER_MAPPING = OrderedDict(

        (AlbertConfig, (AlbertTokenizer, None)),
        (CamembertConfig, (CamembertTokenizer, None)),
        (XLMRobertaConfig, (XLMRobertaTokenizer, None)),
        (MarianConfig, (MarianTokenizer, None)),
        (BartConfig, (BartTokenizer, None)),
        (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),
        (ReformerConfig, (ReformerTokenizer, None)),
src/transformers/tokenization_marian.py

@@ -22,7 +22,21 @@ PRETRAINED_VOCAB_FILES_MAP = {

# Example URL https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/vocab.json


class MarianSentencePieceTokenizer(PreTrainedTokenizer):
class MarianTokenizer(PreTrainedTokenizer):
    """Sentencepiece tokenizer for marian. Source and target languages have different SPM models.
    The logic is: use the relevant source_spm or target_spm to encode txt as pieces, then look up each piece in a vocab dictionary.

    Examples::

        from transformers import MarianTokenizer
        tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
        src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
        tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]  # optional
        batch_enc: BatchEncoding = tok.prepare_translation_batch(src_texts, tgt_texts=tgt_texts)
        # keys [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask].
        # model(**batch) should work
    """

    vocab_files_names = vocab_files_names
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = {m: 512 for m in MODEL_NAMES}
@@ -49,6 +63,8 @@ class MarianSentencePieceTokenizer(PreTrainedTokenizer):

            pad_token=pad_token,
        )
        self.encoder = load_json(vocab)
        if self.unk_token not in self.encoder:
            raise KeyError("<unk> token must be in vocab")
        assert self.pad_token in self.encoder
        self.decoder = {v: k for k, v in self.encoder.items()}
@@ -64,8 +80,11 @@ class MarianSentencePieceTokenizer(PreTrainedTokenizer):

        self.spm_target = sentencepiece.SentencePieceProcessor()
        self.spm_target.Load(target_spm)

        # Note(SS): splitter would require lots of book-keeping.
        # self.sentence_splitter = MosesSentenceSplitter(source_lang)
        # Multilingual target side: default to using first supported language code.
        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(">>") and k.endswith("<<")]
        self.tgt_lang_id = None  # will not be used unless it is set through prepare_translation_batch

        # Note(SS): sentence_splitter would require lots of book-keeping.
        try:
            from mosestokenizer import MosesPunctuationNormalizer
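The supported language codes are ordinary vocabulary entries wrapped in `>>`/`<<`, so the comprehension above can be tried on a toy vocab in isolation:

```python
def supported_language_codes(encoder: dict) -> list:
    """Collect multilingual control tokens of the form '>>code<<' from the vocab."""
    return [k for k in encoder if k.startswith(">>") and k.endswith("<<")]
```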
@@ -75,11 +94,10 @@ class MarianSentencePieceTokenizer(PreTrainedTokenizer):

            self.punc_normalizer = lambda x: x

    def _convert_token_to_id(self, token):
        return self.encoder[token]
        return self.encoder.get(token, self.encoder[self.unk_token])

    def _tokenize(self, text: str, src=True) -> List[str]:
        spm = self.spm_source if src else self.spm_target
        return spm.EncodeAsPieces(text)
    def _tokenize(self, text: str) -> List[str]:
        return self.current_spm.EncodeAsPieces(text)

    def _convert_id_to_token(self, index: int) -> str:
        """Converts an index (integer) into a token (str) using the encoder."""
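The switch from `self.encoder[token]` to `.get(...)` is what makes the unk token actually used, per the commit message: an out-of-vocabulary piece now maps to the `<unk>` id instead of raising `KeyError`. The lookup in isolation:

```python
def convert_token_to_id(encoder: dict, token: str, unk_token: str = "<unk>") -> int:
    """Look a piece up in the vocab, falling back to the unk id for unknown pieces."""
    return encoder.get(token, encoder[unk_token])
```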
@@ -89,10 +107,6 @@ class MarianSentencePieceTokenizer(PreTrainedTokenizer):

        """Uses target language sentencepiece model"""
        return self.spm_target.DecodePieces(tokens)

    def _append_special_tokens_and_truncate(self, tokens: str, max_length: int,) -> List[int]:
        ids: list = self.convert_tokens_to_ids(tokens)[:max_length]
        return ids + [self.eos_token_id]

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:
        """Build model inputs from a sequence by appending eos_token_id."""
        if token_ids_1 is None:
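`build_inputs_with_special_tokens` only appends `eos_token_id`; nothing is prepended (generation starts from `pad_token_id` instead, as noted in the docs). A standalone sketch (the free-function form and the `eos_token_id` parameter are ours; in Marian vocabs the eos id happens to be 0):

```python
from typing import List, Optional


def build_inputs_with_special_tokens(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    eos_token_id: int = 0,
) -> List[int]:
    """Build model inputs from a sequence by appending eos_token_id.

    Pairs aren't expected, but the pair branch is kept for API consistency.
    """
    if token_ids_1 is None:
        return token_ids_0 + [eos_token_id]
    return token_ids_0 + token_ids_1 + [eos_token_id]
```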
@@ -100,7 +114,7 @@ class MarianSentencePieceTokenizer(PreTrainedTokenizer):

        # We don't expect to process pairs, but leave the pair logic for API consistency
        return token_ids_0 + token_ids_1 + [self.eos_token_id]

    def decode_batch(self, token_ids, **kwargs) -> List[str]:
    def batch_decode(self, token_ids, **kwargs) -> List[str]:
        return [self.decode(ids, **kwargs) for ids in token_ids]

    def prepare_translation_batch(
@@ -114,40 +128,38 @@ class MarianSentencePieceTokenizer(PreTrainedTokenizer):

        """
        Arguments:
            src_texts: list of src language texts
            src_lang: default en_XX (english)
            tgt_texts: list of tgt language texts
            tgt_lang: default ro_RO (romanian)
            max_length: (None) defer to config (1024 for mbart-large-en-ro)
            pad_to_max_length: (bool)
            return_tensors: (str) default "pt" returns pytorch tensors, pass None to return lists.

        Returns:
            BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask]
            all shaped bs, seq_len. (BatchEncoding is a dict of string -> tensor or lists)

        Examples:
            from transformers import MarianS
            all shaped bs, seq_len. (BatchEncoding is a dict of string -> tensor or lists).
            If no tgt_text is specified, the only keys will be input_ids and attention_mask.
        """
        self.current_spm = self.spm_source
        model_inputs: BatchEncoding = self.batch_encode_plus(
            src_texts,
            add_special_tokens=True,
            return_tensors=return_tensors,
            max_length=max_length,
            pad_to_max_length=pad_to_max_length,
            src=True,
        )
        if tgt_texts is None:
            return model_inputs

        self.current_spm = self.spm_target
        decoder_inputs: BatchEncoding = self.batch_encode_plus(
            tgt_texts,
            add_special_tokens=True,
            return_tensors=return_tensors,
            max_length=max_length,
            pad_to_max_length=pad_to_max_length,
            src=False,
        )
        for k, v in decoder_inputs.items():
            model_inputs[f"decoder_{k}"] = v
        self.current_spm = self.spm_source
        return model_inputs

    @property
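After encoding the target side with the target SPM, the batch keys are renamed with a `decoder_` prefix and merged into the source-side encoding. That merge step, extracted as a standalone helper (the function name is ours):

```python
def merge_decoder_inputs(model_inputs: dict, decoder_inputs: dict) -> dict:
    """Copy target-side encodings into the batch under 'decoder_*' keys."""
    for k, v in decoder_inputs.items():
        model_inputs[f"decoder_{k}"] = v
    return model_inputs
```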
tests/test_modeling_marian.py

@@ -18,35 +18,94 @@ import unittest


from transformers import is_torch_available
from transformers.file_utils import cached_property
from transformers.hf_api import HfApi

from .utils import require_torch, slow, torch_device


if is_torch_available():
    import torch
    from transformers import MarianMTModel, MarianSentencePieceTokenizer
    from transformers import (
        AutoTokenizer,
        MarianConfig,
        AutoConfig,
        AutoModelWithLMHead,
        MarianTokenizer,
        MarianMTModel,
    )


class ModelManagementTests(unittest.TestCase):
    @slow
    def test_model_count(self):
        model_list = HfApi().model_list()
        expected_num_models = 1011
        actual_num_models = len([x for x in model_list if x.modelId.startswith("Helsinki-NLP")])
        self.assertEqual(expected_num_models, actual_num_models)


@require_torch
class IntegrationTests(unittest.TestCase):
class MarianIntegrationTest(unittest.TestCase):
    src = "en"
    tgt = "de"
    src_text = [
        "I am a small frog.",
        "Now I can forget the 100 words of german that I know.",
        "Tom asked his teacher for advice.",
        "That's how I would do it.",
        "Tom really admired Mary's courage.",
        "Turn around and close your eyes.",
    ]
    expected_text = [
        "Ich bin ein kleiner Frosch.",
        "Jetzt kann ich die 100 Wörter des Deutschen vergessen, die ich kenne.",
        "Tom bat seinen Lehrer um Rat.",
        "So würde ich das machen.",
        "Tom bewunderte Marias Mut wirklich.",
        "Drehen Sie sich um und schließen Sie die Augen.",
    ]
    # ^^ actual C++ output differs slightly: (1) des Deutschen removed, (2) "" -> "O", (3) tun -> machen

    @classmethod
    def setUpClass(cls) -> None:
        cls.model_name = "Helsinki-NLP/opus-mt-en-de"
        cls.tokenizer = MarianSentencePieceTokenizer.from_pretrained(cls.model_name)
        cls.model_name = f"Helsinki-NLP/opus-mt-{cls.src}-{cls.tgt}"
        cls.tokenizer: MarianTokenizer = AutoTokenizer.from_pretrained(cls.model_name)
        cls.eos_token_id = cls.tokenizer.eos_token_id
        return cls

    @cached_property
    def model(self):
        model = MarianMTModel.from_pretrained(self.model_name).to(torch_device)
        model: MarianMTModel = AutoModelWithLMHead.from_pretrained(self.model_name).to(torch_device)
        c = model.config
        self.assertListEqual(c.bad_words_ids, [[c.pad_token_id]])
        self.assertEqual(c.max_length, 512)
        self.assertEqual(c.decoder_start_token_id, c.pad_token_id)

        if torch_device == "cuda":
            return model.half()
        else:
            return model

    def _assert_generated_batch_equal_expected(self, **tokenizer_kwargs):
        generated_words = self.translate_src_text(**tokenizer_kwargs)
        self.assertListEqual(self.expected_text, generated_words)

    def translate_src_text(self, **tokenizer_kwargs):
        model_inputs: dict = self.tokenizer.prepare_translation_batch(src_texts=self.src_text, **tokenizer_kwargs).to(
            torch_device
        )
        self.assertEqual(self.model.device, model_inputs["input_ids"].device)
        generated_ids = self.model.generate(
            model_inputs["input_ids"], attention_mask=model_inputs["attention_mask"], num_beams=2
        )
        generated_words = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        return generated_words


class TestMarian_EN_DE_More(MarianIntegrationTest):
    @slow
    def test_forward(self):
        src, tgt = ["I am a small frog"], ["▁Ich ▁bin ▁ein ▁kleiner ▁Fro sch"]
        src, tgt = ["I am a small frog"], ["Ich bin ein kleiner Frosch."]
        expected = [38, 121, 14, 697, 38848, 0]

        model_inputs: dict = self.tokenizer.prepare_translation_batch(src, tgt_texts=tgt).to(torch_device)
@ -62,57 +121,112 @@ class IntegrationTests(unittest.TestCase):
|
|||
with torch.no_grad():
|
||||
logits, *enc_features = self.model(**model_inputs)
|
||||
max_indices = logits.argmax(-1)
|
||||
self.tokenizer.decode_batch(max_indices)
|
||||
self.tokenizer.batch_decode(max_indices)
|
||||
|
||||
    @slow
    def test_repl_generate_one(self):
        src = ["I am a small frog.", "Hello"]
        model_inputs: dict = self.tokenizer.prepare_translation_batch(src).to(torch_device)
        self.assertEqual(self.model.device, model_inputs["input_ids"].device)
        generated_ids = self.model.generate(model_inputs["input_ids"], num_beams=6)
        generated_words = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        expected_words = "Ich bin ein kleiner Frosch."
        self.assertEqual(expected_words, generated_words)

    @slow
    def test_repl_generate_batch(self):
        src = [
            "I am a small frog.",
            "Now I can forget the 100 words of german that I know.",
            "O",
            "Tom asked his teacher for advice.",
            "That's how I would do it.",
            "Tom really admired Mary's courage.",
            "Turn around and close your eyes.",
        ]
        model_inputs: dict = self.tokenizer.prepare_translation_batch(src).to(torch_device)
        self.assertEqual(self.model.device, model_inputs["input_ids"].device)
        generated_ids = self.model.generate(
            model_inputs["input_ids"],
            length_penalty=1.0,
            num_beams=2,  # 6 is the default
            bad_words_ids=[[self.tokenizer.pad_token_id]],
        )
        expected = [
            "Ich bin ein kleiner Frosch.",
            "Jetzt kann ich die 100 Wörter des Deutschen vergessen, die ich kenne.",
            "",
            "Tom bat seinen Lehrer um Rat.",
            "So würde ich das tun.",
            "Tom bewunderte Marias Mut wirklich.",
            "Umdrehen und die Augen schließen.",
        ]
        # actual C++ output differences: (1) "des Deutschen" removed, (2) "" -> "O", (3) tun -> machen
        generated_words = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        self.assertListEqual(expected, generated_words)

    def test_tokenizer_equivalence(self):
        batch = self.tokenizer.prepare_translation_batch(["I am a small frog"]).to(torch_device)
        input_ids = batch["input_ids"][0]
        expected = [38, 121, 14, 697, 38848, 0]
        self.assertListEqual(expected, input_ids.tolist())

    def test_unk_support(self):
        t = self.tokenizer
        ids = t.prepare_translation_batch(["||"]).to(torch_device)["input_ids"][0].tolist()
        expected = [t.unk_token_id, t.unk_token_id, t.eos_token_id]
        self.assertEqual(expected, ids)

    def test_pad_not_split(self):
        input_ids_w_pad = self.tokenizer.prepare_translation_batch(["I am a small frog <pad>"])["input_ids"][0]
        expected_w_pad = [38, 121, 14, 697, 38848, self.tokenizer.pad_token_id, 0]  # pad
        self.assertListEqual(expected_w_pad, input_ids_w_pad.tolist())

    @slow
    def test_batch_generation_en_de(self):
        self._assert_generated_batch_equal_expected()

    def test_auto_config(self):
        config = AutoConfig.from_pretrained(self.model_name)
        self.assertIsInstance(config, MarianConfig)


class TestMarian_EN_FR(MarianIntegrationTest):
    src = "en"
    tgt = "fr"
    src_text = [
        "I am a small frog.",
        "Now I can forget the 100 words of german that I know.",
    ]
    expected_text = [
        "Je suis une petite grenouille.",
        "Maintenant, je peux oublier les 100 mots d'allemand que je connais.",
    ]

    @slow
    def test_batch_generation_en_fr(self):
        self._assert_generated_batch_equal_expected()


class TestMarian_FR_EN(MarianIntegrationTest):
    src = "fr"
    tgt = "en"
    src_text = [
        "Donnez moi le micro.",
        "Tom et Mary étaient assis à une table.",  # Accents
    ]
    expected_text = [
        "Give me the microphone.",
        "Tom and Mary were sitting at a table.",
    ]

    @slow
    def test_batch_generation_fr_en(self):
        self._assert_generated_batch_equal_expected()


class TestMarian_RU_FR(MarianIntegrationTest):
    src = "ru"
    tgt = "fr"
    src_text = ["Он показал мне рукопись своей новой пьесы."]
    expected_text = ["Il me montre un manuscrit de sa nouvelle pièce."]

    @slow
    def test_batch_generation_ru_fr(self):
        self._assert_generated_batch_equal_expected()


class TestMarian_MT_EN(MarianIntegrationTest):
    src = "mt"
    tgt = "en"
    src_text = ["Il - Babiloniżi b'mod żbaljat ikkonkludew li l - Alla l - veru kien dgħajjef."]
    expected_text = ["The Babylonians wrongly concluded that the true God was weak."]

    @unittest.skip("")  # Known issue: this model generates a string of "...." at the end of the translation.
    def test_batch_generation_mt_en(self):
        self._assert_generated_batch_equal_expected()


class TestMarian_DE_Multi(MarianIntegrationTest):
    src = "de"
    tgt = "ch_group"
    src_text = ["Er aber sprach: Das ist die Gottlosigkeit."]

    @slow
    def test_translation_de_multi_does_not_error(self):
        self.translate_src_text()

    @unittest.skip("")  # Language codes are not yet supported.
    def test_batch_generation_de_multi_tgt(self):
        self._assert_generated_batch_equal_expected()

    @unittest.skip("")  # Language codes are not yet supported.
    def test_lang_code(self):
        t = "Er aber sprach"
        zh_code = self.code
        tok_fn = self.tokenizer.prepare_translation_batch
        pass_code = tok_fn(src_texts=[t], tgt_lang_code=zh_code)["input_ids"][0]
        preprocess_with_code = tok_fn(src_texts=[zh_code + " " + t])["input_ids"][0]
        self.assertListEqual(pass_code.tolist(), preprocess_with_code.tolist())
        for code in self.tokenizer.supported_language_codes:
            self.assertIn(code, self.tokenizer.encoder)
        pass_only_code = tok_fn(src_texts=[""], tgt_lang_code=zh_code)["input_ids"][0].tolist()
        self.assertListEqual(pass_only_code, [self.tokenizer.encoder[zh_code], self.tokenizer.eos_token_id])
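
Taken together, the helpers exercised by these tests amount to a simple tokenize → generate → batch-decode flow. Below is a minimal standalone sketch of that flow, not part of the test suite: the `translate` helper and its defaults are illustrative, it assumes a `transformers` release that ships `MarianTokenizer.prepare_translation_batch`, and model weights are downloaded from the model hub on first use.

```python
def translate(src_texts, model_name="Helsinki-NLP/opus-mt-en-de", num_beams=2):
    """Hypothetical helper sketching the Marian translation flow tested above."""
    # Deferred import: transformers and the downloaded checkpoint are heavy.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    # Batch of source sentences -> input_ids / attention_mask tensors.
    batch = tokenizer.prepare_translation_batch(src_texts=src_texts)
    generated_ids = model.generate(
        batch["input_ids"], attention_mask=batch["attention_mask"], num_beams=num_beams
    )
    # Drop <pad>/</s> markers and detokenize back to plain strings.
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

With the en-de checkpoint, `translate(["I am a small frog"])` should produce output along the lines of the tests' expected text, "Ich bin ein kleiner Frosch."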