loodos turkish model cards added (#6840)
Parent: 502d194b95 · Commit: bff6d517cd
6 changed files with 270 additions and 0 deletions
model_cards/loodos/albert-base-turkish-uncased/README.md (new file, 48 lines)
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
# Turkish ALBERT-Base (uncased)
This is the ALBERT-Base model, which has 12 repeated encoder layers with a hidden size of 768, trained on an uncased Turkish dataset.
## Usage
Using `AutoModel` and `AutoTokenizer` from Transformers, you can load the model as shown below.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True)
model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased")

# TextNormalization is the helper module we provide in our repo
# (https://github.com/Loodos/turkish-language-models); see "Notes on Tokenizers" below.
normalizer = TextNormalization()
normalized_text = normalizer(text, do_lower_case=True)
tokenizer.tokenize(normalized_text)
```
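Continuing the snippet above, the loaded `AutoModel` can be used as a feature extractor. A minimal sketch (not from the original card), reusing `normalized_text` from the previous step:

```python
import torch

# Encode the normalized text and run the encoder without gradient tracking.
inputs = tokenizer(normalized_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first output is the last hidden state: (batch, sequence_length, 768).
last_hidden_state = outputs[0]
print(last_hidden_state.shape)
```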
### Notes on Tokenizers

Currently, Hugging Face's tokenizers (written in Python) have a bug affecting the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons:

1. The vocabulary and SentencePiece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü, which causes wrong tokenization, wrong training, and loss of information: some tokens (like "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish.
2. Python's default `string.lower()` and `string.upper()` convert

   - "I" and "İ" to "i"
   - "i" and "ı" to "I"

   respectively. However, in Turkish, "I" and "İ" are two different letters. Both issues are demonstrated in the snippet below.
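Both failure modes can be reproduced with the standard library alone; a minimal sketch:

```python
import unicodedata

# 1) NFD/NFKD decomposes precomposed Turkish letters into base + combining marks,
#    so they no longer match vocabulary entries built with NFC/NFKC.
word = "şanlıurfa"
print(unicodedata.normalize("NFD", word) == word)                # False
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "İ")])  # ['0x49', '0x307']

# 2) Python's locale-independent case conversion merges the dotted/dotless pairs.
print("ısı".upper())  # 'ISI' - both 'ı' and 'i' uppercase to 'I'
print("I".lower())    # 'i'   - Turkish expects the dotless 'ı' here
```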
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Hugging Face's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
# Details and Contact
You can contact us to ask questions, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
# Acknowledgments
Many thanks to the TFRC team for providing us with Cloud TPUs through the TensorFlow Research Cloud to train our models.
model_cards/loodos/bert-base-turkish-uncased/README.md (new file, 49 lines)
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
# Turkish BERT-Base (uncased)
This is the BERT-Base model, which has 12 encoder layers with a hidden size of 768, trained on an uncased Turkish dataset.
## Usage
Using `AutoModel` and `AutoTokenizer` from Transformers, you can load the model as shown below.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")

# TextNormalization is the helper module we provide in our repo
# (https://github.com/Loodos/turkish-language-models); see "Notes on Tokenizers" below.
normalizer = TextNormalization()
normalized_text = normalizer(text, do_lower_case=True)
tokenizer.tokenize(normalized_text)
```
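For a quick qualitative check, the checkpoint can also be tried with the fill-mask pipeline. This is a hedged sketch, assuming the uploaded weights still include BERT's masked-LM head (the card itself only demonstrates `AutoModel`):

```python
from transformers import pipeline

# Input should be normalized and lowercased first, as described above.
fill_mask = pipeline("fill-mask", model="loodos/bert-base-turkish-uncased")
print(fill_mask("türkiye'nin başkenti [MASK]."))
```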
### Notes on Tokenizers

Currently, Hugging Face's tokenizers (written in Python) have a bug affecting the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons:

1. The vocabulary and SentencePiece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü, which causes wrong tokenization, wrong training, and loss of information: some tokens (like "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish.
2. Python's default `string.lower()` and `string.upper()` convert

   - "I" and "İ" to "i"
   - "i" and "ı" to "I"

   respectively. However, in Turkish, "I" and "İ" are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Hugging Face's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
# Contact
You can contact us to ask questions, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
# Acknowledgments
Many thanks to the TFRC team for providing us with Cloud TPUs through the TensorFlow Research Cloud to train our models.
model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md (new file, 48 lines)
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Base-discriminator (uncased/64k)
This is the discriminator of the ELECTRA-Base model, which has the same structure as BERT-Base, trained on an uncased Turkish dataset. This version uses a vocabulary of 64k, instead of the default 32k.
## Usage
Using `AutoModel` and `AutoTokenizer` from Transformers, you can load the model as shown below.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
model = AutoModel.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")

# TextNormalization is the helper module we provide in our repo
# (https://github.com/Loodos/turkish-language-models); see "Notes on Tokenizers" below.
normalizer = TextNormalization()
normalized_text = normalizer(text, do_lower_case=True)
tokenizer.tokenize(normalized_text)
```
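Continuing from the snippet above, the enlarged vocabulary can be sanity-checked directly (the expected count is an assumption based on the "64k" in this card's name):

```python
# Expected to be roughly 64000 for this model, versus ~32000 for the default vocab.
print(tokenizer.vocab_size)
```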
### Notes on Tokenizers

Currently, Hugging Face's tokenizers (written in Python) have a bug affecting the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons:

1. The vocabulary and SentencePiece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü, which causes wrong tokenization, wrong training, and loss of information: some tokens (like "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish.
2. Python's default `string.lower()` and `string.upper()` convert

   - "I" and "İ" to "i"
   - "i" and "ı" to "I"

   respectively. However, in Turkish, "I" and "İ" are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Hugging Face's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
# Details and Contact
You can contact us to ask questions, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
# Acknowledgments
Many thanks to the TFRC team for providing us with Cloud TPUs through the TensorFlow Research Cloud to train our models.
model_cards/loodos/electra-base-turkish-uncased/README.md (new file, 49 lines)
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Base (uncased)
This is the discriminator of the ELECTRA-Base model, which has the same structure as BERT-Base, trained on an uncased Turkish dataset.
## Usage
Using `AutoModel` and `AutoTokenizer` from Transformers, you can load the model as shown below.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased", do_lower_case=False)
model = AutoModel.from_pretrained("loodos/electra-base-turkish-uncased")

# TextNormalization is the helper module we provide in our repo
# (https://github.com/Loodos/turkish-language-models); see "Notes on Tokenizers" below.
normalizer = TextNormalization()
normalized_text = normalizer(text, do_lower_case=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers

Currently, Hugging Face's tokenizers (written in Python) have a bug affecting the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons:

1. The vocabulary and SentencePiece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü, which causes wrong tokenization, wrong training, and loss of information: some tokens (like "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish.
2. Python's default `string.lower()` and `string.upper()` convert

   - "I" and "İ" to "i"
   - "i" and "ı" to "I"

   respectively. However, in Turkish, "I" and "İ" are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Hugging Face's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
# Details and Contact
You can contact us to ask questions, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
# Acknowledgments
Many thanks to the TFRC team for providing us with Cloud TPUs through the TensorFlow Research Cloud to train our models.
model_cards/loodos/electra-small-turkish-cased-discriminator/README.md (new file, 27 lines)
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Small-discriminator (cased)
This is the discriminator of the ELECTRA-Small model, which has 12 encoder layers with a hidden size of 256, trained on a cased Turkish dataset.
## Usage
Using `AutoModel` and `AutoTokenizer` from Transformers, you can load the model as shown below.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
model = AutoModel.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
```
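Since this checkpoint is an ELECTRA discriminator, it can also be loaded with `ElectraForPreTraining` to score each token as original vs. replaced. A minimal sketch, assuming the uploaded weights are compatible with that head (the card itself only demonstrates `AutoModel`):

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
model = ElectraForPreTraining.from_pretrained("loodos/electra-small-turkish-cased-discriminator")

inputs = tokenizer("Ankara Türkiye'nin başkentidir.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # per-token replaced-token-detection logits

# sigmoid(logit) > 0.5 means the discriminator thinks the token was replaced.
print(torch.sigmoid(logits).round())
```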
# Details and Contact
You can contact us to ask questions, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
# Acknowledgments
Many thanks to the TFRC team for providing us with Cloud TPUs through the TensorFlow Research Cloud to train our models.
model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md (new file, 49 lines)
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Small-discriminator (uncased)
This is the discriminator of the ELECTRA-Small model, which has 12 encoder layers with a hidden size of 256, trained on an uncased Turkish dataset.
## Usage
Using `AutoModel` and `AutoTokenizer` from Transformers, you can load the model as shown below.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-uncased-discriminator", do_lower_case=False)
model = AutoModel.from_pretrained("loodos/electra-small-turkish-uncased-discriminator")

# TextNormalization is the helper module we provide in our repo
# (https://github.com/Loodos/turkish-language-models); see "Notes on Tokenizers" below.
normalizer = TextNormalization()
normalized_text = normalizer(text, do_lower_case=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers

Currently, Hugging Face's tokenizers (written in Python) have a bug affecting the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons:

1. The vocabulary and SentencePiece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü, which causes wrong tokenization, wrong training, and loss of information: some tokens (like "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish.
2. Python's default `string.lower()` and `string.upper()` convert

   - "I" and "İ" to "i"
   - "i" and "ı" to "I"

   respectively. However, in Turkish, "I" and "İ" are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Hugging Face's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
# Details and Contact
You can contact us to ask questions, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
# Acknowledgments
Many thanks to the TFRC team for providing us with Cloud TPUs through the TensorFlow Research Cloud to train our models.