diff --git a/model_cards/sarahlintang/IndoBERT/README.md b/model_cards/sarahlintang/IndoBERT/README.md
new file mode 100644
index 000000000..bb5348cca
--- /dev/null
+++ b/model_cards/sarahlintang/IndoBERT/README.md
@@ -0,0 +1,102 @@
+---
+language: id
+datasets:
+- oscar
+---
+# IndoBERT (Indonesian BERT Model)
+
+## Model description
+IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language.
+
+This model is the base-uncased version and uses the bert-base configuration.
+
+## Intended uses & limitations
+
+#### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
+model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
+
+# Tokenize an Indonesian sentence ("hi, I want to eat.")
+tokenizer.encode("hai aku mau makan.")
+# [2, 8078, 1785, 2318, 1946, 18, 4]
+```
+
+A sketch that runs the model on this sentence to obtain contextual embeddings appears at the end of this card.
+
+## Training data
+
+This model was pre-trained on 16 GB of raw text (roughly 2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).
+
+The model follows the bert-base architecture, with a vocabulary size of 32,000.
+
+## Training procedure
+
+The model was trained with Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
+We used a Google Cloud Storage bucket for persistent storage of the training data and model checkpoints.
+
+## Eval results
+
+We evaluated this model on three Indonesian NLP downstream tasks:
+- extractive summarization
+- sentiment analysis
+- part-of-speech tagging
+
+IndoBERT outperformed multilingual BERT on all three tasks. A hypothetical sketch of a sentiment-analysis setup follows the embedding example below.
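+
+#### Example: extracting contextual embeddings
+
+The How to use snippet above stops at token IDs. Below is a minimal sketch, assuming PyTorch is installed, of running the model to obtain contextual embeddings; the mean-pooled sentence vector is a common convention for BERT-style encoders, not something this model prescribes.
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
+model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
+
+# Batch of one Indonesian sentence ("hi, I want to eat.")
+inputs = tokenizer("hai aku mau makan.", return_tensors="pt")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# One vector per token: (batch, sequence length, hidden size = 768 for bert-base)
+token_embeddings = outputs.last_hidden_state
+
+# A simple sentence representation: average the token vectors
+sentence_embedding = token_embeddings.mean(dim=1)
+print(sentence_embedding.shape)  # torch.Size([1, 768])
+```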
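+
+#### Example: fine-tuning for sentiment analysis (hypothetical sketch)
+
+The evaluation above included a sentiment-analysis task, but the training code is not part of this card. The sketch below shows one generic way to fine-tune this checkpoint for binary sentiment classification; the example sentences, the label scheme, and `num_labels=2` are illustrative assumptions, not the setup behind the reported results.
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
+# num_labels=2 assumes a binary positive/negative scheme (an assumption,
+# not the documented evaluation setup).
+model = AutoModelForSequenceClassification.from_pretrained(
+    "sarahlintang/IndoBERT", num_labels=2
+)
+
+# Placeholder examples; a real run would use an Indonesian sentiment corpus.
+texts = [
+    "makanan ini enak sekali",    # "this food is very tasty"
+    "pelayanannya sangat buruk",  # "the service is very bad"
+]
+labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
+
+batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+outputs = model(**batch, labels=labels)
+
+loss = outputs.loss  # cross-entropy loss over the two classes
+loss.backward()      # one illustrative step; pair with an optimizer in a real loop
+```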