---
license: mit
language:
- es
metrics:
- accuracy
pipeline_tag: fill-mask
widget:
- text: Vamos a comer unos [MASK]
  example_title: "Vamos a comer unos tacos"
tags:
- code
- nlp
- custom
- bilma
tokenizer:
- yes
---
# BILMA (Bert In Latin aMericA)

BILMA is a BERT implementation in TensorFlow, trained on the Masked Language Model task with the
datasets described at https://sadit.github.io/regional-spanish-models-talk-2022/. The model is trained on regionalized
Spanish short texts from the Twitter (now X) platform.

We have pretrained models for Argentina, Chile, Colombia, Spain, Mexico, the United States, Uruguay, and Venezuela.

The accuracy of the models trained on the MLM task for the different regions is shown below:

![bilma-mlm-comp](https://user-images.githubusercontent.com/392873/163045798-89bd45c5-b654-4f16-b3e2-5cf404e12ddd.png)

# Pre-requisites

You will need TensorFlow 2.4 or newer.
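
To check which version you have installed (a minimal sanity check, nothing model-specific):
```
import tensorflow as tf
print(tf.__version__)  # should print 2.4.0 or newer
```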

# Quick guide

Install the following version of the `transformers` library:
```
!pip install transformers==4.30.2
```

Instantiate the tokenizer and the trained model (`trust_remote_code=True` lets `transformers` load the model's custom code from the repository):
```
from transformers import AutoTokenizer
from transformers import TFAutoModel

tok = AutoTokenizer.from_pretrained("guillermoruiz/bilma_mx")
model = TFAutoModel.from_pretrained("guillermoruiz/bilma_mx", trust_remote_code=True)
```
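
The other regional models are loaded the same way; only the repository id changes. The exact Hub ids for regions other than Mexico are assumptions here, so check the author's Hub profile for the real names:
```
# Sketch: loading a different regional variant. Only "guillermoruiz/bilma_mx"
# is confirmed above; "guillermoruiz/bilma_ar" is a hypothetical id for the
# Argentina model -- verify the actual repository name on the Hub.
tok_ar = AutoTokenizer.from_pretrained("guillermoruiz/bilma_ar")
model_ar = TFAutoModel.from_pretrained("guillermoruiz/bilma_ar", trust_remote_code=True)
```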

Now, we need some text to pass through the tokenizer:
```
text = ["Vamos a comer [MASK].",
        "Hace mucho que no voy al [MASK]."]
t = tok(text, padding="max_length", return_tensors="tf", max_length=280)
```
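
The tokenizer returns TensorFlow tensors padded to the requested length; a quick inspection of the shapes (batch size 2, sequence length 280, as implied by the call above):
```
print(t["input_ids"].shape)       # (2, 280): two sentences padded to max_length
print(t["attention_mask"].shape)  # (2, 280): 1 for real tokens, 0 for padding
```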

With this, we are ready to use the model:
```
p = model(t)
```
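
`p["logits"]` holds one score per vocabulary entry for every token position, i.e. a tensor of shape `(batch, sequence_length, vocab_size)`, which is what the decoding step below relies on:
```
print(p["logits"].shape)  # (2, 280, vocab_size)
```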

Now, we get the most likely words with:
```
import tensorflow as tf
tok.batch_decode(tf.argmax(p["logits"], 2)[:,1:], skip_special_tokens=True)
```

which produces the output:
```
['vamos a comer tacos.', 'hace mucho que no voy al gym.']
```
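
Decoding the argmax at every position rewrites the whole sentence. If you only care about the masked slot, one option (a sketch, assuming the tokenizer exposes the standard `mask_token_id`, which the `[MASK]` examples above suggest) is to locate the `[MASK]` position and take the top-k candidates there:
```
import tensorflow as tf

# Locate each [MASK] and print the five most likely fillers for it.
# Assumes tok.mask_token_id is set, as the [MASK] usage above suggests.
mask_positions = tf.where(t["input_ids"] == tok.mask_token_id)  # rows of [batch, position]
for batch_idx, pos in mask_positions.numpy():
    top = tf.math.top_k(p["logits"][batch_idx, pos], k=5)
    print(tok.convert_ids_to_tokens(top.indices.numpy()))
```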

If you find this model useful for your research, please cite the following paper:
```
@misc{tellez2022regionalized,
      title={Regionalized models for Spanish language variations based on Twitter}, 
      author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
      year={2022},
      eprint={2110.06128},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```