---
library_name: transformers
license: apache-2.0
language:
- km
pipeline_tag: fill-mask
---

# XLM-RoBERTa for Khmer Language

Trained from scratch on the **Masked Language Modeling** task for 1M steps, using 5M Khmer sentences (162M words, 578K unique words).

The training data was created by crawling publicly available news sites and Wikipedia.


## Why?

1. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is big: it has 279M parameters, while this model has only 49M.
2. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is not optimized for the Khmer language.
3. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) has a much larger vocabulary (250,002 tokens), while this model uses a vocabulary of 8,000 tokens.
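
As a rough illustration of why vocabulary size matters, the sketch below counts the weights in a token embedding matrix for both vocabularies. It assumes a hidden size of 768, which is what xlm-roberta-base uses; the hidden size of this model is not stated here, so the second figure is illustrative only. `embedding_params` is a hypothetical helper, not part of any library.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a token embedding matrix (vocab_size x hidden_size)."""
    return vocab_size * hidden_size


# Hidden size of xlm-roberta-base; ASSUMED for this model as well.
HIDDEN_SIZE = 768

base = embedding_params(250_002, HIDDEN_SIZE)
small = embedding_params(8_000, HIDDEN_SIZE)

print(f"xlm-roberta-base token embeddings: {base:,}")
print(f"8,000-token vocabulary:            {small:,}")
```

Under that assumption, the token embeddings alone account for roughly 192M of the 279M parameters in xlm-roberta-base, so most of the size difference can come from the vocabulary.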

## Usage


```python
from transformers import pipeline

pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")

# Khmer for "Hello Cambo<mask>!"; the model should complete the word "Cambodia".
result = pipe("សួស្ដីកម្ពុ<mask>!")
print(result)
```

```python
[
  {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
  {"score": 0.17512884736061096, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
  {"score": 0.0034702506382018328, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
  {"score": 0.00305828545242548, "token": 16, "token_str": "រ", "sequence": "សួស្ដីកម្ពុរ!"},
  {"score": 0.0007526700501330197, "token": 133, "token_str": "គ", "sequence": "សួស្ដីកម្ពុគ!"},
]
```
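
The pipeline returns a plain list of dictionaries, so ordinary Python is enough to post-process it. The sketch below keeps only the candidates at or above a score threshold. `confident_fills` and the `0.1` threshold are hypothetical choices for illustration, and `sample` reuses the first three entries of the output above with rounded scores.

```python
from typing import Dict, List, Tuple


def confident_fills(results: List[Dict], min_score: float = 0.1) -> List[Tuple[str, float]]:
    """Return (token_str, score) pairs whose score is at least min_score."""
    return [(r["token_str"], r["score"]) for r in results if r["score"] >= min_score]


# First three entries of the fill-mask output, with scores rounded.
sample = [
    {"score": 0.8130, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
    {"score": 0.1751, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
    {"score": 0.0035, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
]

# Print each surviving candidate with its score.
for token_str, score in confident_fills(sample):
    print(f"{token_str}\t{score:.3f}")
```

To change how many candidates the pipeline itself returns, the fill-mask pipeline accepts a `top_k` argument at call time, for example `pipe(text, top_k=3)`.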

## License

`Apache-2.0`

## Citation

No need. :)