---
library_name: transformers
license: apache-2.0
language:
- km
pipeline_tag: fill-mask
---
# XLMRoBERTa for Khmer Language
Trained from scratch on the **Masked Language Modeling** task with 5M Khmer sentences (162M words, 578K unique words) for 1M steps.
The training data was created by crawling publicly available news sites and Wikipedia.
## Why?
1. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is big: 279M parameters, while this model has only 49M.
2. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is not optimized for the Khmer language.
3. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) has a much larger vocabulary (250,002 tokens), while this model uses a vocabulary of only 8,000.
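Much of that size gap comes from the input-embedding table, which scales linearly with vocabulary size. A back-of-the-envelope sketch (768 is xlm-roberta-base's published hidden size; applying the same hidden size to the 8K-vocabulary model is an illustrative assumption, not this model's actual config):

```python
# Embedding-table parameter count is roughly vocab_size * hidden_size.
# 768 is xlm-roberta-base's hidden size; reusing it for the 8K-vocab
# model below is an illustrative assumption.
xlmr_base_embeddings = 250_002 * 768   # ~192M parameters in embeddings alone
small_vocab_embeddings = 8_000 * 768   # ~6.1M parameters

print(f"{xlmr_base_embeddings:,} vs {small_vocab_embeddings:,}")
```

Under that assumption, shrinking the vocabulary alone saves roughly 186M embedding parameters, which would account for much of the 279M vs 49M difference.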
## Usage
```python
from transformers import pipeline
pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")
result = pipe("សួស្តីកម្ពុ<mask>!")
print(result)
```
```python
[
  {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្តីកម្ពុជា!"},
  # … four lower-scoring candidates follow
]
```
## License
`Apache-2.0`
## Citation
No need. :)