XLM-RoBERTa for the Khmer Language

Trained from scratch with the masked language modeling (MLM) objective for 1M steps on 5M Khmer sentences (162M words, 578K unique words).
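For readers unfamiliar with masked language modeling, here is a minimal sketch of how training inputs are typically corrupted. It assumes the common BERT-style recipe (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). This card does not state the masking settings actually used, so `mask_prob`, the 80/10/10 split, and the helper name `mask_tokens` are illustrative assumptions.

```python
import random

MASK_TOKEN = "<mask>"  # mask string used by XLM-RoBERTa-style tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else. Hypothetical helper, not the training code.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)          # 80%: mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))   # 10%: random token
            else:
                masked.append(tok)                 # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)         # position is ignored by the loss
    return masked, labels


if __name__ == "__main__":
    toy_tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
    toy_vocab = ["t1", "t2", "t3", "t4"]
    print(mask_tokens(toy_tokens, toy_vocab, mask_prob=0.5))
```

The loss is computed only at positions whose label is not None, so the model learns to reconstruct the selected tokens from their context.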

The training data was collected by crawling publicly available news sites and Wikipedia.

Why?

  1. xlm-roberta-base is big (279M parameters); this model has only 49M parameters.
  2. xlm-roberta-base is not optimized for the Khmer language.
  3. xlm-roberta-base has a much larger vocabulary (250,002 tokens); this model uses a vocabulary of 8,000 tokens.
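To see why vocabulary size matters so much for model size, the rough arithmetic below counts only the token embedding table (vocabulary size times hidden size). The hidden size of 768 for xlm-roberta-base is a known property of that model; the hidden size of this model is not stated in this card, so the sketch compares only the number of embedding rows for it. The function name `embedding_params` is illustrative.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a token embedding table (one row per vocab entry)."""
    return vocab_size * hidden_size


XLMR_BASE_VOCAB = 250_002       # from the list above
XLMR_BASE_HIDDEN = 768          # hidden size of xlm-roberta-base
XLMR_BASE_TOTAL = 279_000_000   # approximate total, from the list above
KHMER_VOCAB = 8_000             # from the list above

xlmr_emb = embedding_params(XLMR_BASE_VOCAB, XLMR_BASE_HIDDEN)
print(f"xlm-roberta-base token embeddings: {xlmr_emb:,} weights")
print(f"share of all parameters: {xlmr_emb / XLMR_BASE_TOTAL:.0%}")
print(f"embedding rows ratio: {XLMR_BASE_VOCAB / KHMER_VOCAB:.1f}x")
```

Under these numbers, roughly two thirds of xlm-roberta-base's parameters sit in the token embedding table, most of it for tokens that never appear in Khmer text. A Khmer-only vocabulary of 8,000 entries needs about 31 times fewer embedding rows.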

Usage

from transformers import pipeline

pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")

result = pipe("សួស្ដីកម្ពុ<mask>!")
print(result)

Output:

[
  {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
  {"score": 0.17512884736061096, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
  {"score": 0.0034702506382018328, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
  {"score": 0.00305828545242548, "token": 16, "token_str": "រ", "sequence": "សួស្ដីកម្ពុរ!"},
  {"score": 0.0007526700501330197, "token": 133, "token_str": "គ", "sequence": "សួស្ដីកម្ពុគ!"},
]
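The fill-mask pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. As a small sketch of post-processing, the hypothetical helper below keeps only candidates above a score threshold. The sample data is copied from the output above so the snippet runs without downloading the model; the `min_score` default of 0.05 is an arbitrary choice for illustration.

```python
def top_predictions(results, min_score=0.05):
    """Keep fill-mask candidates scoring at least min_score, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# First three entries of the pipeline output shown above.
sample = [
    {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
    {"score": 0.17512884736061096, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
    {"score": 0.0034702506382018328, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
]

for r in top_predictions(sample):
    print(f'{r["sequence"]}  ({r["score"]:.1%})')
```

To change how many candidates the pipeline itself returns, the call also accepts a `top_k` argument, for example `pipe(text, top_k=3)`.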

License

Apache-2.0

Citation

No need. :)

Model size: 49.7M params (Safetensors, tensor type F32)