---
language:
- he
pipeline_tag: fill-mask
datasets:
- HeNLP/HeDC4
---

## Hebrew Language Model for Long Documents

A state-of-the-art Longformer language model for Hebrew, suited to long documents and pretrained on the HeDC4 dataset.

### How to use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('HeNLP/LongHeRo')
model = AutoModelForMaskedLM.from_pretrained('HeNLP/LongHeRo')

# Tokenization example:
# Encoding and tokenizing (the example sentence means "Hello world")
encoded_string = tokenizer('שלום עולם')

# Decoding back to text
decoded_string = tokenizer.decode(encoded_string['input_ids'], skip_special_tokens=True)
```
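
Because LongHeRo is trained with a masked-language-modeling objective (the card's `fill-mask` pipeline tag), it can also be queried through the Transformers `fill-mask` pipeline. The sketch below is illustrative only; the Hebrew prompt and the printed fields are examples, not part of the original card.

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the same checkpoint.
fill_mask = pipeline('fill-mask', model='HeNLP/LongHeRo')

# Query the tokenizer for its mask token instead of hard-coding it.
mask = fill_mask.tokenizer.mask_token

# Predict the masked word in a short (illustrative) Hebrew prompt: "Hello <mask>".
predictions = fill_mask(f'שלום {mask}')

# Each prediction is a dict with the candidate token and its score.
for p in predictions:
    print(p['token_str'], round(p['score'], 3))
```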

### Citing

If you use LongHeRo in your research, please cite [HeRo: RoBERTa and Longformer Hebrew Language Models](http://arxiv.org/abs/2304.11077).

```bibtex
@article{shalumov2023hero,
  title={HeRo: RoBERTa and Longformer Hebrew Language Models},
  author={Vitaly Shalumov and Harel Haskey},
  year={2023},
  journal={arXiv:2304.11077},
}
```