turkish-medium-bert-uncased
This is a Turkish Medium uncased BERT model, developed to fill the gap in small-sized BERT models for Turkish. Because the model is uncased, it does not distinguish between "turkish" and "Turkish".
⚠ Uncased use requires manual lowercase conversion
Do not use the do_lower_case=True flag with the tokenizer. Instead, lowercase your text manually, replacing "I" with the dotless "ı" first so that it is not mapped to "i":
text.replace("I", "ı").lower()
This is due to a known issue with the tokenizer.
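To illustrate why the replacement step matters, here is a minimal sketch that wraps the conversion above in a helper and compares it with plain lowercasing. The function name `turkish_lower` and the sample string are hypothetical and chosen only for this illustration.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the conversion recommended above."""
    # Map capital "I" to dotless "ı" first; str.lower() alone would yield "i".
    return text.replace("I", "ı").lower()


sample = "IŞIK ILIK"  # hypothetical sample text

print(sample.lower())         # plain lowercasing: işik ilik
print(turkish_lower(sample))  # recommended conversion: ışık ılık
```

Plain `str.lower()` is locale-independent, so it cannot know that the text is Turkish; doing the replacement explicitly keeps the preprocessing in your own code.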
Be aware that this model may produce biased predictions, as it was trained primarily on crawled data, which can contain various biases.
Other relevant information can be found in the paper.
Example Usage
from transformers import AutoTokenizer, BertForMaskedLM, pipeline
# Load the model weights from the Hugging Face Hub
model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-medium-bert-uncased")
# or load from the TensorFlow checkpoint instead:
# model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-medium-bert-uncased", from_tf=True)
tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-medium-bert-uncased")
# Build a fill-mask pipeline; the input text must already be lowercase
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("gelirken bir litre [MASK] aldım.")
[{'score': 0.6158884763717651,
'token': 11818,
'token_str': 'benzin',
'sequence': 'gelirken bir litre benzin aldım.'},
{'score': 0.1580735594034195,
'token': 2417,
'token_str': 'su',
'sequence': 'gelirken bir litre su aldım.'},
{'score': 0.07746931910514832,
'token': 29480,
'token_str': 'mazot',
'sequence': 'gelirken bir litre mazot aldım.'},
{'score': 0.0339476652443409,
'token': 4521,
'token_str': 'süt',
'sequence': 'gelirken bir litre süt aldım.'},
{'score': 0.021608062088489532,
'token': 7279,
'token_str': 'alkol',
'sequence': 'gelirken bir litre alkol aldım.'}]
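The example input above is already lowercase. When lowercasing raw text for fill-mask, note that applying the conversion to the whole string would also turn `[MASK]` into `[mask]`, which would likely no longer be recognized as the mask token. The following is a minimal sketch of one way to avoid that; the helper name `lower_keep_mask` is hypothetical, and the default mask string is assumed to match `tokenizer.mask_token`.

```python
def lower_keep_mask(text: str, mask_token: str = "[MASK]") -> str:
    """Lowercase text for the uncased model while leaving the mask token intact."""
    # Split around the mask token so it is never passed through lower().
    parts = text.split(mask_token)
    # Apply the recommended conversion to each surrounding piece.
    lowered = [part.replace("I", "ı").lower() for part in parts]
    # Re-insert the untouched mask token between the pieces.
    return mask_token.join(lowered)


print(lower_keep_mask("Gelirken bir litre [MASK] aldım."))
# gelirken bir litre [MASK] aldım.
```

The result can then be passed to `unmasker` as in the example above.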
Acknowledgments
- Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
- Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗
Citations
@article{kesgin2023developing,
title={Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models},
author={Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
journal={arXiv preprint arXiv:2307.14134},
year={2023}
}
License
MIT