---
language:
- ta
pipeline_tag: fill-mask
widget:
- text: தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்
- text: காந்தியோட [MASK] காந்தியா சார்? கோட்சே ஓட பையன் கோட்சே வா சார்? 
datasets:
- AnanthZeke/tamil_sentences_master_raw

---
# Model Card for Deepakvictor/tamil_bs_bert

# BERT base model 
Pretrained model on the Tamil language using a masked language modeling (MLM) objective. The BERT architecture was introduced in the paper cited at the end of this card, and this Tamil model was first released in this repository.

# Model description 
BERT is a transformers model originally pretrained on a large corpus of English data in a self-supervised fashion. In the same way, this model is pretrained on Tamil text with the objective of predicting a masked word (`[MASK]`).

Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
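
The masking step described above can be sketched in plain Python. This is a simplified illustration, not the training code used for this model: the function name `mask_tokens`, the whitespace tokenization, and the fixed seed are assumptions, and every selected token is replaced with `[MASK]` (the original BERT recipe additionally keeps or randomizes a fraction of the selected tokens).

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed at masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Whitespace splitting stands in for the real subword tokenizer (assumption).
sentence = "தமிழ் மொழியை வாழ்த்துவதற்கு உதவும் பொதுவான சொற்றொடர்களை அறிந்திருக்க வேண்டும்".split()
masked, labels = mask_tokens(sentence)
print(masked)
print(labels)
```

Each token is masked independently with probability `mask_prob`, so the masked fraction is 15% only on average; a short sentence may end up with no masked token at all.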

# Training of this model
This model was trained on the dataset AnanthZeke/tamil_sentences_master_raw. The first 10.6M sentences were used for training, with a batch size of 64.
The model reached an overall training loss of 0.687.
It reached an evaluation loss of 0.80; the evaluation set is the last 120,000 rows of the same dataset.

# Model variations
BERT was originally released in base and large variations, for cased and uncased input text.
Casing does not apply to this model, because the Tamil script has no upper or lower case.
This model is a BERT base model with 110M parameters.

| Model | #params | Language |
|------------------------|--------------------------------|-------|
| [`Deepakvictor/tamil_bs_bert`](https://huggingface.co/Deepakvictor/tamil_bs_bert) | 110M   | Tamil |

# Intended uses & limitations
You can use this raw model for masked language modeling, and it can be fine-tuned on a downstream task.
This model does not use WordPiece tokenization; it uses a different subword tokenization, so there is a higher chance that the predicted token for a masked position is a subword rather than a whole word.


### How to use
```python
from transformers import pipeline

# Build a fill-mask pipeline backed by this model
unmasker = pipeline('fill-mask', model='Deepakvictor/tamil_bs_bert')
unmasker("தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்")

# Example output:
[{'score': 0.14111991226673126,
  'token': 12540,
  'token_str': 'மொழியை',
  'sequence': 'தமிழ் மொழியை வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.0806930884718895,
  'token': 2461,
  'token_str': 'மக்களுக்கு',
  'sequence': 'தமிழ் மக்களுக்கு வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.016404788941144943,
  'token': 3461,
  'token_str': 'எழுத',
  'sequence': 'தமிழ் எழுத வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.015853099524974823,
  'token': 5849,
  'token_str': 'எழுதி',
  'sequence': 'தமிழ் எழுதி வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
 {'score': 0.015091801062226295,
  'token': 1107,
  'token_str': 'எப்படி',
  'sequence': 'தமிழ் எப்படி வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'}]

```

To use the model in PyTorch:

```python
# Load the model and tokenizer
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Deepakvictor/tamil_bs_bert")
model = AutoModelForMaskedLM.from_pretrained("Deepakvictor/tamil_bs_bert")

# Tokenize the input
inp = tokenizer("தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்", return_tensors="pt")

# Run a forward pass without tracking gradients
with torch.no_grad():
    out = model(**inp)

# Decode the most likely token at every position (argmax over the vocabulary).
# A softmax is not needed here because it does not change the argmax.
print(tokenizer.decode(out.logits.argmax(-1).view(-1).tolist(), skip_special_tokens=True))
```
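
For a single `[MASK]` position, the scores returned by the `fill-mask` pipeline correspond to a softmax over the vocabulary logits at that position, followed by a top-k selection. The sketch below illustrates this on a toy example; the helper name `top_k_from_logits`, the five-word vocabulary, and the logit values are made up for illustration and are not outputs of this model.

```python
import math

def top_k_from_logits(logits, vocab, k=3):
    """Softmax the logits and return the k most probable (token, score) pairs."""
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits at the masked position (not real model output).
toy_vocab = ["மொழியை", "மக்களுக்கு", "எழுத", "எழுதி", "எப்படி"]
toy_logits = [2.0, 1.4, -0.2, -0.25, -0.3]

for token, score in top_k_from_logits(toy_logits, toy_vocab):
    print(token, round(score, 3))
```

With the real model, the same idea applies to the row of `out.logits` at the index of the `[MASK]` token.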

# Limitations and bias 
As mentioned above, the model may predict a subword for the masked token.
Because the model was trained in a self-supervised fashion on raw text, its predictions may reflect biases present in the training data.

# Training data 
This BERT model was pretrained on [tamil-sentence](https://huggingface.co/datasets/AnanthZeke/tamil_sentences_master_raw).

# Training procedure 
## Preprocessing
A tokenizer was trained on the same dataset, [tamil-sentence](https://huggingface.co/datasets/AnanthZeke/tamil_sentences_master_raw), with a vocabulary size of 29677.
The details of the masking procedure for each sentence are the following:

- 15% of the tokens are masked.

## Pretraining
The model was trained on a P100 GPU on ten million sentences with a batch size of 64. The optimizer used is AdamW with a learning rate of 1e-5.
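
As a rough back-of-the-envelope check, the number of optimizer steps implied by these settings can be computed as below. The helper `optimizer_steps` is hypothetical, and it assumes a single pass over the data and no gradient accumulation, neither of which is stated above, so treat the result as an estimate.

```python
import math

def optimizer_steps(num_sentences, batch_size, epochs=1):
    """Number of optimizer steps, assuming one step per batch (no gradient accumulation)."""
    return math.ceil(num_sentences / batch_size) * epochs

# Ten million sentences with a batch size of 64, one epoch assumed.
print(optimizer_steps(10_000_000, 64))  # 156250
```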

# Evaluation results 
This BERT base model reaches an evaluation loss of 0.8 on 120,200 sentences.
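
If the reported loss is the mean cross-entropy over masked tokens measured in nats (the default for PyTorch's cross-entropy loss, but an assumption here since the card does not say so), it can be converted to a perplexity with an exponential. The helper name `perplexity` is illustrative.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(mean_cross_entropy)

eval_loss = 0.8  # evaluation loss reported above
print(round(perplexity(eval_loss), 2))  # 2.23
```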


### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```