IndicTrans2
This is the model card of IndicTrans2 En-Indic Distilled 200M variant.
Please refer to section 7.6: Distilled Models in the TMLR submission for further details on model training, data and metrics.
Usage Instructions
Please refer to the github repository for a detail description on how to use HF compatible IndicTrans2 models for inference.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransTokenizer import IndicProcessor
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
ip = IndicProcessor(inference=True)
input_sentences = [
"When I was young, I used to go to the park every day.",
"We watched a new movie last week, which was very inspiring.",
"If you had met me at that time, we would have gone out to eat.",
"My friend has invited me to his birthday party, and I will give him a gift.",
]
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
inputs = tokenizer(
batch,
truncation=True,
padding="longest",
return_tensors="pt",
return_attention_mask=True,
).to(DEVICE)
with torch.no_grad():
generated_tokens = model.generate(
**inputs,
use_cache=True,
min_length=0,
max_length=256,
num_beams=5,
num_return_sequences=1,
)
with tokenizer.as_target_tokenizer():
generated_tokens = tokenizer.batch_decode(
generated_tokens.detach().cpu().tolist(),
skip_special_tokens=True,
clean_up_tokenization_spaces=True,
)
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
for input_sentence, translation in zip(input_sentences, translations):
print(f"{src_lang}: {input_sentence}")
print(f"{tgt_lang}: {translation}")
Note: IndicTrans2 is now compatible with AutoTokenizer, however you need to use IndicProcessor from IndicTransTokenizer for preprocessing before tokenization.
Citation
If you consider using our work then please cite using:
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}