---
metrics:
  - Recall@10: 0.438
  - MRR@10: 0.247
base_model:
  - unicamp-dl/mt5-base-mmarco-v2
tags:
  - Information Retrieval
  - Natural Language Processing
  - Question Answering
license: apache-2.0
---

# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu. We created it by translating the MS MARCO dataset into Urdu with the IndicTrans2 model. To establish baseline performance, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR, and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.
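
For illustration, here is a minimal sketch of the translation step, assuming the ai4bharat/indictrans2-en-indic-1B checkpoint and the IndicTransToolkit preprocessing helpers; the exact checkpoint, batching, and generation settings used to build the dataset are not specified here.

```python
# Hypothetical sketch of translating English MS MARCO text into Urdu with
# IndicTrans2; details may differ from the pipeline actually used.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # pip install IndicTransToolkit

name = "ai4bharat/indictrans2-en-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(name, trust_remote_code=True)
ip = IndicProcessor(inference=True)

passages = ["Information about recent developments in Pakistan's economy."]
# eng_Latn -> urd_Arab are the IndicTrans2 language codes for English and Urdu
batch = ip.preprocess_batch(passages, src_lang="eng_Latn", tgt_lang="urd_Arab")
inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
generated = model.generate(**inputs, max_length=256, num_beams=5)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="urd_Arab"))
```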

## Model Details

### Model Description

- **Developed by:** Umer Butt
- **Model type:** MT5ForConditionalGeneration
- **Language(s) (NLP):** Urdu
- **Framework:** Python / PyTorch

## Uses

### Direct Use

## Bias, Risks, and Limitations

Although this model performs well and is, to our knowledge, state-of-the-art for Urdu IR at the time of writing, it is fine-tuned from an mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation model therefore carry over.

### Recommendations

## How to Get Started with the Model

Example code for scoring query-document pairs: in an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, and the scores can be used to rank the documents.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
model.eval()

# Define the query and candidate documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"  # "What is the current state of Pakistan's economy?"
document_1 = "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔"  # "Information about recent developments in Pakistan's economy."
document_2 = "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"  # "Football is rapidly becoming popular in Pakistan."

# mMARCO-style mT5 rerankers answer a "Query: ... Document: ... Relevant:"
# prompt with "yes" or "no"; the relevance score is the probability of "yes".
token_yes = tokenizer.encode("yes", add_special_tokens=False)[0]
token_no = tokenizer.encode("no", add_special_tokens=False)[0]

def get_score(query, document):
    """Return P("yes") for a query-document pair as the relevance score."""
    input_text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)

    # A seq2seq model needs decoder input: start from the decoder start token
    # and read the logits of the first generated position.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits[0, 0, :]

    # Normalize over the "yes"/"no" tokens only
    probs = torch.softmax(logits[[token_no, token_yes]], dim=0)
    return probs[1].item()

# Get scores for each document
score_1 = get_score(query, document_1)
score_2 = get_score(query, document_2)

print(f"Relevance score for Document 1: {score_1}")
print(f"Relevance score for Document 2: {score_2}")

# A higher score indicates higher relevance.
```
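
To rerank a list of candidates, sort them by score. A minimal sketch, reusing the `get_score` helper from the example above:

```python
# Order candidate documents by model relevance score, highest first
candidates = [document_1, document_2]
ranked = sorted(candidates, key=lambda d: get_score(query, d), reverse=True)
for rank, doc in enumerate(ranked, start=1):
    print(f"{rank}. {doc}")
```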

## Evaluation

The evaluation was done using scripts from the pygaggle library, specifically `evaluate_monot5_reranker.py` and `ms_marco_eval.py`.

### Metrics

Following the approach in the mMARCO work, the same two metrics were used:

- Recall@10: 0.438
- MRR@10: 0.247
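
For reference, a minimal sketch of how these per-query metrics are computed, assuming `ranked_ids` is the model's ranked list of document IDs for one query and `relevant_ids` the judged relevant IDs (the reported numbers come from the evaluation scripts above, not from this sketch):

```python
def mrr_at_10(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document within the top 10, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_10(ranked_ids, relevant_ids):
    """Fraction of the relevant documents that appear in the top 10."""
    hits = sum(1 for doc_id in ranked_ids[:10] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Corpus-level Recall@10 and MRR@10 are the means over all queries (6980 here).
```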

### Results

| Model | Data | Recall@10 | MRR@10 | Queries Ranked |
| --- | --- | --- | --- | --- |
| BM25 (k = 1000), baseline from mMARCO paper | English | 0.391 | 0.187 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2, mMARCO reranker baseline from paper | English | – | 0.370 | 6980 |
| BM25 (k = 1000) | Urdu | 0.2675 | 0.129 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2, zero-shot | Urdu | 0.408 | 0.204 | 6980 |
| Mavkif/urdu-mt5-mmarco (this work) | Urdu | 0.438 | 0.247 | 6980 |

## Model Architecture and Objective

```json
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```

For more details on how to customize decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.

## Model Card Authors

Umer Butt

## Model Card Contact

mumertbutt@gmail.com