---
metrics:
- Recall @10 0.438
- MRR @10 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---
# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval
As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
We translated the MS MARCO dataset into Urdu using the IndicTrans2 model.
To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR,
and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.
## Model Details
### Model Description
- **Developed by:** Umer Butt
- **Model type:** MT5ForConditionalGeneration
- **Language(s) (NLP):** Urdu
- **Framework:** Python/PyTorch
## Uses
### Direct Use
## Bias, Risks, and Limitations
Although this model currently gives the best Urdu results among the systems compared in the Results section, it is fine-tuned from the mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation model therefore apply here too.
### Recommendations
## How to Get Started with the Model
Example Code for Scoring Query-Document Pairs:
In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, which can be used for ranking.
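```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
model.eval()  # disable dropout for inference

# Define the query and candidate documents
# Query: "What is the current state of Pakistan's economy?"
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
# Document 1: "Information about recent growth in Pakistan's economy."
document_1 = "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔"
# Document 2: "Football is rapidly becoming popular in Pakistan."
document_2 = "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"

# Tokenize a query-document pair and calculate its relevance score
def get_score(query, document):
    input_text = f"Query: {query} Document: {document}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
    # A seq2seq forward pass needs decoder inputs; start from the decoder start token
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    # Pass through the model without tracking gradients and get the logits
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
    score = outputs.logits[0, -1, :]  # logits at the last (here: only) decoder position
    # Probability assigned to the EOS token, used as the relevance score in this example
    return torch.softmax(score, dim=0)[tokenizer.eos_token_id].item()

# Get scores for each document
score_1 = get_score(query, document_1)
score_2 = get_score(query, document_2)
print(f"Relevance Score for Document 1: {score_1}")
print(f"Relevance Score for Document 2: {score_2}")
# Higher score indicates higher relevance
```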
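Once every candidate has a score, re-ranking is a sort by that score. The sketch below is only an illustration and not the evaluation code used for this model: `two_token_relevance` and `rank_documents` are hypothetical helper names, and the logit values are made-up numbers. It also shows a scoring rule commonly used with monoT5-style rerankers, a softmax over the logits of a "relevant" and a "not relevant" target token at the first decoding step. Whether that rule matches how this checkpoint was trained is an assumption you should verify.

```python
import math

def two_token_relevance(negative_logit, positive_logit):
    """Softmax over two logits; returns the probability of the positive token."""
    m = max(negative_logit, positive_logit)  # subtract the max for numerical stability
    neg = math.exp(negative_logit - m)
    pos = math.exp(positive_logit - m)
    return pos / (pos + neg)

def rank_documents(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

# Made-up (negative, positive) logits for three candidate passages.
candidates = ["passage A", "passage B", "passage C"]
logit_pairs = [(2.0, -1.0), (-0.5, 3.0), (0.0, 0.0)]

scores = [two_token_relevance(neg, pos) for neg, pos in logit_pairs]
ranking = rank_documents(candidates, scores)

for rank, (doc, score) in enumerate(ranking, start=1):
    print(f"{rank}. {doc} (score={score:.3f})")
```

In practice the two logits would come from the model's output at the first decoder position, indexed by the ids of the two target tokens in the tokenizer's vocabulary.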
## Evaluation
The evaluation used the following scripts from the pygaggle library:
- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`
### Metrics
Following the mMARCO work, we report the same two metrics:
- Recall@10: 0.438
- MRR@10: 0.247
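For readers unfamiliar with these metrics, here is a minimal sketch of how MRR@10 and Recall@10 are usually computed from a ranked run and a set of relevance judgments. It uses the standard textbook definitions; the pygaggle scripts may differ in details such as tie handling. The function names and the toy `run` and `qrels` data are hypothetical.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant id within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant ids that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy data: ranked passage ids per query, and the judged-relevant ids.
run = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d9", "d4", "d2"],
}
qrels = {
    "q1": {"d1"},
    "q2": {"d2", "d5"},
}

# Average each metric over all queries in the run.
mean_mrr = sum(mrr_at_k(run[q], qrels[q]) for q in run) / len(run)
mean_recall = sum(recall_at_k(run[q], qrels[q]) for q in run) / len(run)

print(f"MRR@10:    {mean_mrr:.3f}")
print(f"Recall@10: {mean_recall:.3f}")
```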
### Results
| Model | Name | Data | Recall@10 | MRR@10 | Queries Ranked |
|---------------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
| bm25 (k = 1000) | BM25 - Baseline from mmarco paper | English data | 0.391 | 0.187 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2 | mmarco reranker - Baseline from paper | English data | | 0.370 | 6980 |
| bm25 (k = 1000) | BM25 | Urdu data | 0.2675 | 0.129 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mmarco | Urdu data | 0.408 | 0.204 | 6980 |
| Mavkif/urdu-mt5-mmarco | This work | Urdu data | 0.438 | 0.247 | 6980 |
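To put the Urdu rows of the table in perspective, the short sketch below computes the relative change of the fine-tuned model over the two Urdu baselines. The numbers are copied from the table; `relative_gain` is a hypothetical helper name, and the script is plain arithmetic rather than part of the evaluation pipeline.

```python
def relative_gain(baseline, new):
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

# Urdu-data rows from the results table.
baselines = {
    "BM25": {"recall@10": 0.2675, "mrr@10": 0.129},
    "Zero-shot mt5-base-mmarco-v2": {"recall@10": 0.408, "mrr@10": 0.204},
}
fine_tuned = {"recall@10": 0.438, "mrr@10": 0.247}

# Relative gain of the fine-tuned model over each baseline, per metric.
gains = {
    name: {metric: relative_gain(row[metric], fine_tuned[metric]) for metric in row}
    for name, row in baselines.items()
}

for name, row in gains.items():
    for metric, gain in row.items():
        print(f"{name:30s} {metric:10s} {gain * 100:+.1f}%")
```

Against the zero-shot baseline this comes to roughly a 21% relative gain in MRR@10 and about 7% in Recall@10.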
### Model Architecture and Objective
```
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```
For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.
## Model Card Authors
Umer Butt
## Model Card Contact
mumertbutt@gmail.com