---
metrics: 
- Recall @10 0.438
- MRR @10 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---

# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model targets low-resource languages, specifically Urdu.
We built the training data by translating the MS MARCO dataset into Urdu with the IndicTrans2 model.
To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR,
and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.
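
The fine-tuning setup described above can be pictured as turning each (query, relevant passage, non-relevant passage) triple into two text-to-text examples. The sketch below is illustrative only: the `build_examples` helper is hypothetical, and the input template and the "yes"/"no" targets are assumptions based on the monoT5-style convention used in the mMARCO work, not a copy of the actual training script.

```python
# Hypothetical helper: converts one training triple into two seq2seq examples.
# Assumption: monoT5-style template with "yes"/"no" targets (mMARCO convention).
TEMPLATE = "Query: {query} Document: {document} Relevant:"


def build_examples(query, positive, negative):
    """Return [(input_text, target_text), ...] for one training triple."""
    return [
        (TEMPLATE.format(query=query, document=positive), "yes"),
        (TEMPLATE.format(query=query, document=negative), "no"),
    ]


examples = build_examples(
    query="پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟",
    positive="پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔",
    negative="فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔",
)

for input_text, target_text in examples:
    print(target_text, "<-", input_text)
```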

## Model Details

### Model Description




- **Developed by:** Umer Butt
- **Model type:** MT5ForConditionalGeneration
- **Language(s) (NLP):** Urdu
- **Framework:** Python/PyTorch



## Uses



### Direct Use




## Bias, Risks, and Limitations

Although this model performs well and is currently state-of-the-art, it is fine-tuned from an mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation therefore apply here too.

### Recommendations


## How to Get Started with the Model

Example Code for Scoring Query-Document Pairs:
In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, which can be used for ranking.
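```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
model.eval()

# Assumption: as with the base mMARCO reranker in pygaggle, relevance is read
# from the "▁yes" / "▁no" tokens. Adjust these if your checkpoint differs.
vocab = tokenizer.get_vocab()
yes_id, no_id = vocab["▁yes"], vocab["▁no"]

# Define the query and candidate documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
document_1 = "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔"
document_2 = "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"

# Tokenize a query-document pair and compute its relevance score
def get_score(query, document):
    # monoT5-style prompt (assumed to match the pygaggle evaluation setup)
    input_text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)

    # A seq2seq model needs decoder input: feed only the decoder start token
    # so the logits describe the first generated token.
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits[0, -1, :]  # logits for the first decoded token

    # Probability of "yes" relative to "no"
    return torch.softmax(logits[[no_id, yes_id]], dim=0)[1].item()

# Get scores for each document
score_1 = get_score(query, document_1)
score_2 = get_score(query, document_2)

print(f"Relevance Score for Document 1: {score_1}")
print(f"Relevance Score for Document 2: {score_2}")

# Higher score indicates higher relevance
```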
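
Once each candidate has a relevance score, ranking is a matter of sorting by that score. The following is a minimal sketch of that step; `rank_documents` is a hypothetical helper, and the scores are placeholders standing in for the output of a scoring function rather than real model outputs.

```python
def rank_documents(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only, not produced by the model.
candidates = ["doc_a", "doc_b", "doc_c"]
placeholder_scores = [0.12, 0.91, 0.47]

ranking = rank_documents(candidates, placeholder_scores)
for position, (doc, score) in enumerate(ranking, start=1):
    print(position, doc, score)
```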



## Evaluation

The evaluation was done using the following scripts from the pygaggle library:

- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`

### Metrics
Following the mMARCO work, the same two metrics were used:

- Recall@10: 0.438
- MRR@10: 0.247
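
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from ranked results. It is an illustration under stated assumptions, not the pygaggle implementation: the function names, the `run`/`qrels` dictionaries, and the toy document IDs are all hypothetical, and the official scripts may differ in details such as how queries without results are handled.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, k=10):
    """Average both metrics over all queries that have relevance judgments."""
    queries = list(qrels)
    mrr = sum(mrr_at_k(run.get(q, []), qrels[q], k) for q in queries) / len(queries)
    recall = sum(recall_at_k(run.get(q, []), qrels[q], k) for q in queries) / len(queries)
    return {"mrr": mrr, "recall": recall}


# Toy data for illustration only.
toy_qrels = {"q1": {"d3"}, "q2": {"d7", "d9"}}
toy_run = {"q1": ["d1", "d3", "d5"], "q2": ["d7", "d2", "d4"]}

metrics = evaluate(toy_run, toy_qrels)
print(metrics)
```

In this toy run, the first query finds its only relevant document at rank 2 and the second finds one of its two relevant documents at rank 1, so both averages come out to 0.75.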


### Results

| Model                                 | Name                                  | Data         | Recall@10 | MRR@10 | Queries Ranked |
|---------------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
| BM25 (k = 1000)                       | BM25 - Baseline from mMARCO paper     | English data | 0.391     | 0.187  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2         | mMARCO reranker - Baseline from paper | English data |           | 0.370  | 6980           |
| BM25 (k = 1000)                       | BM25                                  | Urdu data    | 0.2675    | 0.129  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2         | Zero-shot mMARCO                      | Urdu data    | 0.408     | 0.204  | 6980           |
| Mavkif/urdu-mt5-mmarco                | This work                             | Urdu data    | 0.438     | 0.247  | 6980           |





### Model Architecture and Objective

```json
{
    "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
    "architectures": ["MT5ForConditionalGeneration"],
    "d_model": 768,
    "num_heads": 12,
    "num_layers": 12,
    "dropout_rate": 0.1,
    "vocab_size": 250112,
    "model_type": "mt5",
    "transformers_version": "4.38.2"
}
```

For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.


## Model Card Authors

Umer Butt 


## Model Card Contact

mumertbutt@gmail.com