|
--- |
|
metrics: |
|
- Recall @10 0.438 |
|
- MRR @10 0.247 |
|
base_model: |
|
- unicamp-dl/mt5-base-mmarco-v2 |
|
tags: |
|
- Information Retrieval |
|
- Natural Language Processing |
|
- Question Answering |
|
license: apache-2.0 |
|
--- |
|
|
|
# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval |
|
|
|
As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu. |
|
We created this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model.

To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR,

and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology; a minimal sketch of this training setup is shown below.
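
The fine-tuning follows the monoT5-style recipe used by mMARCO: each training example pairs a query with a relevant or non-relevant passage and trains the model to emit "yes" or "no". The sketch below shows how such examples can be constructed with the Transformers tokenizer; the helper function and the "yes"/"no" label tokens are illustrative assumptions, not the exact training code used for this model.

```
from transformers import AutoTokenizer

# Hypothetical example: building monoT5-style training pairs from the translated data
tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/mt5-base-mmarco-v2")

def build_example(query, passage, is_relevant):
    # Input follows the monoT5 prompt; the target is a single "yes"/"no" word
    source = f"Query: {query} Document: {passage} Relevant:"
    target = "yes" if is_relevant else "no"
    model_inputs = tokenizer(source, truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(target).input_ids
    return model_inputs

# One positive and one negative pair from the translated data (illustrative only)
example_pos = build_example("پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟",
                            "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔", True)
example_neg = build_example("پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟",
                            "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔", False)
```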
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
Urdu-mT5-mmarco is an mT5-base reranker for Urdu information retrieval, fine-tuned from unicamp-dl/mt5-base-mmarco-v2 on an Urdu translation of the MS MARCO passage-ranking dataset.
|
|
|
|
|
|
|
- **Developed by:** Umer Butt |
|
- **Model type:** MT5ForConditionalGeneration |
|
- **Language(s) (NLP):** Urdu

- **Framework:** PyTorch (Hugging Face Transformers)

- **Finetuned from model:** unicamp-dl/mt5-base-mmarco-v2
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
|
|
### Direct Use

The model is intended to be used as a reranker for Urdu retrieval: given an Urdu query and a set of candidate passages (for example, the top results from BM25), it assigns each passage a relevance score that is used to reorder the candidates (see "How to Get Started with the Model" below).
|
|
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Although this model performs well and is, to our knowledge, state-of-the-art for Urdu retrieval at the time of writing, it was fine-tuned from the mMARCO reranker on a machine-translated dataset (created with the IndicTrans2 model). The limitations and biases of both the base model and the translation model therefore carry over to this model.
|
|
|
### Recommendations |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Example code for scoring query-document pairs:

In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, and the scores can be used for ranking. The example below follows the monoT5/pygaggle convention used by the mMARCO rerankers: the pair is formatted as `Query: ... Document: ... Relevant:` and the relevance score is the probability of the "yes" token versus the "no" token at the first decoded position (the exact label tokens are an assumption carried over from pygaggle).
|
``` |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
model.eval()

# Define the query and candidate documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
document_1 = "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔"
document_2 = "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"

# Ids of the "yes"/"no" label tokens (monoT5/pygaggle convention; assumed for this model)
token_true = tokenizer.encode("yes", add_special_tokens=False)[0]
token_false = tokenizer.encode("no", add_special_tokens=False)[0]

# Score one query-document pair: probability of "yes" at the first decoded position
def get_score(query, document):
    input_text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)

    # A seq2seq model needs decoder input ids; start decoding from the pad token
    decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])

    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)

    # Compare the logits of the "no" and "yes" tokens and normalize
    logits = outputs.logits[0, 0, [token_false, token_true]]
    probs = torch.softmax(logits, dim=0)
    return probs[1].item()  # probability of "yes" (relevant)

# Get scores for each document
score_1 = get_score(query, document_1)
score_2 = get_score(query, document_2)

print(f"Relevance Score for Document 1: {score_1}")
print(f"Relevance Score for Document 2: {score_2}")

# Higher score indicates higher relevance
|
|
|
``` |
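
To use the model as a reranker over a candidate list (for example, the top passages returned by BM25), the same `get_score` function can be applied to every candidate and the candidates sorted by score. A minimal sketch, reusing the variables from the example above (the candidate list here is hypothetical):

```
# Hypothetical candidate list, e.g. the top passages returned by BM25
candidates = [document_1, document_2]

# Rerank: score every candidate and sort in descending order of relevance
ranked = sorted(candidates, key=lambda d: get_score(query, d), reverse=True)

for rank, doc in enumerate(ranked, start=1):
    print(rank, doc)
```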
|
|
|
|
|
|
|
## Evaluation |
|
|
|
The evaluation was done using scripts from the pygaggle library, specifically:

- `evaluate_monot5_reranker.py`

- `ms_marco_eval.py`
|
|
|
#### Metrics |
|
Following the approach in the mMARCO work, the same two metrics were used:

Recall@10: 0.438

MRR@10: 0.247
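
For reference, the two metrics can be computed from a run as follows. This is a minimal sketch, not the pygaggle implementation: `run` maps each query id to its ranked list of passage ids and `qrels` maps each query id to its set of relevant passage ids (both names are illustrative).

```
# Illustrative data structures:
#   run[qid]   -> ranked list of passage ids for that query
#   qrels[qid] -> set of relevant passage ids for that query
def mrr_at_10(run, qrels):
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

def recall_at_10(run, qrels):
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        if relevant:
            total += len(relevant & set(ranking[:10])) / len(relevant)
    return total / len(run)
```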
|
|
|
|
|
### Results |
|
|
|
| Model | Name | Data | Recall@10 | MRR@10 | Queries Ranked | |
|
|---------------------------------------|---------------------------------------|--------------|-----------|--------|----------------| |
|
| bm25 (k = 1000) | BM25 - Baseline from mmarco paper | English data | 0.391 | 0.187 | 6980 | |
|
| unicamp-dl/mt5-base-mmarco-v2 | mmarco reranker - Baseline from paper | English data | | 0.370 | 6980 | |
|
| bm25 (k = 1000) | BM25 | Urdu data | 0.2675 | 0.129 | 6980 | |
|
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mmarco | Urdu data | 0.408 | 0.204 | 6980 | |
|
| This work | Mavkif/urdu-mt5-mmarco | Urdu data | 0.438 | 0.247 | 6980 | |
|
|
|
|
|
|
|
|
|
|
|
### Model Architecture and Objective |
|
```
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```
|
For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.
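
As a minimal illustration of passing these decoding parameters to `generate` (the values and the placeholder input below are arbitrary; for relevance scoring, the forward-pass approach in the example above is used instead):

```
# Arbitrary decoding settings, shown only to illustrate the generate() parameters
outputs = model.generate(
    **tokenizer("Query: ... Document: ... Relevant:", return_tensors="pt"),
    max_length=2,
    num_beams=2,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```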
|
|
|
|
|
## Model Card Authors
|
|
|
Umer Butt |
|
|
|
|
|
## Model Card Contact |
|
|
|
mumertbutt@gmail.com |