---
metrics:
- Recall@10: 0.438
- MRR@10: 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---

# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu. We created this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model. To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR, and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.

## Model Details

### Model Description

- **Developed by:** Umer Butt
- **Model type:** MT5ForConditionalGeneration
- **Language(s) (NLP):** Urdu (implemented in Python/PyTorch)
- **Finetuned from model:** unicamp-dl/mt5-base-mmarco-v2

## Uses

### Direct Use

The model is a sequence-to-sequence reranker: given an Urdu query and a candidate passage (for example, from a first-stage BM25 retrieval), it produces a relevance score that can be used to rerank the candidates.

## Bias, Risks, and Limitations

Although this model performs well and is, to our knowledge, state-of-the-art for Urdu IR, it was fine-tuned from the mMARCO reranker on a dataset machine-translated with IndicTrans2. The limitations of both the base model and the translation model therefore carry over, including any translation artifacts in the training data.

### Recommendations

Users should validate the model on their own data before deployment, particularly for domains that differ from MS MARCO-style web passages.

## How to Get Started with the Model

Example code for scoring query-document pairs. In an IR setting, you provide a query and one or more candidate documents; the model scores each document for relevance to the query, and the scores can be used for ranking. Following the monoT5 setup used by the mMARCO rerankers, the relevance score below is taken as the probability of the model generating "yes" (rather than "no") as its first output token.

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
model.eval()

# Define the query and candidate documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"  # "What is the current state of Pakistan's economy?"
document_1 = "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔"  # "Information about recent developments in Pakistan's economy." (on-topic)
document_2 = "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"  # "Football is rapidly becoming popular in Pakistan." (off-topic)

# "▁yes"/"▁no" are the relevance tokens pygaggle uses for mT5 mMARCO rerankers
YES_ID = tokenizer.convert_tokens_to_ids("▁yes")
NO_ID = tokenizer.convert_tokens_to_ids("▁no")

def get_score(query, document):
    # monoT5-style input format
    input_text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    # A seq2seq model needs a decoder input; feed only the decoder start token
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1, :]
    # Relevance = probability of "yes", normalized against "no"
    return torch.softmax(logits[[NO_ID, YES_ID]], dim=0)[1].item()

# Get scores for each document
score_1 = get_score(query, document_1)
score_2 = get_score(query, document_2)

print(f"Relevance Score for Document 1: {score_1}")
print(f"Relevance Score for Document 2: {score_2}")
# Higher score indicates higher relevance; to rerank a candidate list,
# score each document and sort in descending order.
```

## Evaluation

The evaluation was done using scripts from the pygaggle library, specifically:

- evaluate_monot5_reranker.py
- ms_marco_eval.py

#### Metrics

Following the approach in the mMARCO work, the same two metrics were used: Recall@10 and MRR@10.
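Recall@10 measures the average fraction of a query's relevant passages that appear in the top 10 results, while MRR@10 is the mean reciprocal rank of the first relevant passage within the top 10 (counted as 0 when none appears). The snippet below is a minimal illustrative sketch of these computations; the names `rankings`, `qrels`, `recall_at_10`, and `mrr_at_10` are placeholders, and the authoritative implementation is the ms_marco_eval.py script mentioned above.

```
# Illustrative sketch only; the actual evaluation used pygaggle's
# ms_marco_eval.py. `rankings` maps query id -> ranked passage ids
# (best first); `qrels` maps query id -> set of relevant passage ids.

def recall_at_10(rankings, qrels):
    per_query = [
        len(set(ranked[:10]) & qrels[qid]) / len(qrels[qid])
        for qid, ranked in rankings.items()
    ]
    return sum(per_query) / len(per_query)

def mrr_at_10(rankings, qrels):
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:10], start=1):
            if pid in qrels[qid]:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)
```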
On this benchmark, the model achieves:

- Recall@10: 0.438
- MRR@10: 0.247

### Results

| Model | Name | Data | Recall@10 | MRR@10 | Queries Ranked |
|-------|------|------|-----------|--------|----------------|
| bm25 (k = 1000) | BM25 - baseline from mMARCO paper | English data | 0.391 | 0.187 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2 | mMARCO reranker - baseline from paper | English data | | 0.370 | 6980 |
| bm25 (k = 1000) | BM25 | Urdu data | 0.2675 | 0.129 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mMARCO | Urdu data | 0.408 | 0.204 | 6980 |
| This work | Mavkif/urdu-mt5-mmarco | Urdu data | 0.438 | 0.247 | 6980 |

### Model Architecture and Objective

```
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```

For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.

## Model Card Authors

Umer Butt

## Model Card Contact

mumertbutt@gmail.com