Mavkif/urdu-mt5-mmarco

Safetensors

mt5

Information Retrieval

Natural Language Processing

Question Answering

Mavkif committed on Nov 2, 2024

Commit

603d977

verified ·

1 Parent(s): ed6cbe0

Update README.md

Files changed (1)

README.md +86 -42

@@ -45,47 +45,6 @@ and then applied fine-tuning with the mMARCO multilingual IR methodology on the
 Although this model performs well and is currently state-of-the-art, it is fine-tuned on an mMARCO model and a translated dataset (created with the IndicTrans2 model), so the limitations of both apply here too.
-### Recommendations
-## How to Get Started with the Model
-Example Code for Scoring Query-Document Pairs:
-In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, which can be used for ranking.
-```
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
-import torch
-# Load the tokenizer and model
-tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
-model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
-# Define the query and candidate documents
-query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
-document_1 = "پاکستان کی معیشت میں حالیہ ترقی کے بارے میں معلومات۔"
-document_2 = "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"
-# Tokenize query-document pairs and calculate relevance scores
-def get_score(query, document):
-    input_text = f"Query: {query} Document: {document}"
-    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
-    # Pass through the model and get the relevance score (logits)
-    outputs = model(**inputs)
-    score = outputs.logits[0, -1, :]  # last token logits
-    return torch.softmax(score, dim=0)[tokenizer.eos_token_id].item()
-# Get scores for each document
-score_1 = get_score(query, document_1)
-score_2 = get_score(query, document_2)
-print(f"Relevance Score for Document 1: {score_1}")
-print(f"Relevance Score for Document 2: {score_2}")
-# Higher score indicates higher relevance
-```
 ## Evaluation
@@ -130,6 +89,91 @@ MRR @10 : 0.247
 For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.
 ## Model Card Authors [optional]
 Umer Butt
@@ -137,4 +181,4 @@ Umer Butt
 ## Model Card Contact
-mumertbutt@gmail.com

 For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.
+## How to Get Started with the Model
+Example Code for Scoring Query-Document Pairs:
+In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, which can be used for ranking.
+```
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+import torch
+import torch.nn.functional as F
+# Load the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
+model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+def rank_documents(query, documents):
+    # Create input pairs of query and documents
+    query_document_pairs = [f"{query} [SEP] {doc}" for doc in documents]
+    # Tokenize the input pairs
+    inputs = tokenizer(query_document_pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
+    inputs = {k: v.to(device) for k, v in inputs.items()}
+    # Generate decoder input ids (starting with the decoder start token)
+    decoder_input_ids = torch.full(
+        (inputs["input_ids"].shape[0], 1), model.config.decoder_start_token_id, dtype=torch.long, device=device
+    )
+    # Perform inference to get the logits
+    with torch.no_grad():
+        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
+    # Get the logits for the sequence output
+    logits = outputs.logits
+    # Extract the probabilities for the generated sequence
+    scores = []
+    for idx, doc in enumerate(documents):
+        # Softmax over the vocabulary; doc_logits has shape (1, vocab_size) because only one decoder step is run
+        doc_logits = logits[idx]
+        doc_probs = F.softmax(doc_logits, dim=-1)
+        # Get the probability of the "ہاں" token at that decoder step
+        token_true_id = tokenizer.convert_tokens_to_ids("ہاں")
+        token_probs = doc_probs[:, token_true_id]
+        sum_prob = token_probs.sum().item()  # Sum over decoder steps (a single step here)
+        scores.append((doc, sum_prob))  # Use the summed probability directly as the score
+    # Normalize scores to be between 0 and 1
+    max_score = max(score for _, score in scores)
+    min_score = min(score for _, score in scores)
+    normalized_scores = [((score - min_score) / (max_score - min_score) if max_score > min_score else 0.5) for _, score in scores]
+    # Create a list of documents with normalized scores
+    ranked_documents = [(documents[idx], normalized_scores[idx]) for idx in range(len(documents))]
+    # Sort documents based on scores (descending order)
+    ranked_documents = sorted(ranked_documents, key=lambda x: x[1], reverse=True)
+    return ranked_documents
+# Example query and documents
+query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
+documents = [
+    "پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",
+    "زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔",
+    "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"
+]
+# Get ranked documents
+ranked_docs = rank_documents(query, documents)
+# Print the ranked documents
+for idx, (doc, score) in enumerate(ranked_docs):
+    print(f"Rank {idx + 1}: Score: {score}, Document: {doc}")
+```
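The normalization step in `rank_documents` is min-max scaling over the candidate list. The resulting scores are therefore relative to that list, not absolute relevance probabilities: the best candidate always gets 1.0 and the worst 0.0. Below is a minimal standalone sketch of that step. The helper name `min_max_normalize` is hypothetical and the raw scores are made up for illustration; they are not real model outputs.

```python
def min_max_normalize(raw_scores):
    """Scale scores to [0, 1] relative to the given list (hypothetical helper)."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # Same fallback as rank_documents: every candidate ties at 0.5
        return [0.5 for _ in raw_scores]
    return [(s - lo) / (hi - lo) for s in raw_scores]


# Illustrative raw probabilities, not real model outputs
print(min_max_normalize([0.75, 0.25, 0.5]))  # [1.0, 0.0, 0.5]
print(min_max_normalize([0.3, 0.3]))         # [0.5, 0.5]
```

Because of this rescaling, normalized scores from different queries or different candidate lists are not directly comparable. If comparability matters, the unnormalized probabilities may be the better thing to keep.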
 ## Model Card Authors [optional]
 Umer Butt
 ## Model Card Contact
+mumertbutt@gmail.com