---
license: cc-by-nc-nd-4.0
language:
- te
datasets:
- MIRACL
tags:
- miniMiracle
- passage-retrieval
- knowledge-distillation
- middle-training
- sentence-transformers
pretty_name: >-
  miniMiracle is a family of High-quality, Light Weight and Easy-to-deploy
  multilingual embedders / retrievers, primarily focussed on Indo-Aryan and
  Indo-Dravidian Languages.
library_name: transformers
pipeline_tag: sentence-similarity
---
<center>
<img src="./logo.png" width=250/>
<img src="./te_intro.png" width=120%/>
</center>
<center>
<img src="./te_metrics_1.png" width=90%/>
<b><p>Table 1: Telugu retrieval performance on the MIRACL dev set (measured by nDCG@10)</p></b>
</center>
## Architecture:
- Model: BERT.
- Tokenizer: XLM-Roberta's Tokenizer.
<br/>
<center>
<h1> Table Of Contents </h1>
</center>
- [License and Terms:](#license-and-terms)
- [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
- [ONNX & GGUF Status:](#onnx--gguf-status)
- [Usage:](#usage)
  - [With Sentence Transformers:](#with-sentence-transformers)
  - [With Huggingface Transformers:](#with-huggingface-transformers)
- [FAQs](#faqs)
  - [How can I reduce overall inference cost?](#how-can-i-reduce-overall-inference-cost)
  - [How do I reduce vector storage cost?](#how-do-i-reduce-vector-storage-cost)
  - [How do I offer hybrid search to improve accuracy?](#how-do-i-offer-hybrid-search-to-improve-accuracy)
  - [Why not run MTEB?](#why-not-run-mteb)
- [Roadmap](#roadmap)
- [Notes on Reproducing:](#notes-on-reproducing)
- [Reference:](#reference)
- [Note on model bias](#note-on-model-bias)
# License and Terms:
<center>
<img src="./terms.png" width=200%/>
</center>
## Detailed comparison & Our Contribution:
English famously has the **all-minilm** series of models, which are great for quick experimentation and for certain production workloads. The idea is to offer the same for other popular languages, starting with Indo-Aryan and Indo-Dravidian languages. Our innovation is in bringing high-quality models that are easy to serve and whose embeddings are cheaper to store, without ANY pretraining or expensive finetuning. For instance, the **all-minilm** models are finetuned on 1 billion pairs. We offer a very lean model, but with a huge vocabulary of around 250K tokens.
We will add more details here.
<center>
<img src="./te_metrics_2.png" width=120%/>
<b><p>Table 2: Detailed Telugu retrieval performance on the MIRACL dev set (measured by nDCG@10)</p></b>
</center>
Full set of evaluation numbers for our model:
```python
{'NDCG@1': 0.45773, 'NDCG@3': 0.58701, 'NDCG@5': 0.60938, 'NDCG@10': 0.63416, 'NDCG@100': 0.66138, 'NDCG@1000': 0.6682}
{'MAP@1': 0.45129, 'MAP@3': 0.55509, 'MAP@5': 0.56774, 'MAP@10': 0.57728, 'MAP@100': 0.58319, 'MAP@1000': 0.58346}
{'Recall@10': 0.79247, 'Recall@50': 0.89936, 'Recall@100': 0.93639, 'Recall@200': 0.96276, 'Recall@500': 0.97967, 'Recall@1000': 0.98933}
{'P@1': 0.45773, 'P@3': 0.22947, 'P@5': 0.14903, 'P@10': 0.08152, 'P@100': 0.00965, 'P@1000': 0.00102}
{'MRR@10': 0.5813, 'MRR@100': 0.58704, 'MRR@1000': 0.58729}
```
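
For readers unfamiliar with the headline metric, here is a minimal sketch of how nDCG@10 can be computed for a single query. It assumes binary relevance labels and a linear gain; the function names and the toy ranking are hypothetical, and this is not the evaluation script used to produce the numbers above.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query: two relevant passages exist and the system
# returns them at ranks 1 and 3 of its top 10.
ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranked, all_relevances=[1, 1], k=10)
print(round(score, 4))
```

The benchmark figure is the mean of this per-query score over all dev-set queries.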
<br/>
# ONNX & GGUF Status:
|Variant| Status |
|:---:|:---:|
|FP16 ONNX | ✅ |
|GGUF | WIP|
# Usage:
#### With Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer
import scipy.spatial
model = SentenceTransformer('prithivida/miniMiracle_te_v1')
corpus = [
'ఒక వ్యక్తి ఆహారం తింటున్నాడు.',
'ప్రజలు రొట్టె ముక్క తింటారు.',
'అమ్మాయి ఒక బిడ్డను ఎత్తుకుందు.',
'ఒక వ్యక్తి గుర్రం మీద సవారీ చేస్తున్నాడు.',
'ఒక మహిళ వయోలిన్ వాయిస్తోంది.',
'రెండు వ్యక్తులు అడవిలో కారును తోస్తున్నారు.',
'ఒక వ్యక్తి ఒక తెల్ల గుర్రం మీద ఒక మూసిన ప్రదేశంలో సవారీ చేస్తున్నాడు.',
'ఒక కోతి డ్రమ్ వాయిస్తోంది.',
'ఒక చిరుత తన వేట వెనుక పరుగెడుతోంది.',
'ప్రజలు పెద్ద భోజనాన్ని ఆస్వాదించారు.'
]
queries = [
'ఒక వ్యక్తి పాస్తా తింటున్నాడు.',
'ఒక గొరిల్లా సూట్ ధరించిన వ్యక్తి డ్రమ్ వాయిస్తోంది.'
]
corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode(queries)
# Find the closest 3 corpus sentences for each query, ranked by cosine similarity
closest_n = 3
for query, query_embedding in zip(queries, query_embeddings):
    # cdist returns the cosine *distance* (1 - cosine similarity); smaller is closer
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Pair each corpus index with its distance and sort ascending
    results = sorted(enumerate(distances), key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))

# Optional: how to quantize the embeddings
# from sentence_transformers.quantization import quantize_embeddings
# binary_embeddings = quantize_embeddings(corpus_embeddings, precision="ubinary")
```
#### With Huggingface Transformers:
- T.B.A
# FAQs
#### How can I reduce overall inference cost?
- You can host these models without the heavy torch dependency by using their ONNX flavours via the [FlashEmbed](https://github.com/PrithivirajDamodaran/flashembed) library.
#### How do I reduce vector storage cost?
[Use Binary and Scalar Quantisation](https://huggingface.co/blog/embedding-quantization)
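
As a rough illustration of why binary quantisation saves space, the sketch below keeps only the sign of each dimension and packs 8 dimensions into one byte, then compares vectors by Hamming distance. The helper names and the 16-dimensional toy vectors are hypothetical; in practice `quantize_embeddings` from `sentence_transformers.quantization` does the packing for you.

```python
def binarize(embedding):
    """Pack the sign bits of a float vector into bytes (1 bit per dimension)."""
    packed = bytearray()
    for start in range(0, len(embedding), 8):
        chunk = embedding[start:start + 8]
        byte = 0
        for value in chunk:
            byte = (byte << 1) | (1 if value > 0 else 0)
        # Left-align the last byte when the dimension is not a multiple of 8.
        byte <<= 8 - len(chunk)
        packed.append(byte)
    return bytes(packed)


def hamming_distance(a, b):
    """Number of differing bits between two packed vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical 16-dimensional embeddings (real models use far more dimensions).
doc_a = [0.12, -0.40, 0.33, 0.05, -0.22, 0.91, -0.08, 0.47,
         -0.15, 0.27, 0.66, -0.31, 0.02, -0.74, 0.19, -0.55]
doc_b = [-v for v in doc_a]      # points in the opposite direction
query = [-0.12] + doc_a[1:]      # differs from doc_a in a single sign

query_bits = binarize(query)
dist_a = hamming_distance(query_bits, binarize(doc_a))
dist_b = hamming_distance(query_bits, binarize(doc_b))
print(len(binarize(doc_a)), dist_a, dist_b)
```

At 4 bytes per float32 dimension versus 1 bit per dimension, the packed form is 32 times smaller. A common pattern is to retrieve candidates with the binary vectors and rescore the top hits with the full-precision ones.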
#### How do I offer hybrid search to improve accuracy?
The MIRACL paper shows that simply combining dense retrieval with BM25 is a good starting point for a hybrid option.
The numbers below use the mDPR model, but miniMiracle_te_v1 should give even better hybrid performance.
| Language | ISO | nDCG@10 BM25 | nDCG@10 mDPR | nDCG@10 Hybrid |
|-----------|-----|--------------|--------------|----------------|
| **Telugu** | **te** | **0.383** | **0.356** | **0.602** |
*Note: The MIRACL paper reports a different (higher) BM25 value for Telugu, so we take that value from the BGE-M3 paper; all other values are from the MIRACL paper.*
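
One simple way to build such a hybrid is to normalise the BM25 and dense scores per query and interpolate them linearly. The sketch below assumes min-max normalisation and an equal weight (`alpha=0.5`); both are illustrative choices rather than the exact settings behind the table, and the function names, document ids and scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(bm25_scores, dense_scores, alpha=0.5):
    """Interpolate normalised sparse and dense scores.

    A document returned by only one retriever contributes 0 for the other.
    """
    sparse = min_max_normalize(bm25_scores)
    dense = min_max_normalize(dense_scores)
    doc_ids = set(sparse) | set(dense)
    return {
        d: alpha * sparse.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
        for d in doc_ids
    }


# Hypothetical top hits for a single query from each retriever.
bm25 = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
dense = {"d2": 0.82, "d3": 0.80, "d4": 0.42}

fused = hybrid_scores(bm25, dense)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Here `d2` ends up first because both retrievers rank it highly, which is the behaviour a hybrid setup is meant to reward. `alpha` is worth tuning on a held-out set.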
#### Why not run MTEB?
MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but it is currently available only for English, Chinese, French and a few other languages, not for Indic languages. Besides, like BGE-M3, miniMiracle models are predominantly tuned for retrieval tasks aimed at search and IR use cases.
At the moment MIRACL is the gold standard for a subset of Indic languages.
# Roadmap
We will add miniMiracle models for all popular languages in phases, as we see fit or based on community requests. Some of the languages on our list are:
- Spanish
- Tamil
- Arabic
- German
- English ?
# Notes on reproducing:
We welcome anyone to reproduce our results. Here are some tips and observations:
- Use CLS Pooling and Inner Product.
- There *may be* minor differences in the numbers when reproducing. For instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi, whereas we observed only 58.9.
Here are our numbers for the full Hindi run on BGE-M3:
```python
{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
```
Fair warning: BGE-M3 is expensive ($) to evaluate, which is probably why it is not part of any retrieval slice of the MTEB benchmarks.
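
To make the "CLS Pooling and Inner Product" tip concrete, here is a minimal sketch on toy data. It assumes the encoder output is shaped `[batch][tokens][hidden_dim]`; the nested lists stand in for real model outputs and the helper names are hypothetical. With Hugging Face Transformers the equivalent pooling step is `outputs.last_hidden_state[:, 0]`.

```python
def cls_pool(last_hidden_state):
    """Keep only the first token's vector ([CLS]) of each sequence."""
    return [sequence[0] for sequence in last_hidden_state]


def inner_product(u, v):
    """Plain dot product, with no length normalisation."""
    return sum(a * b for a, b in zip(u, v))


# Toy encoder outputs: 1 query and 2 passages, 2 tokens each, hidden size 3.
query_states = [[[1.0, 0.0, 2.0], [0.3, 0.3, 0.3]]]
passage_states = [
    [[0.5, 1.0, 1.5], [9.0, 9.0, 9.0]],
    [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]],
]

query_vec = cls_pool(query_states)[0]
passage_vecs = cls_pool(passage_states)

# Score every passage against the query; higher means more relevant.
scores = [inner_product(query_vec, p) for p in passage_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

Note that the non-CLS token vectors are ignored entirely, which is the difference from mean pooling.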
# Reference:
- [All Cohere numbers are copied from here](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12)
- [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/pdf/2402.03216.pdf)
- [Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages](https://arxiv.org/pdf/2210.09984.pdf)
- [IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages](https://arxiv.org/pdf/2312.09508)
# Note on model bias:
- Like any model, this model might carry inherent biases from the base models and from the datasets it was pretrained and finetuned on. Please use it responsibly.
# How to cite?
Damodaran, P. (2024). MiniDense: Family of Low footprint multilingual retrievers for search and RAG pipelines (Version 1.0.0) [Computer software].