--- pipeline_tag: sentence-similarity tags: - sentence-transformers - feature-extraction - sentence-similarity - transformers - mteb model-index: - name: mmlw-e5-large results: - task: type: Clustering dataset: type: PL-MTEB/8tags-clustering name: MTEB 8TagsClustering config: default split: test revision: None metrics: - type: v_measure value: 30.623921415441725 - task: type: Classification dataset: type: PL-MTEB/allegro-reviews name: MTEB AllegroReviews config: default split: test revision: None metrics: - type: accuracy value: 37.683896620278325 - type: f1 value: 34.19193027014284 - task: type: Retrieval dataset: type: arguana-pl name: MTEB ArguAna-PL config: default split: test revision: None metrics: - type: map_at_1 value: 38.407000000000004 - type: map_at_10 value: 55.147 - type: map_at_100 value: 55.757 - type: map_at_1000 value: 55.761 - type: map_at_3 value: 51.268 - type: map_at_5 value: 53.696999999999996 - type: mrr_at_1 value: 40.043 - type: mrr_at_10 value: 55.840999999999994 - type: mrr_at_100 value: 56.459 - type: mrr_at_1000 value: 56.462999999999994 - type: mrr_at_3 value: 52.074 - type: mrr_at_5 value: 54.364999999999995 - type: ndcg_at_1 value: 38.407000000000004 - type: ndcg_at_10 value: 63.248000000000005 - type: ndcg_at_100 value: 65.717 - type: ndcg_at_1000 value: 65.79 - type: ndcg_at_3 value: 55.403999999999996 - type: ndcg_at_5 value: 59.760000000000005 - type: precision_at_1 value: 38.407000000000004 - type: precision_at_10 value: 8.862 - type: precision_at_100 value: 0.991 - type: precision_at_1000 value: 0.1 - type: precision_at_3 value: 22.451 - type: precision_at_5 value: 15.576 - type: recall_at_1 value: 38.407000000000004 - type: recall_at_10 value: 88.62 - type: recall_at_100 value: 99.075 - type: recall_at_1000 value: 99.57300000000001 - type: recall_at_3 value: 67.354 - type: recall_at_5 value: 77.881 - task: type: Classification dataset: type: PL-MTEB/cbd name: MTEB CBD config: default split: test revision: None metrics: - 
type: accuracy value: 66.14999999999999 - type: ap value: 21.69513674684204 - type: f1 value: 56.48142830893528 - task: type: PairClassification dataset: type: PL-MTEB/cdsce-pairclassification name: MTEB CDSC-E config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 89.4 - type: cos_sim_ap value: 76.83228768203222 - type: cos_sim_f1 value: 65.3658536585366 - type: cos_sim_precision value: 60.909090909090914 - type: cos_sim_recall value: 70.52631578947368 - type: dot_accuracy value: 84.1 - type: dot_ap value: 57.26072201751864 - type: dot_f1 value: 62.75395033860045 - type: dot_precision value: 54.9407114624506 - type: dot_recall value: 73.15789473684211 - type: euclidean_accuracy value: 89.4 - type: euclidean_ap value: 76.59095263388942 - type: euclidean_f1 value: 65.21739130434783 - type: euclidean_precision value: 60.26785714285714 - type: euclidean_recall value: 71.05263157894737 - type: manhattan_accuracy value: 89.4 - type: manhattan_ap value: 76.58825999753456 - type: manhattan_f1 value: 64.72019464720195 - type: manhattan_precision value: 60.18099547511312 - type: manhattan_recall value: 70.0 - type: max_accuracy value: 89.4 - type: max_ap value: 76.83228768203222 - type: max_f1 value: 65.3658536585366 - task: type: STS dataset: type: PL-MTEB/cdscr-sts name: MTEB CDSC-R config: default split: test revision: None metrics: - type: cos_sim_pearson value: 93.73949495291659 - type: cos_sim_spearman value: 93.50397366192922 - type: euclidean_pearson value: 92.47498888987636 - type: euclidean_spearman value: 93.39315936230747 - type: manhattan_pearson value: 92.47250250777654 - type: manhattan_spearman value: 93.36739690549109 - task: type: Retrieval dataset: type: dbpedia-pl name: MTEB DBPedia-PL config: default split: test revision: None metrics: - type: map_at_1 value: 8.434 - type: map_at_10 value: 18.424 - type: map_at_100 value: 26.428 - type: map_at_1000 value: 28.002 - type: map_at_3 value: 13.502 - type: map_at_5 value: 15.577 - 
type: mrr_at_1 value: 63.0 - type: mrr_at_10 value: 72.714 - type: mrr_at_100 value: 73.021 - type: mrr_at_1000 value: 73.028 - type: mrr_at_3 value: 70.75 - type: mrr_at_5 value: 72.3 - type: ndcg_at_1 value: 52.75 - type: ndcg_at_10 value: 39.839999999999996 - type: ndcg_at_100 value: 44.989000000000004 - type: ndcg_at_1000 value: 52.532999999999994 - type: ndcg_at_3 value: 45.198 - type: ndcg_at_5 value: 42.015 - type: precision_at_1 value: 63.0 - type: precision_at_10 value: 31.05 - type: precision_at_100 value: 10.26 - type: precision_at_1000 value: 1.9879999999999998 - type: precision_at_3 value: 48.25 - type: precision_at_5 value: 40.45 - type: recall_at_1 value: 8.434 - type: recall_at_10 value: 24.004 - type: recall_at_100 value: 51.428 - type: recall_at_1000 value: 75.712 - type: recall_at_3 value: 15.015 - type: recall_at_5 value: 18.282999999999998 - task: type: Retrieval dataset: type: fiqa-pl name: MTEB FiQA-PL config: default split: test revision: None metrics: - type: map_at_1 value: 19.088 - type: map_at_10 value: 31.818 - type: map_at_100 value: 33.689 - type: map_at_1000 value: 33.86 - type: map_at_3 value: 27.399 - type: map_at_5 value: 29.945 - type: mrr_at_1 value: 38.117000000000004 - type: mrr_at_10 value: 47.668 - type: mrr_at_100 value: 48.428 - type: mrr_at_1000 value: 48.475 - type: mrr_at_3 value: 45.242 - type: mrr_at_5 value: 46.716 - type: ndcg_at_1 value: 38.272 - type: ndcg_at_10 value: 39.903 - type: ndcg_at_100 value: 46.661 - type: ndcg_at_1000 value: 49.625 - type: ndcg_at_3 value: 35.921 - type: ndcg_at_5 value: 37.558 - type: precision_at_1 value: 38.272 - type: precision_at_10 value: 11.358 - type: precision_at_100 value: 1.8190000000000002 - type: precision_at_1000 value: 0.23500000000000001 - type: precision_at_3 value: 24.434 - type: precision_at_5 value: 18.395 - type: recall_at_1 value: 19.088 - type: recall_at_10 value: 47.355999999999995 - type: recall_at_100 value: 72.451 - type: recall_at_1000 value: 90.257 - type: 
recall_at_3 value: 32.931 - type: recall_at_5 value: 39.878 - task: type: Retrieval dataset: type: hotpotqa-pl name: MTEB HotpotQA-PL config: default split: test revision: None metrics: - type: map_at_1 value: 39.095 - type: map_at_10 value: 62.529 - type: map_at_100 value: 63.425 - type: map_at_1000 value: 63.483000000000004 - type: map_at_3 value: 58.887 - type: map_at_5 value: 61.18599999999999 - type: mrr_at_1 value: 78.123 - type: mrr_at_10 value: 84.231 - type: mrr_at_100 value: 84.408 - type: mrr_at_1000 value: 84.414 - type: mrr_at_3 value: 83.286 - type: mrr_at_5 value: 83.94 - type: ndcg_at_1 value: 78.19 - type: ndcg_at_10 value: 70.938 - type: ndcg_at_100 value: 73.992 - type: ndcg_at_1000 value: 75.1 - type: ndcg_at_3 value: 65.863 - type: ndcg_at_5 value: 68.755 - type: precision_at_1 value: 78.19 - type: precision_at_10 value: 14.949000000000002 - type: precision_at_100 value: 1.733 - type: precision_at_1000 value: 0.188 - type: precision_at_3 value: 42.381 - type: precision_at_5 value: 27.711000000000002 - type: recall_at_1 value: 39.095 - type: recall_at_10 value: 74.747 - type: recall_at_100 value: 86.631 - type: recall_at_1000 value: 93.923 - type: recall_at_3 value: 63.571999999999996 - type: recall_at_5 value: 69.27799999999999 - task: type: Retrieval dataset: type: msmarco-pl name: MTEB MSMARCO-PL config: default split: validation revision: None metrics: - type: map_at_1 value: 19.439999999999998 - type: map_at_10 value: 30.264000000000003 - type: map_at_100 value: 31.438 - type: map_at_1000 value: 31.495 - type: map_at_3 value: 26.735 - type: map_at_5 value: 28.716 - type: mrr_at_1 value: 19.914 - type: mrr_at_10 value: 30.753999999999998 - type: mrr_at_100 value: 31.877 - type: mrr_at_1000 value: 31.929000000000002 - type: mrr_at_3 value: 27.299 - type: mrr_at_5 value: 29.254 - type: ndcg_at_1 value: 20.014000000000003 - type: ndcg_at_10 value: 36.472 - type: ndcg_at_100 value: 42.231 - type: ndcg_at_1000 value: 43.744 - type: ndcg_at_3 
value: 29.268 - type: ndcg_at_5 value: 32.79 - type: precision_at_1 value: 20.014000000000003 - type: precision_at_10 value: 5.814 - type: precision_at_100 value: 0.8710000000000001 - type: precision_at_1000 value: 0.1 - type: precision_at_3 value: 12.426 - type: precision_at_5 value: 9.238 - type: recall_at_1 value: 19.439999999999998 - type: recall_at_10 value: 55.535000000000004 - type: recall_at_100 value: 82.44399999999999 - type: recall_at_1000 value: 94.217 - type: recall_at_3 value: 35.963 - type: recall_at_5 value: 44.367000000000004 - task: type: Classification dataset: type: mteb/amazon_massive_intent name: MTEB MassiveIntentClassification (pl) config: pl split: test revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 metrics: - type: accuracy value: 72.01412239408205 - type: f1 value: 70.04544187503352 - task: type: Classification dataset: type: mteb/amazon_massive_scenario name: MTEB MassiveScenarioClassification (pl) config: pl split: test revision: 7d571f92784cd94a019292a1f45445077d0ef634 metrics: - type: accuracy value: 75.26899798251513 - type: f1 value: 75.55876166863844 - task: type: Retrieval dataset: type: nfcorpus-pl name: MTEB NFCorpus-PL config: default split: test revision: None metrics: - type: map_at_1 value: 5.772 - type: map_at_10 value: 12.708 - type: map_at_100 value: 16.194 - type: map_at_1000 value: 17.630000000000003 - type: map_at_3 value: 9.34 - type: map_at_5 value: 10.741 - type: mrr_at_1 value: 43.344 - type: mrr_at_10 value: 53.429 - type: mrr_at_100 value: 53.88699999999999 - type: mrr_at_1000 value: 53.925 - type: mrr_at_3 value: 51.342 - type: mrr_at_5 value: 52.456 - type: ndcg_at_1 value: 41.641 - type: ndcg_at_10 value: 34.028000000000006 - type: ndcg_at_100 value: 31.613000000000003 - type: ndcg_at_1000 value: 40.428 - type: ndcg_at_3 value: 38.991 - type: ndcg_at_5 value: 36.704 - type: precision_at_1 value: 43.034 - type: precision_at_10 value: 25.324999999999996 - type: precision_at_100 value: 7.889 - type: 
precision_at_1000 value: 2.069 - type: precision_at_3 value: 36.739 - type: precision_at_5 value: 32.074000000000005 - type: recall_at_1 value: 5.772 - type: recall_at_10 value: 16.827 - type: recall_at_100 value: 32.346000000000004 - type: recall_at_1000 value: 62.739 - type: recall_at_3 value: 10.56 - type: recall_at_5 value: 12.655 - task: type: Retrieval dataset: type: nq-pl name: MTEB NQ-PL config: default split: test revision: None metrics: - type: map_at_1 value: 26.101000000000003 - type: map_at_10 value: 39.912 - type: map_at_100 value: 41.037 - type: map_at_1000 value: 41.077000000000005 - type: map_at_3 value: 35.691 - type: map_at_5 value: 38.155 - type: mrr_at_1 value: 29.403000000000002 - type: mrr_at_10 value: 42.376999999999995 - type: mrr_at_100 value: 43.248999999999995 - type: mrr_at_1000 value: 43.277 - type: mrr_at_3 value: 38.794000000000004 - type: mrr_at_5 value: 40.933 - type: ndcg_at_1 value: 29.519000000000002 - type: ndcg_at_10 value: 47.33 - type: ndcg_at_100 value: 52.171 - type: ndcg_at_1000 value: 53.125 - type: ndcg_at_3 value: 39.316 - type: ndcg_at_5 value: 43.457 - type: precision_at_1 value: 29.519000000000002 - type: precision_at_10 value: 8.03 - type: precision_at_100 value: 1.075 - type: precision_at_1000 value: 0.117 - type: precision_at_3 value: 18.009 - type: precision_at_5 value: 13.221 - type: recall_at_1 value: 26.101000000000003 - type: recall_at_10 value: 67.50399999999999 - type: recall_at_100 value: 88.64699999999999 - type: recall_at_1000 value: 95.771 - type: recall_at_3 value: 46.669 - type: recall_at_5 value: 56.24 - task: type: Classification dataset: type: laugustyniak/abusive-clauses-pl name: MTEB PAC config: default split: test revision: None metrics: - type: accuracy value: 63.76773819866782 - type: ap value: 74.87896817642536 - type: f1 value: 61.420506092721425 - task: type: PairClassification dataset: type: PL-MTEB/ppc-pairclassification name: MTEB PPC config: default split: test revision: None metrics: 
- type: cos_sim_accuracy value: 82.1 - type: cos_sim_ap value: 91.09417013497443 - type: cos_sim_f1 value: 84.78437754271766 - type: cos_sim_precision value: 83.36 - type: cos_sim_recall value: 86.25827814569537 - type: dot_accuracy value: 75.9 - type: dot_ap value: 86.82680649789796 - type: dot_f1 value: 80.5379746835443 - type: dot_precision value: 77.12121212121212 - type: dot_recall value: 84.27152317880795 - type: euclidean_accuracy value: 81.6 - type: euclidean_ap value: 90.81248760600693 - type: euclidean_f1 value: 84.35374149659863 - type: euclidean_precision value: 86.7132867132867 - type: euclidean_recall value: 82.11920529801324 - type: manhattan_accuracy value: 81.6 - type: manhattan_ap value: 90.81272803548767 - type: manhattan_f1 value: 84.33530906011855 - type: manhattan_precision value: 86.30849220103987 - type: manhattan_recall value: 82.45033112582782 - type: max_accuracy value: 82.1 - type: max_ap value: 91.09417013497443 - type: max_f1 value: 84.78437754271766 - task: type: PairClassification dataset: type: PL-MTEB/psc-pairclassification name: MTEB PSC config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 98.05194805194806 - type: cos_sim_ap value: 99.52709687103496 - type: cos_sim_f1 value: 96.83257918552036 - type: cos_sim_precision value: 95.82089552238806 - type: cos_sim_recall value: 97.86585365853658 - type: dot_accuracy value: 92.30055658627087 - type: dot_ap value: 94.12759311032353 - type: dot_f1 value: 87.00906344410878 - type: dot_precision value: 86.22754491017965 - type: dot_recall value: 87.8048780487805 - type: euclidean_accuracy value: 98.05194805194806 - type: euclidean_ap value: 99.49402675624125 - type: euclidean_f1 value: 96.8133535660091 - type: euclidean_precision value: 96.37462235649546 - type: euclidean_recall value: 97.2560975609756 - type: manhattan_accuracy value: 98.05194805194806 - type: manhattan_ap value: 99.50120505935962 - type: manhattan_f1 value: 96.8133535660091 - type: 
manhattan_precision value: 96.37462235649546 - type: manhattan_recall value: 97.2560975609756 - type: max_accuracy value: 98.05194805194806 - type: max_ap value: 99.52709687103496 - type: max_f1 value: 96.83257918552036 - task: type: Classification dataset: type: PL-MTEB/polemo2_in name: MTEB PolEmo2.0-IN config: default split: test revision: None metrics: - type: accuracy value: 69.45983379501385 - type: f1 value: 68.60917948426784 - task: type: Classification dataset: type: PL-MTEB/polemo2_out name: MTEB PolEmo2.0-OUT config: default split: test revision: None metrics: - type: accuracy value: 43.13765182186235 - type: f1 value: 36.15557441785656 - task: type: Retrieval dataset: type: quora-pl name: MTEB Quora-PL config: default split: test revision: None metrics: - type: map_at_1 value: 67.448 - type: map_at_10 value: 81.566 - type: map_at_100 value: 82.284 - type: map_at_1000 value: 82.301 - type: map_at_3 value: 78.425 - type: map_at_5 value: 80.43400000000001 - type: mrr_at_1 value: 77.61 - type: mrr_at_10 value: 84.467 - type: mrr_at_100 value: 84.63199999999999 - type: mrr_at_1000 value: 84.634 - type: mrr_at_3 value: 83.288 - type: mrr_at_5 value: 84.095 - type: ndcg_at_1 value: 77.66 - type: ndcg_at_10 value: 85.63199999999999 - type: ndcg_at_100 value: 87.166 - type: ndcg_at_1000 value: 87.306 - type: ndcg_at_3 value: 82.32300000000001 - type: ndcg_at_5 value: 84.22 - type: precision_at_1 value: 77.66 - type: precision_at_10 value: 13.136000000000001 - type: precision_at_100 value: 1.522 - type: precision_at_1000 value: 0.156 - type: precision_at_3 value: 36.153 - type: precision_at_5 value: 23.982 - type: recall_at_1 value: 67.448 - type: recall_at_10 value: 93.83200000000001 - type: recall_at_100 value: 99.212 - type: recall_at_1000 value: 99.94 - type: recall_at_3 value: 84.539 - type: recall_at_5 value: 89.71000000000001 - task: type: Retrieval dataset: type: scidocs-pl name: MTEB SCIDOCS-PL config: default split: test revision: None metrics: - type: 
map_at_1 value: 4.393 - type: map_at_10 value: 11.472 - type: map_at_100 value: 13.584999999999999 - type: map_at_1000 value: 13.918 - type: map_at_3 value: 8.212 - type: map_at_5 value: 9.864 - type: mrr_at_1 value: 21.7 - type: mrr_at_10 value: 32.268 - type: mrr_at_100 value: 33.495000000000005 - type: mrr_at_1000 value: 33.548 - type: mrr_at_3 value: 29.15 - type: mrr_at_5 value: 30.91 - type: ndcg_at_1 value: 21.6 - type: ndcg_at_10 value: 19.126 - type: ndcg_at_100 value: 27.496 - type: ndcg_at_1000 value: 33.274 - type: ndcg_at_3 value: 18.196 - type: ndcg_at_5 value: 15.945 - type: precision_at_1 value: 21.6 - type: precision_at_10 value: 9.94 - type: precision_at_100 value: 2.1999999999999997 - type: precision_at_1000 value: 0.359 - type: precision_at_3 value: 17.2 - type: precision_at_5 value: 14.12 - type: recall_at_1 value: 4.393 - type: recall_at_10 value: 20.166999999999998 - type: recall_at_100 value: 44.678000000000004 - type: recall_at_1000 value: 72.868 - type: recall_at_3 value: 10.473 - type: recall_at_5 value: 14.313 - task: type: PairClassification dataset: type: PL-MTEB/sicke-pl-pairclassification name: MTEB SICK-E-PL config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 82.65389319200979 - type: cos_sim_ap value: 76.13749398520014 - type: cos_sim_f1 value: 66.64355062413314 - type: cos_sim_precision value: 64.93243243243244 - type: cos_sim_recall value: 68.44729344729345 - type: dot_accuracy value: 76.0905014268243 - type: dot_ap value: 58.058968583382494 - type: dot_f1 value: 61.181080324657145 - type: dot_precision value: 50.391885661595204 - type: dot_recall value: 77.84900284900284 - type: euclidean_accuracy value: 82.61312678353036 - type: euclidean_ap value: 76.10290283033221 - type: euclidean_f1 value: 66.50782845473111 - type: euclidean_precision value: 63.6897001303781 - type: euclidean_recall value: 69.58689458689459 - type: manhattan_accuracy value: 82.6742763962495 - type: manhattan_ap value: 
76.12712309700966 - type: manhattan_f1 value: 66.59700452803902 - type: manhattan_precision value: 65.16700749829583 - type: manhattan_recall value: 68.09116809116809 - type: max_accuracy value: 82.6742763962495 - type: max_ap value: 76.13749398520014 - type: max_f1 value: 66.64355062413314 - task: type: STS dataset: type: PL-MTEB/sickr-pl-sts name: MTEB SICK-R-PL config: default split: test revision: None metrics: - type: cos_sim_pearson value: 81.23898481255246 - type: cos_sim_spearman value: 76.0416957474899 - type: euclidean_pearson value: 78.96475496102107 - type: euclidean_spearman value: 76.07208683063504 - type: manhattan_pearson value: 78.92666424673251 - type: manhattan_spearman value: 76.04968227583831 - task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (pl) config: pl split: test revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 metrics: - type: cos_sim_pearson value: 39.13987124398541 - type: cos_sim_spearman value: 40.40194528288759 - type: euclidean_pearson value: 29.14566247168167 - type: euclidean_spearman value: 39.97389932591777 - type: manhattan_pearson value: 29.172993134388935 - type: manhattan_spearman value: 39.85681935287037 - task: type: Retrieval dataset: type: scifact-pl name: MTEB SciFact-PL config: default split: test revision: None metrics: - type: map_at_1 value: 57.260999999999996 - type: map_at_10 value: 66.92399999999999 - type: map_at_100 value: 67.443 - type: map_at_1000 value: 67.47800000000001 - type: map_at_3 value: 64.859 - type: map_at_5 value: 65.71900000000001 - type: mrr_at_1 value: 60.333000000000006 - type: mrr_at_10 value: 67.95400000000001 - type: mrr_at_100 value: 68.42 - type: mrr_at_1000 value: 68.45 - type: mrr_at_3 value: 66.444 - type: mrr_at_5 value: 67.128 - type: ndcg_at_1 value: 60.333000000000006 - type: ndcg_at_10 value: 71.209 - type: ndcg_at_100 value: 73.37 - type: ndcg_at_1000 value: 74.287 - type: ndcg_at_3 value: 67.66799999999999 - type: ndcg_at_5 value: 68.644 - type: 
precision_at_1 value: 60.333000000000006 - type: precision_at_10 value: 9.467 - type: precision_at_100 value: 1.053 - type: precision_at_1000 value: 0.11299999999999999 - type: precision_at_3 value: 26.778000000000002 - type: precision_at_5 value: 16.933 - type: recall_at_1 value: 57.260999999999996 - type: recall_at_10 value: 83.256 - type: recall_at_100 value: 92.767 - type: recall_at_1000 value: 100.0 - type: recall_at_3 value: 72.933 - type: recall_at_5 value: 75.744 - task: type: Retrieval dataset: type: trec-covid-pl name: MTEB TRECCOVID-PL config: default split: test revision: None metrics: - type: map_at_1 value: 0.22 - type: map_at_10 value: 1.693 - type: map_at_100 value: 9.281 - type: map_at_1000 value: 21.462999999999997 - type: map_at_3 value: 0.609 - type: map_at_5 value: 0.9570000000000001 - type: mrr_at_1 value: 80.0 - type: mrr_at_10 value: 88.73299999999999 - type: mrr_at_100 value: 88.73299999999999 - type: mrr_at_1000 value: 88.73299999999999 - type: mrr_at_3 value: 88.333 - type: mrr_at_5 value: 88.73299999999999 - type: ndcg_at_1 value: 79.0 - type: ndcg_at_10 value: 71.177 - type: ndcg_at_100 value: 52.479 - type: ndcg_at_1000 value: 45.333 - type: ndcg_at_3 value: 77.48 - type: ndcg_at_5 value: 76.137 - type: precision_at_1 value: 82.0 - type: precision_at_10 value: 74.0 - type: precision_at_100 value: 53.68000000000001 - type: precision_at_1000 value: 19.954 - type: precision_at_3 value: 80.667 - type: precision_at_5 value: 80.80000000000001 - type: recall_at_1 value: 0.22 - type: recall_at_10 value: 1.934 - type: recall_at_100 value: 12.728 - type: recall_at_1000 value: 41.869 - type: recall_at_3 value: 0.637 - type: recall_at_5 value: 1.042 language: pl license: apache-2.0 widget: - source_sentence: "query: Jak dożyć 100 lat?" sentences: - "passage: Trzeba zdrowo się odżywiać i uprawiać sport." - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami." 
- "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu." ---

# MMLW-e5-large

MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") is a family of neural text encoders for Polish. This is a distilled model that generates embeddings suitable for tasks such as semantic similarity, clustering, and information retrieval. It can also serve as a base for further fine-tuning. It maps texts to 1024-dimensional vectors.

The model was initialized from a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We used [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.

## Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "** ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Prefixes required by this model for queries and passages
query_prefix = "query: "
answer_prefix = "passage: "

queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = SentenceTransformer("sdadas/mmlw-e5-large")

# Encode the query and the candidate passages into embedding tensors
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the passage with the highest cosine similarity to the query
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```

## Evaluation Results

- The model achieves an **Average Score** of **61.17** on the Polish Massive Text Embedding Benchmark (MTEB). See [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for detailed results.
- The model achieves **NDCG@10** of **56.09** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Acknowledgements

This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.

## Citation

```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
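
Returning to the usage example: the two steps it relies on, adding the `"query: "` / `"passage: "` prefixes and ranking passages by cosine similarity, can be sketched without downloading the model. The helper names (`with_prefix`, `cosine`, `rank_passages`) are hypothetical and not part of any library, and the 3-dimensional vectors are invented stand-ins for the model's 1024-dimensional embeddings, so this is a minimal illustration rather than the model's actual behaviour.

```python
import math

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(prefix, texts):
    """Prepend the prefix to each text, leaving already-prefixed texts unchanged."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


queries = with_prefix(QUERY_PREFIX, ["Jak dożyć 100 lat?"])
passages = with_prefix(PASSAGE_PREFIX, [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
])

# Toy vectors invented for illustration only; real embeddings come from model.encode().
toy_query_vec = [1.0, 0.2, 0.0]
toy_passage_vecs = [[0.9, 0.3, 0.1], [0.1, 0.8, 0.6]]

order = rank_passages(toy_query_vec, toy_passage_vecs)
print(passages[order[0]])
```

With real embeddings, `toy_query_vec` and `toy_passage_vecs` would be replaced by the outputs of `model.encode(...)` on the prefixed texts; the ranking logic stays the same.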