pawasthy committed on
Commit
e28c32b
1 Parent(s): ead3ddb

Update README.md

  1. README.md +852 -3
README.md CHANGED
@@ -1,3 +1,852 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ - ar
5
+ - cs
6
+ - de
7
+ - es
8
+ - fr
9
+ - it
10
+ - ja
11
+ - ko
12
+ - nl
13
+ - pt
14
+ - zh
15
+ license: apache-2.0
16
+ library_name: transformers
17
+ tags:
18
+ - language
19
+ - granite
20
+ - embeddings
21
+ - multilingual
22
+ model-index:
23
+ - name: ibm-granite/granite-embedding-278m-multilingual
24
+ results:
25
+ - dataset:
26
+ type: miracl/mmteb-miracl
27
+ name: Miracl (en)
28
+ config: en
29
+ split: dev
30
+ task:
31
+ type: Retrieval
32
+ metrics:
33
+ - type: ndcg_at_1
34
+ value: 0.45557
35
+ - type: ndcg_at_10
36
+ value: 0.49372
37
+ - type: ndcg_at_100
38
+ value: 0.5728
39
+ - type: ndcg_at_1000
40
+ value: 0.59187
41
+ - type: ndcg_at_20
42
+ value: 0.52863
43
+ - type: ndcg_at_3
44
+ value: 0.43969
45
+ - type: ndcg_at_5
46
+ value: 0.45551
47
+ - type: recall_at_1
48
+ value: 0.21785
49
+ - type: recall_at_10
50
+ value: 0.59513
51
+ - type: recall_at_100
52
+ value: 0.85785
53
+ - type: recall_at_1000
54
+ value: 0.96041
55
+ - type: recall_at_20
56
+ value: 0.69357
57
+ - type: recall_at_3
58
+ value: 0.40403
59
+ - type: recall_at_5
60
+ value: 0.48499
61
+ - dataset:
62
+ type: miracl/mmteb-miracl
63
+ name: Miracl (ar)
64
+ config: ar
65
+ split: dev
66
+ task:
67
+ type: Retrieval
68
+ metrics:
69
+ - type: ndcg_at_1
70
+ value: 0.57459
71
+ - type: ndcg_at_10
72
+ value: 0.64238
73
+ - type: ndcg_at_100
74
+ value: 0.6867
75
+ - type: ndcg_at_1000
76
+ value: 0.6951
77
+ - type: ndcg_at_20
78
+ value: 0.66455
79
+ - type: ndcg_at_3
80
+ value: 0.58162
81
+ - type: ndcg_at_5
82
+ value: 0.60831
83
+ - type: recall_at_1
84
+ value: 0.38064
85
+ - type: recall_at_10
86
+ value: 0.75098
87
+ - type: recall_at_100
88
+ value: 0.91203
89
+ - type: recall_at_1000
90
+ value: 0.96706
91
+ - type: recall_at_20
92
+ value: 0.81978
93
+ - type: recall_at_3
94
+ value: 0.58618
95
+ - type: recall_at_5
96
+ value: 0.66353
97
+ - dataset:
98
+ type: miracl/mmteb-miracl
99
+ name: Miracl (bn)
100
+ config: bn
101
+ split: dev
102
+ task:
103
+ type: Retrieval
104
+ metrics:
105
+ - type: ndcg_at_1
106
+ value: 0.60341
107
+ - type: ndcg_at_10
108
+ value: 0.68055
109
+ - type: ndcg_at_100
110
+ value: 0.72008
111
+ - type: ndcg_at_1000
112
+ value: 0.72716
113
+ - type: ndcg_at_20
114
+ value: 0.69914
115
+ - type: ndcg_at_3
116
+ value: 0.60805
117
+ - type: ndcg_at_5
118
+ value: 0.64486
119
+ - type: recall_at_1
120
+ value: 0.37948
121
+ - type: recall_at_10
122
+ value: 0.80609
123
+ - type: recall_at_100
124
+ value: 0.94305
125
+ - type: recall_at_1000
126
+ value: 0.98625
127
+ - type: recall_at_20
128
+ value: 0.86141
129
+ - type: recall_at_3
130
+ value: 0.61095
131
+ - type: recall_at_5
132
+ value: 0.71316
133
+ - dataset:
134
+ type: miracl/mmteb-miracl
135
+ name: Miracl (de)
136
+ config: de
137
+ split: dev
138
+ task:
139
+ type: Retrieval
140
+ metrics:
141
+ - type: ndcg_at_1
142
+ value: 0.45574
143
+ - type: ndcg_at_10
144
+ value: 0.48123
145
+ - type: ndcg_at_100
146
+ value: 0.56049
147
+ - type: ndcg_at_1000
148
+ value: 0.57979
149
+ - type: ndcg_at_20
150
+ value: 0.51785
151
+ - type: ndcg_at_3
152
+ value: 0.41243
153
+ - type: ndcg_at_5
154
+ value: 0.4386
155
+ - type: recall_at_1
156
+ value: 0.20401
157
+ - type: recall_at_10
158
+ value: 0.58779
159
+ - type: recall_at_100
160
+ value: 0.8584
161
+ - type: recall_at_1000
162
+ value: 0.97364
163
+ - type: recall_at_20
164
+ value: 0.69061
165
+ - type: recall_at_3
166
+ value: 0.36573
167
+ - type: recall_at_5
168
+ value: 0.47495
169
+ - dataset:
170
+ type: miracl/mmteb-miracl
171
+ name: Miracl (es)
172
+ config: es
173
+ split: dev
174
+ task:
175
+ type: Retrieval
176
+ metrics:
177
+ - type: ndcg_at_1
178
+ value: 0.5571
179
+ - type: ndcg_at_10
180
+ value: 0.49688
181
+ - type: ndcg_at_100
182
+ value: 0.60493
183
+ - type: ndcg_at_1000
184
+ value: 0.62922
185
+ - type: ndcg_at_20
186
+ value: 0.54438
187
+ - type: ndcg_at_3
188
+ value: 0.47981
189
+ - type: ndcg_at_5
190
+ value: 0.46584
191
+ - type: recall_at_1
192
+ value: 0.1638
193
+ - type: recall_at_10
194
+ value: 0.54155
195
+ - type: recall_at_100
196
+ value: 0.85136
197
+ - type: recall_at_1000
198
+ value: 0.96951
199
+ - type: recall_at_20
200
+ value: 0.65329
201
+ - type: recall_at_3
202
+ value: 0.31503
203
+ - type: recall_at_5
204
+ value: 0.40356
205
+ - dataset:
206
+ type: miracl/mmteb-miracl
207
+ name: Miracl (fa)
208
+ config: fa
209
+ split: dev
210
+ task:
211
+ type: Retrieval
212
+ metrics:
213
+ - type: ndcg_at_1
214
+ value: 0.39873
215
+ - type: ndcg_at_10
216
+ value: 0.50226
217
+ - type: ndcg_at_100
218
+ value: 0.56517
219
+ - type: ndcg_at_1000
220
+ value: 0.57967
221
+ - type: ndcg_at_20
222
+ value: 0.5292
223
+ - type: ndcg_at_3
224
+ value: 0.42738
225
+ - type: ndcg_at_5
226
+ value: 0.45843
227
+ - type: recall_at_1
228
+ value: 0.25369
229
+ - type: recall_at_10
230
+ value: 0.63776
231
+ - type: recall_at_100
232
+ value: 0.87686
233
+ - type: recall_at_1000
234
+ value: 0.9671
235
+ - type: recall_at_20
236
+ value: 0.72099
237
+ - type: recall_at_3
238
+ value: 0.43808
239
+ - type: recall_at_5
240
+ value: 0.52378
241
+ - dataset:
242
+ type: miracl/mmteb-miracl
243
+ name: Miracl (fi)
244
+ config: fi
245
+ split: dev
246
+ task:
247
+ type: Retrieval
248
+ metrics:
249
+ - type: ndcg_at_1
250
+ value: 0.60818
251
+ - type: ndcg_at_10
252
+ value: 0.6746
253
+ - type: ndcg_at_100
254
+ value: 0.71516
255
+ - type: ndcg_at_1000
256
+ value: 0.7218
257
+ - type: ndcg_at_20
258
+ value: 0.69692
259
+ - type: ndcg_at_3
260
+ value: 0.6006
261
+ - type: ndcg_at_5
262
+ value: 0.63842
263
+ - type: recall_at_1
264
+ value: 0.39264
265
+ - type: recall_at_10
266
+ value: 0.78577
267
+ - type: recall_at_100
268
+ value: 0.93291
269
+ - type: recall_at_1000
270
+ value: 0.97493
271
+ - type: recall_at_20
272
+ value: 0.85435
273
+ - type: recall_at_3
274
+ value: 0.61055
275
+ - type: recall_at_5
276
+ value: 0.69774
277
+ - dataset:
278
+ type: miracl/mmteb-miracl
279
+ name: Miracl (fr)
280
+ config: fr
281
+ split: dev
282
+ task:
283
+ type: Retrieval
284
+ metrics:
285
+ - type: ndcg_at_1
286
+ value: 0.3965
287
+ - type: ndcg_at_10
288
+ value: 0.49891
289
+ - type: ndcg_at_100
290
+ value: 0.56492
291
+ - type: ndcg_at_1000
292
+ value: 0.57837
293
+ - type: ndcg_at_20
294
+ value: 0.53163
295
+ - type: ndcg_at_3
296
+ value: 0.39843
297
+ - type: ndcg_at_5
298
+ value: 0.44416
299
+ - type: recall_at_1
300
+ value: 0.22644
301
+ - type: recall_at_10
302
+ value: 0.65169
303
+ - type: recall_at_100
304
+ value: 0.89786
305
+ - type: recall_at_1000
306
+ value: 0.98081
307
+ - type: recall_at_20
308
+ value: 0.75338
309
+ - type: recall_at_3
310
+ value: 0.39798
311
+ - type: recall_at_5
312
+ value: 0.51001
313
+ - dataset:
314
+ type: miracl/mmteb-miracl
315
+ name: Miracl (hi)
316
+ config: hi
317
+ split: dev
318
+ task:
319
+ type: Retrieval
320
+ metrics:
321
+ - type: ndcg_at_1
322
+ value: 0.36857
323
+ - type: ndcg_at_10
324
+ value: 0.46141
325
+ - type: ndcg_at_100
326
+ value: 0.52565
327
+ - type: ndcg_at_1000
328
+ value: 0.54319
329
+ - type: ndcg_at_20
330
+ value: 0.49384
331
+ - type: ndcg_at_3
332
+ value: 0.39469
333
+ - type: ndcg_at_5
334
+ value: 0.4184
335
+ - type: recall_at_1
336
+ value: 0.20185
337
+ - type: recall_at_10
338
+ value: 0.59474
339
+ - type: recall_at_100
340
+ value: 0.83385
341
+ - type: recall_at_1000
342
+ value: 0.94813
343
+ - type: recall_at_20
344
+ value: 0.69437
345
+ - type: recall_at_3
346
+ value: 0.38993
347
+ - type: recall_at_5
348
+ value: 0.47881
349
+ - dataset:
350
+ type: miracl/mmteb-miracl
351
+ name: Miracl (id)
352
+ config: id
353
+ split: dev
354
+ task:
355
+ type: Retrieval
356
+ metrics:
357
+ - type: ndcg_at_1
358
+ value: 0.46354
359
+ - type: ndcg_at_10
360
+ value: 0.47229
361
+ - type: ndcg_at_100
362
+ value: 0.5525
363
+ - type: ndcg_at_1000
364
+ value: 0.57648
365
+ - type: ndcg_at_20
366
+ value: 0.50606
367
+ - type: ndcg_at_3
368
+ value: 0.42538
369
+ - type: ndcg_at_5
370
+ value: 0.43717
371
+ - type: recall_at_1
372
+ value: 0.20787
373
+ - type: recall_at_10
374
+ value: 0.54771
375
+ - type: recall_at_100
376
+ value: 0.80689
377
+ - type: recall_at_1000
378
+ value: 0.94032
379
+ - type: recall_at_20
380
+ value: 0.63842
381
+ - type: recall_at_3
382
+ value: 0.36229
383
+ - type: recall_at_5
384
+ value: 0.44437
385
+ - dataset:
386
+ type: miracl/mmteb-miracl
387
+ name: Miracl (ja)
388
+ config: ja
389
+ split: dev
390
+ task:
391
+ type: Retrieval
392
+ metrics:
393
+ - type: ndcg_at_1
394
+ value: 0.56279
395
+ - type: ndcg_at_10
396
+ value: 0.6281
397
+ - type: ndcg_at_100
398
+ value: 0.67757
399
+ - type: ndcg_at_1000
400
+ value: 0.68667
401
+ - type: ndcg_at_20
402
+ value: 0.6521
403
+ - type: ndcg_at_3
404
+ value: 0.56226
405
+ - type: ndcg_at_5
406
+ value: 0.5866
407
+ - type: recall_at_1
408
+ value: 0.36648
409
+ - type: recall_at_10
410
+ value: 0.7496
411
+ - type: recall_at_100
412
+ value: 0.92461
413
+ - type: recall_at_1000
414
+ value: 0.97827
415
+ - type: recall_at_20
416
+ value: 0.82326
417
+ - type: recall_at_3
418
+ value: 0.55845
419
+ - type: recall_at_5
420
+ value: 0.63854
421
+ - dataset:
422
+ type: miracl/mmteb-miracl
423
+ name: Miracl (ko)
424
+ config: ko
425
+ split: dev
426
+ task:
427
+ type: Retrieval
428
+ metrics:
429
+ - type: ndcg_at_1
430
+ value: 0.52582
431
+ - type: ndcg_at_10
432
+ value: 0.59216
433
+ - type: ndcg_at_100
434
+ value: 0.65093
435
+ - type: ndcg_at_1000
436
+ value: 0.66204
437
+ - type: ndcg_at_20
438
+ value: 0.62427
439
+ - type: ndcg_at_3
440
+ value: 0.5373
441
+ - type: ndcg_at_5
442
+ value: 0.55886
443
+ - type: recall_at_1
444
+ value: 0.30521
445
+ - type: recall_at_10
446
+ value: 0.71159
447
+ - type: recall_at_100
448
+ value: 0.90203
449
+ - type: recall_at_1000
450
+ value: 0.96714
451
+ - type: recall_at_20
452
+ value: 0.80209
453
+ - type: recall_at_3
454
+ value: 0.515
455
+ - type: recall_at_5
456
+ value: 0.6071
457
+ - dataset:
458
+ type: miracl/mmteb-miracl
459
+ name: Miracl (ru)
460
+ config: ru
461
+ split: dev
462
+ task:
463
+ type: Retrieval
464
+ metrics:
465
+ - type: ndcg_at_1
466
+ value: 0.47524
467
+ - type: ndcg_at_10
468
+ value: 0.52349
469
+ - type: ndcg_at_100
470
+ value: 0.59725
471
+ - type: ndcg_at_1000
472
+ value: 0.61313
473
+ - type: ndcg_at_20
474
+ value: 0.55669
475
+ - type: ndcg_at_3
476
+ value: 0.46812
477
+ - type: ndcg_at_5
478
+ value: 0.48442
479
+ - type: recall_at_1
480
+ value: 0.24337
481
+ - type: recall_at_10
482
+ value: 0.62437
483
+ - type: recall_at_100
484
+ value: 0.86489
485
+ - type: recall_at_1000
486
+ value: 0.95266
487
+ - type: recall_at_20
488
+ value: 0.71411
489
+ - type: recall_at_3
490
+ value: 0.42927
491
+ - type: recall_at_5
492
+ value: 0.51258
493
+ - dataset:
494
+ type: miracl/mmteb-miracl
495
+ name: Miracl (sw)
496
+ config: sw
497
+ split: dev
498
+ task:
499
+ type: Retrieval
500
+ metrics:
501
+ - type: ndcg_at_1
502
+ value: 0.5166
503
+ - type: ndcg_at_10
504
+ value: 0.61271
505
+ - type: ndcg_at_100
506
+ value: 0.66099
507
+ - type: ndcg_at_1000
508
+ value: 0.66867
509
+ - type: ndcg_at_20
510
+ value: 0.63643
511
+ - type: ndcg_at_3
512
+ value: 0.54828
513
+ - type: ndcg_at_5
514
+ value: 0.57382
515
+ - type: recall_at_1
516
+ value: 0.35277
517
+ - type: recall_at_10
518
+ value: 0.74368
519
+ - type: recall_at_100
520
+ value: 0.92261
521
+ - type: recall_at_1000
522
+ value: 0.97109
523
+ - type: recall_at_20
524
+ value: 0.81888
525
+ - type: recall_at_3
526
+ value: 0.56739
527
+ - type: recall_at_5
528
+ value: 0.6421
529
+ - dataset:
530
+ type: miracl/mmteb-miracl
531
+ name: Miracl (te)
532
+ config: te
533
+ split: dev
534
+ task:
535
+ type: Retrieval
536
+ metrics:
537
+ - type: ndcg_at_1
538
+ value: 0.63768
539
+ - type: ndcg_at_10
540
+ value: 0.79193
541
+ - type: ndcg_at_100
542
+ value: 0.80243
543
+ - type: ndcg_at_1000
544
+ value: 0.80438
545
+ - type: ndcg_at_20
546
+ value: 0.79549
547
+ - type: ndcg_at_3
548
+ value: 0.76031
549
+ - type: ndcg_at_5
550
+ value: 0.77915
551
+ - type: recall_at_1
552
+ value: 0.63084
553
+ - type: recall_at_10
554
+ value: 0.92411
555
+ - type: recall_at_100
556
+ value: 0.97363
557
+ - type: recall_at_1000
558
+ value: 0.98833
559
+ - type: recall_at_20
560
+ value: 0.9374
561
+ - type: recall_at_3
562
+ value: 0.84159
563
+ - type: recall_at_5
564
+ value: 0.88627
565
+ - dataset:
566
+ type: miracl/mmteb-miracl
567
+ name: Miracl (th)
568
+ config: th
569
+ split: dev
570
+ task:
571
+ type: Retrieval
572
+ metrics:
573
+ - type: ndcg_at_1
574
+ value: 0.66712
575
+ - type: ndcg_at_10
576
+ value: 0.73324
577
+ - type: ndcg_at_100
578
+ value: 0.76633
579
+ - type: ndcg_at_1000
580
+ value: 0.77119
581
+ - type: ndcg_at_20
582
+ value: 0.75243
583
+ - type: ndcg_at_3
584
+ value: 0.67393
585
+ - type: ndcg_at_5
586
+ value: 0.70201
587
+ - type: recall_at_1
588
+ value: 0.47106
589
+ - type: recall_at_10
590
+ value: 0.84294
591
+ - type: recall_at_100
592
+ value: 0.95949
593
+ - type: recall_at_1000
594
+ value: 0.98874
595
+ - type: recall_at_20
596
+ value: 0.90085
597
+ - type: recall_at_3
598
+ value: 0.68456
599
+ - type: recall_at_5
600
+ value: 0.75915
601
+ - dataset:
602
+ type: miracl/mmteb-miracl
603
+ name: Miracl (yo)
604
+ config: yo
605
+ split: dev
606
+ task:
607
+ type: Retrieval
608
+ metrics:
609
+ - type: ndcg_at_1
610
+ value: 0.4958
611
+ - type: ndcg_at_10
612
+ value: 0.68705
613
+ - type: ndcg_at_100
614
+ value: 0.70664
615
+ - type: ndcg_at_1000
616
+ value: 0.71197
617
+ - type: ndcg_at_20
618
+ value: 0.698
619
+ - type: ndcg_at_3
620
+ value: 0.64793
621
+ - type: ndcg_at_5
622
+ value: 0.66709
623
+ - type: recall_at_1
624
+ value: 0.46289
625
+ - type: recall_at_10
626
+ value: 0.85154
627
+ - type: recall_at_100
628
+ value: 0.93557
629
+ - type: recall_at_1000
630
+ value: 0.97479
631
+ - type: recall_at_20
632
+ value: 0.89076
633
+ - type: recall_at_3
634
+ value: 0.7507
635
+ - type: recall_at_5
636
+ value: 0.79202
637
+ - dataset:
638
+ type: miracl/mmteb-miracl
639
+ name: Miracl (zh)
640
+ config: zh
641
+ split: dev
642
+ task:
643
+ type: Retrieval
644
+ metrics:
645
+ - type: ndcg_at_1
646
+ value: 0.47583
647
+ - type: ndcg_at_10
648
+ value: 0.52553
649
+ - type: ndcg_at_100
650
+ value: 0.6
651
+ - type: ndcg_at_1000
652
+ value: 0.61415
653
+ - type: ndcg_at_20
654
+ value: 0.55668
655
+ - type: ndcg_at_3
656
+ value: 0.45839
657
+ - type: ndcg_at_5
658
+ value: 0.48127
659
+ - type: recall_at_1
660
+ value: 0.24488
661
+ - type: recall_at_10
662
+ value: 0.63659
663
+ - type: recall_at_100
664
+ value: 0.89702
665
+ - type: recall_at_1000
666
+ value: 0.97996
667
+ - type: recall_at_20
668
+ value: 0.72652
669
+ - type: recall_at_3
670
+ value: 0.42827
671
+ - type: recall_at_5
672
+ value: 0.52081
673
+ ---
674
+ # Granite-Embedding-278m-multilingual
675
+
676
+ **Model Summary:**
677
+ Granite-Embedding-278M-Multilingual is a 278M-parameter model from the Granite Embeddings suite that can be used to generate high-quality text embeddings. The model produces embedding vectors of size 768 and is trained on a combination of open-source relevance-pair datasets with permissive, enterprise-friendly licenses and IBM-collected and IBM-generated datasets. It is developed using contrastive finetuning, knowledge distillation, and model merging for improved performance.
678
+
679
+ - **Developers:** Granite Embedding Team, IBM
680
+ - **GitHub Repository:** [ibm-granite/granite-embedding-models](https://github.com/ibm-granite/granite-embedding-models)
681
+ - **Website**: [Granite Docs](https://www.ibm.com/granite/docs/)
682
+ - **Paper:** Coming Soon
683
+ - **Release Date**: December 18th, 2024
684
+ - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
685
+
686
+ **Supported Languages:**
687
+ English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite-Embedding-278M-Multilingual for languages beyond these 12 languages.
688
+
689
+ **Intended use:**
690
+ The model is designed to produce fixed-length vector representations of a given text, which can be used for text similarity, retrieval, and search applications.
691
+
692
+ **Usage with Sentence Transformers:**
693
+ The model is compatible with the Sentence Transformers library:
694
+
695
+ First, install the Sentence Transformers library:
696
+ ```shell
697
+ pip install sentence_transformers
698
+ ```
699
+
700
+ The model can then be used to encode pairs of text and compute the similarity between their representations:
701
+
702
+ ```python
703
+ from sentence_transformers import SentenceTransformer, util
704
+
705
+ model_path = "ibm-granite/granite-embedding-278m-multilingual"
706
+ # Load the Sentence Transformer model
707
+ model = SentenceTransformer(model_path)
708
+
709
+ input_queries = [
710
+ ' Who made the song My achy breaky heart? ',
711
+ 'summit define'
712
+ ]
713
+
714
+ input_passages = [
715
+ "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991. ",
716
+ "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
717
+ ]
718
+
719
+ # encode queries and passages
720
+ query_embeddings = model.encode(input_queries)
721
+ passage_embeddings = model.encode(input_passages)
722
+
723
+ # calculate cosine similarity
724
+ print(util.cos_sim(query_embeddings, passage_embeddings))
725
+ ```
726
+
727
+ **Usage with Hugging Face Transformers:**
728
+ This is a simple example of how to use the Granite-Embedding-278m-Multilingual model with the Transformers library and PyTorch.
729
+
730
+ First, install the required libraries:
731
+ ```shell
732
+ pip install transformers torch
733
+ ```
734
+
735
+ The model can then be used to encode pairs of text:
736
+
737
+ ```python
738
+ import torch
739
+ from transformers import AutoModel, AutoTokenizer
740
+
741
+ model_path = "ibm-granite/granite-embedding-278m-multilingual"
742
+
743
+ # Load the model and tokenizer
744
+ model = AutoModel.from_pretrained(model_path)
745
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
746
+ model.eval()
747
+
748
+ input_queries = [
749
+ ' Who made the song My achy breaky heart? ',
750
+ 'summit define'
751
+ ]
752
+
753
+ # tokenize inputs
754
+ tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')
755
+
756
+ # encode queries
757
+ with torch.no_grad():
758
+     # Queries
759
+     model_output = model(**tokenized_queries)
760
+     # Perform pooling. granite-embedding-278m-multilingual uses CLS Pooling
761
+     query_embeddings = model_output[0][:, 0]
762
+
763
+ # normalize the embeddings
764
+ query_embeddings = torch.nn.functional.normalize(query_embeddings, dim=1)
765
+
766
+ ```
767
+
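Because the embeddings above are L2-normalized, the cosine similarity between a query and a passage reduces to a dot product. The following is a minimal, dependency-free sketch of that scoring step. The helper names `l2_normalize` and `similarity_matrix` and the toy 3-dimensional vectors are illustrative assumptions, not part of the model's API; they stand in for the 768-dimensional embeddings the model produces.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(queries, passages):
    """Cosine similarity of every query against every passage."""
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    # After normalization, cosine similarity is just the dot product.
    return [[sum(a * b for a, b in zip(qv, pv)) for pv in p] for qv in q]


# Toy vectors (assumption): placeholders for real query/passage embeddings.
toy_queries = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_passages = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

scores = similarity_matrix(toy_queries, toy_passages)
for row in scores:
    print([round(s, 4) for s in row])
```

In a retrieval setting, each row of `scores` would be sorted in descending order to rank passages for the corresponding query.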
768
+ **Evaluation:**
769
+ The average performance of Granite-Embedding-278M-Multilingual on Multilingual Miracl (across 18 languages), Mintaka Retrieval (across 8 languages), and MTEB Retrieval for English (across 15 tasks), German (across 4 tasks), Spanish (across 2 tasks), French (across 5 tasks), Japanese (across 2 tasks), Arabic (1 task), Korean (1 task), and Chinese (across 8 tasks) is reported below.
770
+
771
+ | Model | Parameters (M) | Embedding Dimension | Miracl (18) | Mintaka Retrieval (8) | MTEB English (15) | MTEB German (4) | MTEB Spanish (2) | MTEB French (5) | MTEB Japanese (2) | MTEB Arabic (1) | MTEB Korean (1) | MTEB Chinese (8) |
772
+ |:-----------------------------------|:------------:|:-------------------:|:-------------:| :---------------------:|:-----------------:|:---------------:|:---------------:|:---------------:|:----------------:|:----------------:|:---------------:|:----------------:|
773
+ |granite-embedding-278M-multilingual | 278 | 768 | 58.3 | 23.2 | 48.2 | 71.2 | 52.6 | 54.1 | 61.7 | 64.2 | 71.8 | 45.2 |
774
+
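The model-index metadata of this card reports `ndcg_at_k` and `recall_at_k` for each Miracl language. As a rough illustration of what these retrieval metrics measure, here is a minimal sketch assuming binary relevance labels. The function names and the toy ranking are assumptions made for illustration; the evaluation toolkit behind the reported numbers may differ in details such as graded relevance or tie handling.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Normalized discounted cumulative gain with binary relevance."""
    # DCG: each relevant hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal DCG: all relevant documents ranked first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (assumption): a ranked result list and its relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

Both metrics lie between 0 and 1. nDCG rewards placing relevant documents near the top of the ranking, while recall only checks whether they appear within the cutoff.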
775
+ **Model Architecture:**
776
+ Granite-Embedding-278m-Multilingual is based on an encoder-only, XLM-RoBERTa-like transformer architecture, trained internally at IBM Research.
777
+
778
+ | Model | granite-embedding-30m-english | granite-embedding-125m-english | granite-embedding-107M-multilingual | granite-embedding-278m-multilingual |
779
+ | :-------- | :-------:| :-------: | :---------:| :-----:|
780
+ | Embedding size | 384 | 768 | 384 | **768** |
781
+ | Number of layers | 6 | 12 | 6 | **12** |
782
+ | Number of attention heads | 12 | 12 | 12 | **12** |
783
+ | Intermediate size | 1536 | 3072 | 1536 | **3072** |
784
+ | Activation Function | GeLU | GeLU | GeLU | **GeLU** |
785
+ | Vocabulary Size | 50265 | 50265 | 250002 | **250002** |
786
+ | Max. Sequence Length | 512 | 512 | 512 | **512** |
787
+ | # Parameters | 30M | 125M | 107M | **278M** |
788
+
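The hyperparameters in the table are roughly consistent with the stated parameter count. The sketch below estimates the count for the 278M column, assuming a standard BERT/RoBERTa-style parameter layout (biases on all linear layers, one token-type embedding row, two LayerNorms per layer, and a pooler layer). These layout details are assumptions and are not confirmed by this card; the function name is hypothetical.

```python
def estimate_encoder_params(vocab_size, hidden, layers, intermediate,
                            max_positions, include_pooler=True):
    """Rough parameter count for a BERT/RoBERTa-style encoder (assumed layout)."""
    # Token, position, and a single token-type embedding row, plus embedding LayerNorm.
    embeddings = vocab_size * hidden + max_positions * hidden + hidden
    embeddings += 2 * hidden
    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: up-projection and down-projection with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden if include_pooler else 0
    return embeddings + layers * per_layer + pooler


# Values taken from the granite-embedding-278m-multilingual column above.
total = estimate_encoder_params(
    vocab_size=250002, hidden=768, layers=12, intermediate=3072, max_positions=512
)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # about 278M under these assumptions
```

Under these assumptions, roughly 192M of the parameters sit in the token embedding matrix, which is why the multilingual models are much larger than the English models with the same number of layers.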
789
+
790
+ **Training Data:**
791
+ Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:
792
+
793
+ | **Dataset** | **Num. Pairs** |
794
+ |:--------------------------------------------------------------------------|:--------------:|
795
+ | Multilingual MC4 | 52,823,484 |
796
+ | Multilingual Webhose | 12,369,322 |
797
+ | English Wikipedia | 20,745,403 |
798
+ | Multilingual Wikimedia | 2,911,090 |
799
+ | Miracl Corpus (Title-Body) | 10,120,398 |
800
+ | Stack Exchange Duplicate questions (titles) | 304,525 |
802
+ | Stack Exchange Duplicate questions (bodies) | 250,519 |
803
+ | Machine Translations of Stack Exchange Duplicate questions (titles) | 187,195 |
804
+ | Stack Exchange (Title, Answer) pairs | 4,067,139 |
805
+ | Stack Exchange (Title, Body) pairs | 23,978,013 |
807
+ | Machine Translations of Stack Exchange (Title+Body, Answer) pairs | 1,827,15 |
808
+ | SearchQA | 582,261 |
809
+ | S2ORC (Title, Abstract) | 41,769,185 |
810
+ | WikiAnswers Duplicate question pairs | 77,427,422 |
811
+ | CCNews | 614,664 |
812
+ | XSum | 226,711 |
813
+ | SimpleWiki | 102,225 |
814
+ | Machine Translated Cross Lingual Parallel Corpora | 28,376,115 |
815
+ | SPECTER citation triplets | 684,100 |
816
+ | Machine Translations of SPECTER citation triplets | 4,104,600 |
817
+ | Natural Questions (NQ) | 100,231 |
818
+ | SQuAD2.0 | 87,599 |
819
+ | HotpotQA | 85,000 |
820
+ | Fever | 109,810 |
821
+ | PubMed | 20,000,000 |
822
+ | Multilingual Miracl Triples | 81,409 |
823
+ | Multilingual MrTydi Triples | 48,715 |
824
+ | Sadeem Question Answering                                                 |     4,037      |
825
+ | DBPedia Title-Body Pairs | 4,635,922 |
826
+ | Synthetic: English Query-Wikipedia Passage | 1,879,093 |
827
+ | Synthetic: English Fact Verification | 9,888 |
828
+ | Synthetic: Multilingual Query-Wikipedia Passage | 300,266 |
829
+ | Synthetic: Multilingual News Summaries | 37,489 |
830
+ | IBM Internal Triples | 40,290 |
831
+ | IBM Internal Title-Body Pairs | 1,524,586 |
832
+
833
+ Notably, we do not use the popular MS-MARCO retrieval dataset in our training corpus because of its non-commercial license, although other open-source models train on it for its high quality.
834
+
835
+ **Infrastructure:**
836
+ We train Granite Embedding Models on IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training our models over multiple GPUs.
837
+
838
+ **Ethical Considerations and Limitations:**
839
+ The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-Embedding-278m-Multilingual is trained only for the 12 supported languages listed above and has a context length of 512 tokens (longer texts are truncated to this size).
840
+
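Because of the 512-token limit, text beyond that length is dropped at encoding time. One common workaround is to split a long token sequence into overlapping windows and embed each window separately. The sketch below is only an illustration: `split_into_windows` is a hypothetical helper, the stride of 256 is an arbitrary assumption, and in practice the window should be somewhat shorter than 512 to leave room for the special tokens the tokenizer adds.

```python
def split_into_windows(token_ids, max_len=512, stride=256):
    """Split a token-id sequence into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


# Toy example (assumption): 1000 placeholder token ids instead of real tokenizer output.
toy_tokens = list(range(1000))
toy_windows = split_into_windows(toy_tokens)
print([len(w) for w in toy_windows])
```

How the per-window embeddings are combined afterwards (for example, indexing each window as its own passage) is an application-level choice that this card does not prescribe.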
841
+
842
+ <!-- ## Citation
843
+ ```
844
+ @misc{granite-embedding-models,
845
+ author = {author 1, author2, ...},
846
+ title = {},
847
+ journal = {},
848
+ volume = {},
849
+ year = {2024},
850
+ url = {https://arxiv.org/abs/0000.00000},
851
+ }
852
+ ``` -->