---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
---
## PEARL-small
[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br>
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/). 
Accepted at Findings of EACL 2024 <br>

PEARL-small is a lightweight string embedding model. It is the tool of choice for semantic similarity computation over strings,
producing high-quality embeddings for string matching, entity retrieval, entity clustering, fuzzy join, and related tasks.
<br>
It differs from typical sentence embedders in that it adds a character-level representation, giving good support for open-vocabulary strings.
The model is a variant of [E5-small](https://huggingface.co/intfloat/e5-small-v2) fine-tuned on our constructed context-free [dataset](https://zenodo.org/records/10676475) to yield better representations
for phrases and strings. <br>

🤗 [PEARL-small](https://huggingface.co/Lihuchen/pearl_small) 🤗 [PEARL-base](https://huggingface.co/Lihuchen/pearl_base)
<br>


Results on nine phrase- and string-level benchmarks (higher is better):

| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|-------|------|-----|------|---------------|--------|------|------|------|-------|--------|--------|
| FastText  |-| 40.3| 94.4  | 61.2  |  59.6  | 58.9  |16.9|14.5|3.0|0.2| 53.6|
| Sentence-BERT  |110M|50.1| 94.6  | 66.8  | 50.4  | 62.6  | 21.6|23.6|25.5|48.4| 57.2| 
| Phrase-BERT  |110M|54.5|  96.8  |  68.7  | 57.2  |  68.8  |23.7|26.1|35.4| 59.5|66.9| 
| E5-small  |34M|57.0|  96.0| 56.8|55.9| 63.1|43.3| 42.0|27.6| 53.7|74.8|
|E5-base|110M| 61.1| 95.4|65.6|59.4|66.3| 47.3|44.0|32.0| 69.3|76.1|
|PEARL-small|34M| 62.5| 97.0|70.2|57.9|68.1| 48.1|44.5|42.4|59.3|75.2|
|PEARL-base|110M|64.8|97.3|72.2|59.7|72.6|50.7|45.8|39.3|69.4|77.1|

## Usage

Below is an example of entity retrieval; we reuse the encoding code from E5.

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padded positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    # Forward pass, then mean-pool the last hidden states
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = [ "NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate cosine similarity between the query and each candidate (scaled by 100)
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```
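
The same embeddings can also drive a simple fuzzy join between two lists of strings. Below is a minimal sketch that reuses `encode_text`, `model`, `tokenizer`, and `F` from the snippet above; the `left`/`right` lists and the 0.8 similarity threshold are illustrative choices of ours, not values from the paper.

```python
import torch

# Hypothetical tables to join; the strings and the 0.8 cutoff are illustrative
left = ["The New York Times", "Harvard University"]
right = ["NYTimes", "New York Post", "Harvard Univ.", "Yale"]

with torch.no_grad():
    emb = F.normalize(encode_text(model, left + right), p=2, dim=1)

left_emb, right_emb = emb[:len(left)], emb[len(left):]
sim = left_emb @ right_emb.T  # pairwise cosine similarities

best_scores, best_idx = sim.max(dim=1)
for i in range(len(left)):
    score, j = best_scores[i].item(), best_idx[i].item()
    match = right[j] if score > 0.8 else None  # keep only confident matches
    print(f"{left[i]} -> {match} (score={score:.2f})")
```

For large tables, you would typically replace the exhaustive `left_emb @ right_emb.T` product with an approximate nearest-neighbor index.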

## Training and Evaluation
Have a look at our training and evaluation code on [GitHub](https://github.com/tigerchen52/PEARL).



## Citation

If you find our work useful, please cite it as follows:

```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```