---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
---

## PEARL-small

[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br>
[Lihu Chen](chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/).
<br> Accepted to Findings of EACL 2024

PEARL-base is fine-tuned from [E5-base](https://huggingface.co/intfloat/e5-base-v2)
and yields better phrase representations for downstream tasks such as entity clustering, entity retrieval, and fuzzy join.

| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

## Usage

Below is an example of entity retrieval. The code is reused from E5.

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, input_texts):
    # Tokenize the input texts (uses the module-level `tokenizer` defined below)
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_base')
model = AutoModel.from_pretrained('Lihuchen/pearl_base')

# encode
embeddings = encode_text(model, input_texts)

# calculate cosine similarity between the query and each document, scaled by 100
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
```
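
To turn the similarity scores into a retrieval result, the candidates can be sorted by score. The following is a minimal sketch in plain Python, not part of the original code: `rank_candidates` and its `min_score` cut-off are hypothetical names introduced here for illustration, and the scores are copied from the expected output above, so the snippet runs without loading the model.

```python
from typing import List, Tuple


def rank_candidates(scores: List[float], candidates: List[str],
                    min_score: float = 0.0) -> List[Tuple[str, float]]:
    """Pair each candidate with its score and sort from best to worst match."""
    if len(scores) != len(candidates):
        raise ValueError("scores and candidates must have the same length")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    # Drop weak matches; min_score is a hypothetical cut-off to tune per dataset.
    return [(text, score) for text, score in ranked if score >= min_score]


# Scores copied from the expected output above (one query, three candidates).
example_scores = [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
doc_texts = ["NYTimes", "New York Post", "New York"]

ranking = rank_candidates(example_scores[0], doc_texts)
best_match, best_score = ranking[0]
print(best_match)  # NYTimes
```

In practice, `scores.tolist()[0]` from the retrieval example would take the place of `example_scores[0]`. A non-zero `min_score` may help in fuzzy-join settings where some queries have no true match, but a suitable value is an assumption to validate on your own data.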

## Training and Evaluation

Our training and evaluation code is available on [GitHub](https://github.com/tigerchen52/PEARL).

## Citation

If you find our work useful, please cite our paper:

```
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```