---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
---
## PEARL-base

[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf) <br>
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/) <br>
Accepted by EACL Findings 2024
PEARL-base is finetuned from [E5-base](https://huggingface.co/intfloat/e5-base-v2) and yields better representations for various downstream tasks such as entity clustering, entity retrieval, and fuzzy join.
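Of the downstream tasks just mentioned, entity clustering amounts to grouping phrase embeddings by similarity. A minimal sketch of the idea using greedy single-link grouping over cosine similarity; the phrases, the hand-made 3-d vectors, and the 0.8 threshold are illustrative stand-ins, not real PEARL outputs:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for PEARL phrase embeddings (hand-made, 3-d)
phrases = ["NYC", "New York City", "LA", "Los Angeles"]
emb = torch.tensor([[1.0, 0.1, 0.0],
                    [0.9, 0.2, 0.0],
                    [0.0, 0.1, 1.0],
                    [0.1, 0.0, 0.9]])

# Normalize so the dot product equals cosine similarity
emb = F.normalize(emb, p=2, dim=1)
sims = emb @ emb.T

# Greedy single-link clustering: a phrase joins the first cluster that
# contains any member with similarity above the threshold
threshold = 0.8
clusters = []  # each cluster is a list of phrase indices
for i in range(len(phrases)):
    for cluster in clusters:
        if any(sims[i, j] >= threshold for j in cluster):
            cluster.append(i)
            break
    else:
        clusters.append([i])

print([[phrases[i] for i in c] for c in clusters])
# → [['NYC', 'New York City'], ['LA', 'Los Angeles']]
```

With real PEARL embeddings the same loop applies unchanged; any off-the-shelf clustering algorithm over the normalized vectors works equally well.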
| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

## Usage

Below is an example of entity retrieval; the code is adapted from E5.

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts (uses the global tokenizer defined below)
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_base')
model = AutoModel.from_pretrained('Lihuchen/pearl_base')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
```
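The same embeddings can also drive a fuzzy join: encode the strings of both tables, match each left record to its most similar right record, and keep only pairs above a similarity threshold. A minimal sketch of that matching step; `fuzzy_join`, the hand-made vectors, and the 0.8 threshold are illustrative stand-ins (real vectors would come from the encoder, as in the retrieval example above):

```python
import torch
import torch.nn.functional as F


def fuzzy_join(left, right, left_emb, right_emb, threshold=0.8):
    # Normalize so the dot product equals cosine similarity
    l_norm = F.normalize(left_emb, p=2, dim=1)
    r_norm = F.normalize(right_emb, p=2, dim=1)
    sims = l_norm @ r_norm.T                 # shape: (len(left), len(right))
    best_scores, best_idx = sims.max(dim=1)  # nearest right entry per left entry
    return [(left[i], right[int(best_idx[i])], float(best_scores[i]))
            for i in range(len(left)) if float(best_scores[i]) >= threshold]


left = ["NYTimes", "WaPo"]
right = ["The New York Times", "The Washington Post", "New York"]
# Hand-made 3-d vectors standing in for phrase embeddings
left_emb = torch.tensor([[0.9, 0.1, 0.0], [0.1, 0.9, 0.1]])
right_emb = torch.tensor([[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.7, 0.0, 0.7]])

for l_name, r_name, score in fuzzy_join(left, right, left_emb, right_emb):
    print(f"{l_name} -> {r_name} ({score:.2f})")
```

The join itself is just thresholded nearest-neighbor search over the embedding space.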

## Training and Evaluation

Have a look at our code on [GitHub](https://github.com/tigerchen52/PEARL).

## Citation

If you find our work useful, please give us a citation:

```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```