 ---
 license: apache-2.0
+language:
+- en
+tags:
+- Phrase Representation
+- String Matching
+- Fuzzy Join
+pipeline_tag: sentence-similarity
 ---
+## PEARL-small
+[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br>
+[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/), EACL Findings 2024
+PEARL-small is fine-tuned from [E5-small](https://huggingface.co/intfloat/e5-small-v2)
+and yields better phrase representations for downstream tasks such as entity clustering, entity retrieval, and fuzzy join.
+| Model | Size | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ | Avg |
+|---|---|---|---|---|---|---|---|---|---|---|---|
+| FastText | - | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 | 40.3 |
+| Sentence-BERT | 110M | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 | 50.1 |
+| Phrase-BERT | 110M | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 | 54.5 |
+| E5-small | 34M | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 | 57.0 |
+| E5-base | 110M | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 | 61.1 |
+| PEARL-small | 34M | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 | 62.5 |
+| PEARL-base | 110M | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 | 64.8 |
+## Usage
+The example below performs entity retrieval: it scores each candidate phrase against a query phrase. The pooling and encoding code is reused from E5.
+```python
+import torch
+import torch.nn.functional as F
+from torch import Tensor
+from transformers import AutoTokenizer, AutoModel
+def average_pool(last_hidden_states: Tensor,
+                 attention_mask: Tensor) -> Tensor:
+    # Zero out padding positions, then average over the real tokens only
+    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
+    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
+def encode_text(model, input_texts):
+    # Tokenize the input texts (uses the module-level `tokenizer` loaded below)
+    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
+    # Inference only, so skip gradient tracking
+    with torch.no_grad():
+        outputs = model(**batch_dict)
+    # Mean-pool the token embeddings into one vector per input text
+    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
+    return embeddings
+query_texts = ["The New York Times"]
+doc_texts = [ "NYTimes", "New York Post", "New York"]
+input_texts = query_texts + doc_texts
+tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
+model = AutoModel.from_pretrained('Lihuchen/pearl_small')
+# encode
+embeddings = encode_text(model, input_texts)
+# calculate similarity
+embeddings = F.normalize(embeddings, p=2, dim=1)
+scores = (embeddings[:1] @ embeddings[1:].T) * 100
+print(scores.tolist())
+# expected outputs
+# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
+```
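The scores printed above are cosine similarities scaled by 100, so retrieval comes down to sorting the candidates by score. The snippet below is a minimal sketch of that last step in plain Python. The helpers `rank_candidates` and `best_match` and the `min_score` cutoff of 80 are hypothetical names and values chosen for illustration; they are not part of PEARL or E5, and a real fuzzy-join threshold would need tuning on your own data.

```python
from typing import List, Optional, Tuple


def rank_candidates(scores: List[float], candidates: List[str]) -> List[Tuple[str, float]]:
    """Pair each candidate with its score and sort by descending similarity."""
    if len(scores) != len(candidates):
        raise ValueError("scores and candidates must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


def best_match(scores: List[float], candidates: List[str],
               min_score: float = 80.0) -> Optional[str]:
    """Return the top candidate, or None if it scores below min_score.

    min_score is an illustrative cutoff (an assumption), not a recommended value.
    """
    ranked = rank_candidates(scores, candidates)
    if not ranked or ranked[0][1] < min_score:
        return None
    return ranked[0][0]


# Scores copied from the expected output above (one query, three candidates).
doc_texts = ["NYTimes", "New York Post", "New York"]
query_scores = [90.56318664550781, 79.65763854980469, 75.52054595947266]

ranking = rank_candidates(query_scores, doc_texts)
match = best_match(query_scores, doc_texts)
print(ranking)
print(match)
```

In the full pipeline, `query_scores` would be `scores.tolist()[0]` from the example above, with one row of scores per query.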
+## Training and Evaluation
+The training and evaluation code is available on [GitHub](https://github.com/tigerchen52/PEARL).
+## Citation
+If you find our work useful, please cite our paper:
+```
+@article{chen2024learning,
+  title={Learning High-Quality and General-Purpose Phrase Representations},
+  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
+  journal={arXiv preprint arXiv:2401.10407},
+  year={2024}
+}
+```