license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
PEARL-base
Learning High-Quality and General-Purpose Phrase Representations.
Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek.
Accepted by EACL Findings 2024
PEARL-base is a lightweight string embedding model. It is designed for computing semantic similarity between strings,
and its embeddings are suited to string matching, entity retrieval, entity clustering, and fuzzy join.
It differs from typical sentence embedders because it incorporates phrase type information and morphological features,
allowing it to better capture variations in strings.
The model is a variant of E5-base finetuned on our constructed context-free dataset to yield better representations
for phrases and strings.
Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
---|---|---|---|---|---|---|---|---|---|---|---|
FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
Usage
The example below performs entity retrieval. It reuses the code from E5.
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings
query_texts = ["The New York Times"]
doc_texts = [ "NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts
tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_base')
model = AutoModel.from_pretrained('Lihuchen/pearl_base')
# encode
embeddings = encode_text(model, input_texts)
# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
# expected outputs
# [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
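To make the two numeric steps in the example above easier to follow, here is a minimal sketch of masked average pooling and the scaled cosine score. It uses plain Python lists so that it runs without any dependencies. The helper names (`masked_mean_pool`, `l2_normalize`, `score`) and the toy 2-dimensional vectors are made up for illustration; they are not part of the model or the E5 code.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def score(query, doc):
    """Cosine similarity scaled by 100, mirroring the retrieval example."""
    q, d = l2_normalize(query), l2_normalize(doc)
    return 100 * sum(a * b for a, b in zip(q, d))

# Toy "phrase": two real tokens followed by one padding position (mask 0).
tokens = [[1.0, 3.0], [3.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)  # padding row is excluded: [2.0, 2.0]

# Hypothetical candidate embeddings, chosen only to show the ranking step.
candidates = {"a": [1.0, 1.0], "b": [1.0, 0.0], "c": [-1.0, 1.0]}
scores = {name: score(pooled, vec) for name, vec in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the vectors are normalized before the dot product, the score depends only on direction, not on vector length, which is why the padded row has to be excluded before averaging rather than after.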
Training and Evaluation
Have a look at our code on GitHub.
Citation
If you find our work useful, please cite it:
@article{chen2024learning,
title={Learning High-Quality and General-Purpose Phrase Representations},
author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
journal={arXiv preprint arXiv:2401.10407},
year={2024}
}