---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
---

# PEARL-base
Learning High-Quality and General-Purpose Phrase Representations.
Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek.
Accepted at Findings of EACL 2024.
PEARL-base is fine-tuned from E5-base and yields better representations for phrases and short strings.
If you need to compute semantic similarity between strings, PEARL may be a helpful tool.
It provides embeddings suited to tasks such as string matching, entity retrieval, entity clustering, and fuzzy joins.
| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
## Usage
Below is an example of entity retrieval; the code is adapted from E5.
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_base')
model = AutoModel.from_pretrained('Lihuchen/pearl_base')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
```
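The same embeddings can also drive a simple fuzzy join: encode both tables and link each string to its nearest neighbour by cosine similarity. Below is a minimal sketch (not part of the original example) that reuses `encode_text`, `tokenizer`, and `model` from the snippet above; the table contents are made up for illustration.

```python
import torch.nn.functional as F

# Two toy tables to join; the goal is to link each left-hand string
# to its closest right-hand string in embedding space.
left_table = ["microsoft corp", "Apple Inc.", "Amazon.com"]
right_table = ["Microsoft Corporation", "apple incorporated", "Amazon", "Alphabet Inc."]

# Encode both tables and L2-normalize so the dot product equals cosine similarity.
all_emb = F.normalize(encode_text(model, left_table + right_table), p=2, dim=1)
left_emb, right_emb = all_emb[:len(left_table)], all_emb[len(left_table):]

# 1-nearest-neighbour join: pick the most similar right-hand string for each left-hand string.
scores = left_emb @ right_emb.T
for i, j in enumerate(scores.argmax(dim=1).tolist()):
    print(f"{left_table[i]!r} -> {right_table[j]!r} (score={scores[i, j].item():.2f})")
```

For larger tables, the exhaustive similarity matrix can be replaced by an approximate nearest-neighbour index (e.g. FAISS).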
## Training and Evaluation
Have a look at our code on GitHub.
## Citation
If you find our work useful, please cite our paper:
```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```