---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
---
# PEARL-small

Learning High-Quality and General-Purpose Phrase Representations.
Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek.
Accepted at EACL Findings 2024.

PEARL-small is fine-tuned from E5-small and yields better phrase representations for downstream tasks such as entity clustering, entity retrieval, and fuzzy join.
| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
## Usage

Below is an example of entity retrieval, reusing the encoding code from E5.
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected output
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```
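
The same embeddings can drive the fuzzy-join use case mentioned above: embed the phrases from both tables, normalize, and keep each left-hand phrase's nearest right-hand phrase if it clears a similarity threshold. Below is a minimal sketch that reuses `encode_text`, `model`, and `F` from the snippet above; the two phrase lists and the 0.8 threshold are illustrative assumptions, not values from the paper.

```python
# Minimal fuzzy-join sketch reusing encode_text from the example above.
# The phrase lists and the 0.8 threshold are illustrative assumptions.
left_phrases = ["Massachusetts Institute of Technology", "The New York Times"]
right_phrases = ["MIT", "NYTimes", "New York Post"]

left_emb = F.normalize(encode_text(model, left_phrases), p=2, dim=1)
right_emb = F.normalize(encode_text(model, right_phrases), p=2, dim=1)

# Cosine similarity between every left/right pair
sim = left_emb @ right_emb.T

# Keep the best right-hand match for each left-hand phrase if it clears the threshold
best_scores, best_idx = sim.max(dim=1)
for phrase, score, idx in zip(left_phrases, best_scores.tolist(), best_idx.tolist()):
    if score >= 0.8:
        print(f"{phrase} -> {right_phrases[idx]} (cosine similarity {score:.2f})")
```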
## Training and Evaluation

Have a look at our code on GitHub.
## Citation
If you find our work useful, please give us a citation:
```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```