---
license: apache-2.0
language:
  - en
tags:
  - Phrase Representation
  - String Matching
  - Fuzzy Join
---

# PEARL-small

Learning High-Quality and General-Purpose Phrase Representations.
Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek.
Accepted to Findings of EACL 2024

PEARL-small is a variant of E5-small fine-tuned on our constructed context-free dataset to yield better representations for phrases and strings.
If you need to compute semantic similarity between strings, the PEARL models might be a helpful tool.
They offer powerful embeddings suitable for tasks such as string matching, entity retrieval, entity clustering, and fuzzy join.

🤗 PEARL-small 🤗 PEARL-base

| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

## Usage

Below is an example of entity retrieval; the code is adapted from E5.

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, masking out padding positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate cosine similarity between the query and each candidate (scaled by 100)
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected output
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```
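
Building on the snippet above, here is a minimal sketch of how the same embeddings could be used for a simple fuzzy join between two lists of strings. The `fuzzy_join` helper, the similarity threshold, and the example strings are illustrative assumptions rather than part of the original PEARL code; the sketch reuses `model`, `tokenizer`, and `encode_text` from the example above.

```python
def fuzzy_join(left_strings, right_strings, threshold=80.0):
    # Hypothetical helper: match each left string to its most similar right string.
    # `threshold` is an arbitrary cut-off on the 0-100 similarity scale used above.
    left_emb = F.normalize(encode_text(model, left_strings), p=2, dim=1)
    right_emb = F.normalize(encode_text(model, right_strings), p=2, dim=1)
    scores = (left_emb @ right_emb.T) * 100       # pairwise cosine similarities
    best_scores, best_idx = scores.max(dim=1)     # best right-side match per left string
    return [
        (left_strings[i], right_strings[best_idx[i].item()], best_scores[i].item())
        for i in range(len(left_strings))
        if best_scores[i].item() >= threshold
    ]


# example with made-up strings
matches = fuzzy_join(["Intl. Business Machines", "Alphabet Inc"],
                     ["IBM", "Google", "Amazon"])
print(matches)
```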

## Training and Evaluation

Have a look at our code on GitHub.

## Citation

If you find our work useful, please cite it as follows:

```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```