---
license: apache-2.0
language:
  - en
tags:
  - Phrase Representation
  - String Matching
  - Fuzzy Join
---

# PEARL-base

*Learning High-Quality and General-Purpose Phrase Representations.*
Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek.
Accepted to Findings of EACL 2024.

PEARL-base is fine-tuned from E5-base and yields better representations for phrases and short strings.
If you need to compute the semantic similarity of strings, consider our PEARL model.
It produces strong embeddings for a range of tasks, such as string matching, entity retrieval, entity clustering, and fuzzy join.

| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|-------|------|-----|------|---------------|--------|------|------|------|-------|--------|--------|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

## Usage

Below is an example of entity retrieval; the code is adapted from E5.

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(tokenizer, model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_base')
model = AutoModel.from_pretrained('Lihuchen/pearl_base')

# encode the query and the documents in one batch
embeddings = encode_text(tokenizer, model, input_texts)

# L2-normalize, then score each document against the query by cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected output
# [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
```
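The same embeddings also support the matching-style tasks mentioned above. As a rough illustration (not part of the original example), here is a minimal fuzzy-join sketch that reuses `encode_text`, `tokenizer`, and `model` from above to match each string in one list to its nearest neighbor in another; the example strings and the 0.8 cosine threshold are made-up values for illustration, not tuned settings.

```python
# Minimal fuzzy-join sketch reusing encode_text, tokenizer, and model from above.
# The strings and the 0.8 threshold are illustrative assumptions.
left = ["apple inc.", "Mircosoft", "Amazon.com"]   # "Mircosoft" is an intentional misspelling
right = ["Apple", "Microsoft Corporation", "Amazon", "IBM"]

left_emb = F.normalize(encode_text(tokenizer, model, left), p=2, dim=1)
right_emb = F.normalize(encode_text(tokenizer, model, right), p=2, dim=1)

sim = left_emb @ right_emb.T  # pairwise cosine similarities, shape (3, 4)
for i, j in enumerate(sim.argmax(dim=1).tolist()):
    if sim[i, j] >= 0.8:  # keep only confident matches
        print(left[i], "->", right[j])
```

In practice you would tune the threshold on held-out pairs, or keep the top-k candidates per string instead of applying a hard cutoff.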

## Training and Evaluation

Have a look at our code on GitHub.

## Citation

If you find our work useful, please cite our paper:

```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```