---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
---

## PEARL-base

[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf).
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/).
Accepted by EACL Findings 2024.

PEARL-base is fine-tuned from [E5-base](https://huggingface.co/intfloat/e5-base-v2) and can yield better representations for phrases and strings.
If you need to compute the semantic similarity of strings, PEARL may be a good fit.
It can produce powerful embeddings for various tasks, such as string matching, entity retrieval, entity clustering, and fuzzy join.

| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

## Usage

Below is an example of entity retrieval, and we reuse the code from E5.
```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts (uses the module-level `tokenizer` loaded below)
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_base')
model = AutoModel.from_pretrained('Lihuchen/pearl_base')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity (cosine similarity of the query against each document, scaled by 100)
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[85.61601257324219, 73.65624237060547, 70.36172485351562]]
```

## Training and Evaluation

Have a look at our code on [GitHub](https://github.com/tigerchen52/PEARL).

## Citation

If you find our work useful, please give us a citation:

```
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```
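
## Example: Fuzzy Join (Sketch)

Fuzzy join, one of the tasks listed in the introduction, can be approached as a nearest-neighbor match over phrase embeddings. The following is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `cosine` and `fuzzy_join`, the `threshold` parameter, and the toy 2-D vectors are all hypothetical. In practice the vectors would be PEARL embeddings such as those returned by `encode_text` in the Usage section.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fuzzy_join(
    left: Dict[str, Sequence[float]],
    right: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # hypothetical cutoff on the raw cosine scale
) -> List[Tuple[str, Optional[str], float]]:
    """Pair each left string with its most similar right string.

    Returns (left_name, right_name_or_None, best_score) per left entry.
    The right name is None when the best score is below `threshold`.
    """
    matches = []
    for l_name, l_vec in left.items():
        best_name, best_score = None, -1.0
        for r_name, r_vec in right.items():
            score = cosine(l_vec, r_vec)
            if score > best_score:
                best_name, best_score = r_name, score
        if best_score < threshold:
            best_name = None  # no sufficiently close candidate
        matches.append((l_name, best_name, best_score))
    return matches


# Toy vectors made up for illustration; they are NOT real PEARL outputs.
left = {"The New York Times": [1.0, 0.1], "Le Monde": [0.0, 1.0]}
right = {"NYTimes": [0.9, 0.2], "New York Post": [0.6, 0.6]}

matches = fuzzy_join(left, right, threshold=0.9)
for l_name, r_name, score in matches:
    print(f"{l_name!r} -> {r_name!r} (cosine={score:.3f})")
```

With these toy vectors, "The New York Times" is paired with "NYTimes" and "Le Monde" is left unmatched. The threshold here is on the raw cosine scale in [-1, 1], not the scaled-by-100 scores printed in the Usage example, and a suitable value would need to be tuned per dataset. The nested loop is quadratic in the table sizes, so a larger join would likely call for batched matrix products or an approximate nearest-neighbor index instead.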