	---
	license: apache-2.0
	language:
	- en
	tags:
	- Phrase Representation
	- String Matching
	- Fuzzy Join
	---
	## PEARL-small
	[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br>
	[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/).
	<br> Accepted to Findings of EACL 2024 <br>

	PEARL-small is a variant of [E5-small](https://huggingface.co/intfloat/e5-small-v2), fine-tuned on our context-free [dataset](https://zenodo.org/records/10676475) to produce better representations
	of phrases and strings. <br>
	If you need to compute semantic similarity between strings, PEARL may be a helpful tool.<br>
	Its embeddings are suited to tasks such as string matching, entity retrieval, entity clustering, and fuzzy join.

	🤗 [PEARL-small](https://huggingface.co/Lihuchen/pearl_small) 🤗 [PEARL-base](https://huggingface.co/Lihuchen/pearl_base)
	<br>


	| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
	|---------------|------|------|------|---------------|--------|------|------|------|-------|--------|--------|
	| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
	| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
	| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
	| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
	| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
	| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
	| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |

	## Usage

	The example below performs entity retrieval, reusing the code from E5.

	```python
	import torch
	import torch.nn.functional as F

	from torch import Tensor
	from transformers import AutoTokenizer, AutoModel


	def average_pool(last_hidden_states: Tensor,
	                 attention_mask: Tensor) -> Tensor:
	    # Zero out padding positions, then average over the real tokens only
	    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
	    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


	def encode_text(model, tokenizer, input_texts):
	    # Tokenize the input texts
	    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

	    # Inference only, so skip gradient tracking
	    with torch.no_grad():
	        outputs = model(**batch_dict)
	    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

	    return embeddings


	query_texts = ["The New York Times"]
	doc_texts = ["NYTimes", "New York Post", "New York"]
	input_texts = query_texts + doc_texts

	tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
	model = AutoModel.from_pretrained('Lihuchen/pearl_small')

	# encode the query and the candidate strings together
	embeddings = encode_text(model, tokenizer, input_texts)

	# calculate cosine similarity (scaled by 100) between the query and each candidate
	embeddings = F.normalize(embeddings, p=2, dim=1)
	scores = (embeddings[:1] @ embeddings[1:].T) * 100
	print(scores.tolist())

	# expected outputs
	# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
	```
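
For string matching or fuzzy join, the score matrix from the example above can be turned into matches by keeping the best-scoring candidate for each query and discarding weak ones. The following is a minimal sketch and not part of the PEARL code: the helper `best_matches` and the threshold of 85 are hypothetical choices for illustration, and the scores are copied from the expected outputs listed above so the snippet runs without downloading the model.

```python
from typing import List, Optional, Tuple


def best_matches(queries: List[str],
                 docs: List[str],
                 scores: List[List[float]],
                 threshold: float) -> List[Tuple[str, Optional[str], float]]:
    """For each query, return (query, best doc or None, best score).

    scores[i][j] is the similarity between queries[i] and docs[j],
    e.g. the `scores.tolist()` value from the usage example.
    `threshold` is a hypothetical cut-off; tune it on your own data.
    """
    results = []
    for query, row in zip(queries, scores):
        # Index of the highest-scoring candidate for this query
        best_idx = max(range(len(row)), key=row.__getitem__)
        best_score = row[best_idx]
        # Keep the match only if it clears the threshold
        match = docs[best_idx] if best_score >= threshold else None
        results.append((query, match, best_score))
    return results


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
# Scores copied from the expected outputs of the usage example
scores = [[90.56318664550781, 79.65763854980469, 75.52054595947266]]

matches = best_matches(query_texts, doc_texts, scores, threshold=85.0)
print(matches)
# [('The New York Times', 'NYTimes', 90.56318664550781)]
```

With a stricter threshold such as 95, the same query would be left unmatched (`None`), which is the usual way to avoid forcing a join between unrelated strings.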

	## Training and Evaluation
	The training and evaluation code is available on [GitHub](https://github.com/tigerchen52/PEARL).



	## Citation

	If you find our work useful, please cite our paper:

	```
	@article{chen2024learning,
	title={Learning High-Quality and General-Purpose Phrase Representations},
	author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
	journal={arXiv preprint arXiv:2401.10407},
	year={2024}
	}
	```