---
license: mit
language:
- en
pipeline_tag: token-classification
inference: false
tags:
- token-classification
- entity-recognition
- foundation-model
- feature-extraction
- RoBERTa
- generic
datasets:
- numind/NuNER
---
# Entity Recognition English Foundation Model by NuMind 🔥
This model provides high-quality token embeddings for the Entity Recognition task in English.
We suggest using the **newer version of this model: [NuNER v2.0](https://huggingface.co/numind/NuNER-v2.0)**
**Check out other models by NuMind:**
* SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
* SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)
## About
[RoBERTa-base](https://huggingface.co/roberta-base) fine-tuned on [NuNER data](https://huggingface.co/datasets/numind/NuNER).
**Metrics:**
Read more about the evaluation protocol and datasets in our [paper](https://arxiv.org/abs/2402.15343) and [blog post](https://www.numind.ai/blog/a-foundation-model-for-entity-recognition).
We suggest using the **newer version of this model: [NuNER v2.0](https://huggingface.co/numind/NuNER-v2.0)**
| Model | k=1 | k=4 | k=16 | k=64 |
|----------|----------|----------|----------|----------|
| RoBERTa-base | 24.5 | 44.7 | 58.1 | 65.4 |
| RoBERTa-base + NER-BERT pre-training | 32.3 | 50.9 | 61.9 | 67.6 |
| NuNER v0.1 | 34.3 | 54.6 | 64.0 | 68.7 |
| NuNER v1.0 | 39.4 | 59.6 | 67.8 | 71.5 |
| **NuNER v2.0** | **43.6** | **61.0** | **68.2** | **72.0** |
## Usage
Embeddings can be used out of the box or fine-tuned on specific datasets.
Get embeddings:
```python
import torch
import transformers

# Load the encoder and ask it to return the hidden states of every layer.
model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v0.1',
    output_hidden_states=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-v0.1'
)

text = [
    "NuMind is an AI company based in Paris and USA.",
    "See other models from us on https://huggingface.co/numind"
]
# Tokenize the batch, padding to the longest sentence.
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)
output = model(**encoded_input)

# for better quality: concatenate the last layer with an intermediate layer
emb = torch.cat(
    (output.hidden_states[-1], output.hidden_states[-7]),
    dim=2
)

# for better speed: use the last layer only
# emb = output.hidden_states[-1]
```
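To illustrate the fine-tuning route, here is a minimal sketch of a token classification head trained on top of the embeddings. It is not the setup used in the paper. The class `TokenClassifierHead`, the helper `masked_token_loss`, the label count, and the random tensors standing in for `emb`, the attention mask, and the labels are all assumptions made for illustration. The embedding size of 1536 assumes the two-layer concatenation above (2 x 768 for a RoBERTa-base encoder).

```python
import torch
from torch import nn


class TokenClassifierHead(nn.Module):
    """Hypothetical linear head mapping each token embedding to label logits."""

    def __init__(self, emb_dim: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(emb_dim, num_labels)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, emb_dim) -> logits: (batch, seq_len, num_labels)
        return self.linear(emb)


def masked_token_loss(logits, labels, attention_mask):
    """Cross-entropy over real tokens only; padding positions are ignored."""
    labels = labels.masked_fill(attention_mask == 0, -100)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )


torch.manual_seed(0)
batch, seq_len, emb_dim, num_labels = 2, 6, 1536, 3

# Stand-ins (assumptions): in practice `emb` comes from the encoder above,
# `attention_mask` from the tokenizer, and `labels` from your annotated data.
emb = torch.randn(batch, seq_len, emb_dim)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0, 0]])
labels = torch.randint(0, num_labels, (batch, seq_len))

head = TokenClassifierHead(emb_dim, num_labels)
logits = head(emb)
loss = masked_token_loss(logits, labels, attention_mask)
loss.backward()  # gradients reach the head; with a live encoder they would reach it too
```

If only the last layer is used (`emb = output.hidden_states[-1]`), `emb_dim` would be 768 instead.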
## Citation
```
@misc{bogdanov2024nuner,
title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
year={2024},
eprint={2402.15343},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```