---
license: mit
language:
- en
pipeline_tag: token-classification
inference: false
tags:
- token-classification
- entity-recognition
- foundation-model
- feature-extraction
- RoBERTa
- generic
datasets:
- numind/NuNER
---

# Entity Recognition English Foundation Model by NuMind 🔥

This model provides high-quality token embeddings for the entity recognition task in English.

We suggest using the **newer version of this model: [NuNER v2.0](https://huggingface.co/numind/NuNER-v2.0)**.

**Check out other models by NuMind:**
* SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
* SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

## About

[RoBERTa-base](https://huggingface.co/roberta-base) fine-tuned on [NuNER data](https://huggingface.co/datasets/numind/NuNER).

**Metrics:**

Read more about evaluation protocol & datasets in our [paper](https://arxiv.org/abs/2402.15343) and [blog post](https://www.numind.ai/blog/a-foundation-model-for-entity-recognition).

| Model | k=1 | k=4 | k=16 | k=64 |
|----------|----------|----------|----------|----------|
| RoBERTa-base | 24.5 | 44.7 | 58.1 | 65.4 |
| RoBERTa-base + NER-BERT pre-training | 32.3 | 50.9 | 61.9 | 67.6 |
| NuNER v0.1 | 34.3 | 54.6 | 64.0 | 68.7 |
| NuNER v1.0 | 39.4 | 59.6 | 67.8 | 71.5 |
| **NuNER v2.0** | **43.6** | **61.0** | **68.2** | **72.0** |


## Usage

Embeddings can be used out of the box or fine-tuned on a specific dataset; a fine-tuning sketch follows the examples below.

Get embeddings:


```python
import torch
import transformers


model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v0.1',
    output_hidden_states=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-v0.1'
)

text = [
    "NuMind is an AI company based in Paris and USA.",
    "See other models from us on https://huggingface.co/numind"
]
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)
output = model(**encoded_input)

# For better quality: concatenate the last hidden layer with the seventh
# layer from the end, giving one embedding of size 2 * hidden_size per token.
emb = torch.cat(
    (output.hidden_states[-1], output.hidden_states[-7]),
    dim=2
)

# For better speed: use only the last hidden layer (hidden_size per token).
# emb = output.hidden_states[-1]
```
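
The tokenizer operates on subwords, so a single word can map to several embedding rows. Continuing from the snippet above (it reuses `encoded_input` and `emb`), the sketch below mean-pools subword vectors into one vector per word. Mean pooling and the helper logic are illustrative assumptions, not something the card prescribes; `word_ids()` requires a fast tokenizer, which `AutoTokenizer` loads by default for this model.

```python
# Sketch: pool subword embeddings into one vector per word via word_ids().
# Mean pooling is an assumption for illustration; reuses encoded_input and
# emb from the example above.
import torch

sentence_idx = 0
word_ids = encoded_input.word_ids(batch_index=sentence_idx)

word_embeddings = []
current_word, current_vectors = None, []
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:  # special tokens (<s>, </s>) and padding
        continue
    if word_id != current_word and current_vectors:
        word_embeddings.append(torch.stack(current_vectors).mean(dim=0))
        current_vectors = []
    current_word = word_id
    current_vectors.append(emb[sentence_idx, token_idx])
if current_vectors:
    word_embeddings.append(torch.stack(current_vectors).mean(dim=0))

# One 1536-dimensional vector per word of the first sentence.
print(len(word_embeddings), word_embeddings[0].shape)
```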

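Fine-tuning on a specific dataset typically means training a small token-classification head on top of these embeddings. The sketch below is a minimal, self-contained illustration, not part of the original card: the label set, the single training example, and the optimizer settings are hypothetical placeholders.

```python
# Minimal fine-tuning sketch (illustrative assumptions throughout): a linear
# head over the concatenated two-layer embeddings shown above.
import torch
import transformers

model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v0.1',
    output_hidden_states=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained('numind/NuNER-v0.1')

num_labels = 3  # hypothetical tag set, e.g. O / B-ORG / I-ORG
head = torch.nn.Linear(2 * model.config.hidden_size, num_labels)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(head.parameters()),
    lr=2e-5
)

# One hypothetical training example; real labels would come from your dataset.
encoded = tokenizer(
    "NuMind is an AI company based in Paris and USA.",
    return_tensors='pt',
    truncation=True
)
# Token-level labels aligned to the tokenized sequence; -100 marks positions
# (special tokens) that are excluded from the loss.
labels = torch.full(encoded['input_ids'].shape, -100)
labels[0, 1] = 1  # pretend the first real token is tagged as an entity

output = model(**encoded)
emb = torch.cat(
    (output.hidden_states[-1], output.hidden_states[-7]),
    dim=2
)
logits = head(emb)

loss = torch.nn.functional.cross_entropy(
    logits.view(-1, num_labels),
    labels.view(-1),
    ignore_index=-100
)
loss.backward()
optimizer.step()
```

In practice you would batch real annotated examples and repeat this step, backpropagating through both the head and the encoder as the single step above does.
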
## Citation
```
@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data}, 
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```