---
license: mit
language:
- en
- fr
- de
- it
- es
- pt
- pl
- nl
- ru
pipeline_tag: token-classification
inference: false
tags:
- token-classification
- entity-recognition
- foundation-model
- feature-extraction
- mBERT
- Multilingual BERT
- BERT
- generic
---
# SOTA Entity Recognition Multilingual Foundation Model by NuMind 🔥
This model provides the best embeddings for the entity recognition task and supports 9+ languages.
**Check out other models by NuMind:**
* SOTA Entity Recognition Foundation Model in English: [link](https://huggingface.co/numind/entity-recognition-general-sota-v1)
* SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)
## About
[Multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) fine-tuned on an artificially annotated multilingual subset of the [Oscar dataset](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201). This model provides domain- and language-independent embeddings for the entity recognition task. We fine-tuned it on only 9 languages, but the model can generalize to other languages supported by Multilingual BERT.
**Metrics:**
Read more about the evaluation protocol & datasets in our [blog post](https://www.numind.ai/blog/a-foundation-model-for-entity-recognition).
| Model | F1 macro |
|----------|----------|
| bert-base-multilingual-cased | 0.5206 |
| ours | 0.5892 |
| ours + two emb | 0.6231 |

Here, "two emb" refers to concatenating embeddings from two hidden layers of the encoder, as shown in the usage example below.
## Usage
Embeddings can be used out of the box or fine-tuned on specific datasets (a fine-tuning sketch follows the example below).
Get embeddings:
```python
import torch
import transformers
# Load the encoder and tokenizer; hidden states from all layers are
# needed for the two-emb trick below.
model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-multilingual-v0.1',
    output_hidden_states=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-multilingual-v0.1',
)
text = [
    "NuMind is an AI company based in Paris and USA.",
    "NuMind est une entreprise d'IA basée à Paris et aux États-Unis.",
    "See other models from us on https://huggingface.co/numind"
]
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)
# Inference only: no gradients needed when extracting embeddings.
with torch.no_grad():
    output = model(**encoded_input)
# Two-emb trick for better quality: concatenate the last hidden layer
# with the 7th-from-last layer (shape: batch x seq_len x 1536).
emb = torch.cat(
    (output.hidden_states[-1], output.hidden_states[-7]),
    dim=2
)
# Single emb for better speed:
# emb = output.hidden_states[-1]
```
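Fine-tuning typically means training a token-classification head on top of these embeddings. Below is a minimal sketch, not an official recipe: the 3-tag label scheme, dummy labels, and learning rate are placeholder assumptions for illustration.

```python
import torch
import transformers

# Hypothetical label scheme for illustration, e.g. O, B-ENT, I-ENT.
NUM_LABELS = 3

backbone = transformers.AutoModel.from_pretrained(
    'numind/NuNER-multilingual-v0.1',
    output_hidden_states=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-multilingual-v0.1',
)

# Linear head over the concatenated two-emb representation
# (2 x hidden_size = 1536 features per token).
head = torch.nn.Linear(2 * backbone.config.hidden_size, NUM_LABELS)

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()),
    lr=2e-5,  # placeholder learning rate
)

# Single toy training step; a real run would iterate over a labeled dataset.
encoded = tokenizer(
    ["NuMind is an AI company based in Paris and USA."],
    return_tensors='pt',
    padding=True,
    truncation=True,
)
# Dummy per-token labels aligned to the tokenized sequence (placeholder;
# in practice, special and padding tokens would be masked with -100).
labels = torch.zeros_like(encoded['input_ids'])

output = backbone(**encoded)
emb = torch.cat((output.hidden_states[-1], output.hidden_states[-7]), dim=2)
logits = head(emb)

loss = torch.nn.functional.cross_entropy(
    logits.view(-1, NUM_LABELS),
    labels.view(-1),
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

For speed, the same head can instead be trained on the single-layer embedding (`output.hidden_states[-1]`, 768 features per token), mirroring the two variants in the metrics table above.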
## Citation
```bibtex
@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```