---
language:
- english
thumbnail:
tags:
- token classification
license:
datasets:
- EMBO/sd-panels
metrics:
-
---

# sd-ner

## Model description

This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling objective on a compendium of English scientific text from the life sciences, the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-panels](https://huggingface.co/datasets/EMBO/sd-panels) dataset to perform Named Entity Recognition (NER) of bioentities.

## Intended uses & limitations

#### How to use

The intended use of this model is Named Entity Recognition of the biological entities used in SourceData annotations (https://sourcedata.embo.org), including small molecules, gene products (genes and proteins), subcellular components, cell lines and cell types, organs and tissues, species, and experimental methods.
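
These categories correspond to the entity types in the model's IOB2 tag set (listed in full under Training procedure below):

```python
# Entity categories and the corresponding tags predicted by the model.
# Each type X is emitted as B-X / I-X, plus O for tokens outside any entity.
ENTITY_TYPES = {
    "small molecules": "SMALL_MOLECULE",
    "gene products (genes and proteins)": "GENEPROD",
    "subcellular components": "SUBCELLULAR",
    "cell lines and cell types": "CELL",
    "organs and tissues": "TISSUE",
    "species": "ORGANISM",
    "experimental methods": "EXP_ASSAY",
}
```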

To quickly check the model:

```python
from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification

# Example figure panel legend, delimited by the <s> and </s> special tokens
example = """<s> F. Western blot of input and eluates of Upf1 domains purification in a Nmd4-HA strain. The band with the # might corresponds to a dimer of Upf1-CH, bands marked with a star correspond to residual signal with the anti-HA antibodies (Nmd4). Fragments in the eluate have a smaller size because the protein A part of the tag was removed by digestion with the TEV protease. G6PDH served as a loading control in the input samples </s>"""

# The model must be used with the roberta-base tokenizer (see Limitations below)
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
model = RobertaForTokenClassification.from_pretrained('EMBO/sd-ner')

ner = pipeline('ner', model=model, tokenizer=tokenizer)
res = ner(example)
for r in res:
    print(r['word'], r['entity'])
```
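
Note that the pipeline returns one prediction per sub-word token. In recent versions of transformers, the token classification pipeline can merge sub-words into whole entities; this is an optional refinement of the snippet above, not part of the original recipe:

```python
# Optional: group sub-word tokens into whole entities; results then carry
# an 'entity_group' key instead of 'entity'.
ner_grouped = pipeline('ner', model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
```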

#### Limitations and bias

The model must be used with the `roberta-base` tokenizer.

## Training data

The model was trained for token classification using the [EMBO/sd-panels dataset](https://huggingface.co/datasets/EMBO/sd-panels), which includes manually annotated examples.

## Training procedure

The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

- Command: `python -m tokcl.train /data/json/sd_panels NER --num_train_epochs=3.5`
- Tokenizer vocab size: 50265
- MLM pre-training data: EMBO/biolang
- Training with 31410 examples.
- Evaluating on 8861 examples.
- Training on 15 features (IOB2 labels): O, I-SMALL_MOLECULE, B-SMALL_MOLECULE, I-GENEPROD, B-GENEPROD, I-SUBCELLULAR, B-SUBCELLULAR, I-CELL, B-CELL, I-TISSUE, B-TISSUE, I-ORGANISM, B-ORGANISM, I-EXP_ASSAY, B-EXP_ASSAY
- Epochs: 3.5
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `learning_rate`: 0.0001
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
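
For orientation, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows. This is a minimal sketch, not the actual training entry point (which is `tokcl.train` in the soda-roberta repository), and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="sd-ner",            # placeholder, not the original path
    num_train_epochs=3.5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
)
```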

## Eval results

On the test set, with `sklearn.metrics`:

```
                precision    recall  f1-score   support

          CELL       0.77      0.81      0.79      3477
     EXP_ASSAY       0.71      0.70      0.71      7049
      GENEPROD       0.86      0.90      0.88     16140
      ORGANISM       0.80      0.82      0.81      2759
SMALL_MOLECULE       0.78      0.82      0.80      4446
   SUBCELLULAR       0.71      0.75      0.73      2125
        TISSUE       0.70      0.75      0.73      1971

     micro avg       0.79      0.82      0.81     37967
     macro avg       0.76      0.79      0.78     37967
  weighted avg       0.79      0.82      0.81     37967
```
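
A report of this shape can be produced with `sklearn.metrics.classification_report`. A minimal sketch, assuming `y_true` and `y_pred` are flat, per-token lists of entity labels (placeholders here; the actual evaluation code lives in the soda-roberta repository):

```python
from sklearn.metrics import classification_report

# Placeholder per-token labels; replace with real references and predictions
y_true = ["GENEPROD", "O", "CELL"]
y_pred = ["GENEPROD", "O", "TISSUE"]

# Restricting `labels` to entity classes (excluding "O") yields a report
# with micro/macro/weighted averages, as in the table above
labels = ["CELL", "EXP_ASSAY", "GENEPROD", "ORGANISM",
          "SMALL_MOLECULE", "SUBCELLULAR", "TISSUE"]
print(classification_report(y_true, y_pred, labels=labels))
```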