---
language:
- en
license: cc-by-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- EMBO/SourceData
metrics:
- precision
- recall
- f1
widget:
- text: Comparison of ENCC-derived neurospheres treated with intestinal extract
    from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation
    (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain
    number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate
    PGP9.5+.
- text: 'Histochemical (H & E) immunostaining (red) show T (CD3+) neutrophil
    (Ly6b+) infiltration in skin of mice in (A). Scale bar, 100 μm. (of CD3
    Ly6b immunostaining from CsA treated mice represent seperate analyses performed
    on serial thin sections.) of epidermal thickness, T (CD3+) neutrophil (Ly6b+)
    infiltration (red) in skin thin sections from (C), (n = 6). Data
    information: Data represent mean ± SD. * P < 0.05, * * P < 0.01 by two
    -Mann-Whitney; two independent experiments.'
- text: 'C African green monkey kidney epithelial (Vero) were transfected with NC,
    siMLKL, or miR-324-5p for 48 h. qPCR for expression of MLKL. Data information:
    data are represented as means ± SD of three biological replicates. Statistical
    analyses were performed using unpaired Student '' s t -. experiments were performed
    at least three times, representative data are shown.'
- text: (F) Binding between FTCD p47 between p47 p97 is necessary for mitochondria
    aggregation mediated by FTCDwt-HA-MAO. HeLa Tet-off inducibly expressing
    FTCDwt-HA-MAO were transfected with mammalian expression constructs of
    siRNA-insensitive Flag-tagged p47wt / mutants at same time as treatment of p47
    siRNA, cultured for 24 hrs. were further cultured in DOX-free medium for 48 hrs
    for induction of FTCD-HA-MAO. After fixation, were visualized with a monoclonal
    antibody to mitochondria polyclonal antibodies to HA Flag. Panels a-l display
    representative. Scale bar = 10 μm. (G) Binding between FTCD p97 is necessary
    for mitochondria aggregation mediated by FTCDwt-HA-MAO. HeLa Tet-off inducibly
    expressing FTCDwt-HA-MAO were transfected with mammalian expression construct
    of siRNA-insensitive Flag-tagged p97wt / mutant at same time as treatment
    with p97 siRNA. following procedures were same as in (F). Panels a-i display
    representative. Scale bar = 10 μm. (H) results of of (F) (G). Results
    are shown as mean ± SD of five sets of independent experiments, with 100 counted
    in each group in each independent experiment. Asterisks indicate a significant
    difference at P < 0.01 compared with siRNA treatment alone ('none') compared
    with mutant expression (Bonferroni method).
- text: (b) Parkin is recruited selectively to depolarized mitochondria directs
    mitophagy. HeLa transfected with HA-Parkin were treated with CCCP for indicated
    times. Mitochondria were stained by anti-TOM20 (pseudo coloured; blue) a
    ΔΨm dependent MitoTracker (red). Parkin was stained with anti-HA (green).
    Without treatment, mitochondria are intact stained by both mitochondrial
    markers, whereas Parkin is equally distributed in cytoplasm. After 2 h of CCCP
    treatment, mitochondria are depolarized as shown by loss of MitoTracker. Parkin
    completely translocates to mitochondria clustering at perinuclear regions. After
    24h of CCCP treatment, massive loss of mitochondria is observed as shown by
    disappearance of mitochondrial marker. Only Parkin-positive show mitochondrial
    clustering clearance, in contrast to adjacent untransfected. Scale bars, 10
    μm.
pipeline_tag: token-classification
base_model: bert-base-uncased
model-index:
- name: SpanMarker with bert-base-uncased on SourceData
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: SourceData
      type: EMBO/SourceData
      split: test
    metrics:
    - type: f1
      value: 0.8336481983993405
      name: F1
    - type: precision
      value: 0.8345368269032392
      name: Precision
    - type: recall
      value: 0.8327614603348888
      name: Recall
---

# SpanMarker with bert-base-uncased on SourceData

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [SourceData](https://huggingface.co/datasets/EMBO/SourceData) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [SourceData](https://huggingface.co/datasets/EMBO/SourceData)
- **Language:** en
- **License:** cc-by-4.0
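
To make the 8-word entity limit concrete, the sketch below enumerates every candidate span of at most `max_entity_length` words in a sentence. It only illustrates what the limit implies (an entity longer than 8 words cannot be returned as a single span) and is not SpanMarker's actual implementation; the function name `candidate_spans` is hypothetical.

```python
def candidate_spans(words, max_entity_length=8):
    """Yield (start, end) word indices, end exclusive, for spans of 1..max_entity_length words."""
    n = len(words)
    for start in range(n):
        # The span may not run past the sentence end or exceed the length limit
        for end in range(start + 1, min(start + max_entity_length, n) + 1):
            yield start, end


words = "Nuclei were stained blue with DAPI".split()
spans = list(candidate_spans(words, max_entity_length=8))
print(len(spans))  # 21: every span of a 6-word sentence fits within the limit

# Longer sentences grow linearly in candidates instead of quadratically
print(sum(1 for _ in candidate_spans(["w"] * 71)))  # 540
```

Capping the span length keeps the number of candidates roughly proportional to sentence length, at the cost of missing any entity longer than the cap.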

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label          | Examples                                                |
|:---------------|:--------------------------------------------------------|
| CELL_LINE      | "293T", "WM266.4 451Lu", "501mel"                       |
| CELL_TYPE      | "BMDMs", "protoplasts", "epithelial"                    |
| DISEASE        | "melanoma", "lung metastasis", "breast prostate cancer" |
| EXP_ASSAY      | "interactions", "Yeast two-hybrid", "BiFC"              |
| GENEPROD       | "CPL1", "FREE1 CPL1", "FREE1"                           |
| ORGANISM       | "Arabidopsis", "yeast", "seedlings"                     |
| SMALL_MOLECULE | "polyacrylamide", "CHX", "SDS polyacrylamide"           |
| SUBCELLULAR    | "proteasome", "D-bodies", "plasma"                      |
| TISSUE         | "Colon", "roots", "serum"                               |

## Evaluation

### Metrics
| Label          | Precision | Recall | F1     |
|:---------------|:----------|:-------|:-------|
| **all**        | 0.8345    | 0.8328 | 0.8336 |
| CELL_LINE      | 0.9060    | 0.8866 | 0.8962 |
| CELL_TYPE      | 0.7365    | 0.7746 | 0.7551 |
| DISEASE        | 0.6204    | 0.6531 | 0.6363 |
| EXP_ASSAY      | 0.7224    | 0.7096 | 0.7160 |
| GENEPROD       | 0.8944    | 0.8960 | 0.8952 |
| ORGANISM       | 0.8752    | 0.8902 | 0.8826 |
| SMALL_MOLECULE | 0.8304    | 0.8223 | 0.8263 |
| SUBCELLULAR    | 0.7859    | 0.7699 | 0.7778 |
| TISSUE         | 0.8134    | 0.8056 | 0.8094 |
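
The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the unrounded overall values in this card's metadata and from two rows of the table; the helper name `f1_score` is ours, not part of SpanMarker.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded overall test-set precision and recall from the model-index metadata
overall = f1_score(0.8345368269032392, 0.8327614603348888)
print(round(overall, 4))  # 0.8336

# (precision, recall, reported F1) copied from the table; inputs are already rounded,
# so agreement is only expected to about three decimal places
rows = {
    "CELL_LINE": (0.9060, 0.8866, 0.8962),
    "DISEASE": (0.6204, 0.6531, 0.6363),
}
for label, (p, r, reported) in rows.items():
    print(label, abs(f1_score(p, r) - reported) < 1e-3)
```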

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
# Run inference
entities = model.predict("Comparison of ENCC-derived neurospheres treated with intestinal extract from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate PGP9.5+.")
```
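
Predictions usually need some post-processing before they are useful. The sketch below groups predicted spans by label and drops low-confidence ones. It assumes each prediction is a dict with `"span"`, `"label"` and `"score"` keys; the helper `group_entities`, the 0.5 threshold and the sample list are illustrative and are not real output of this model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group predicted spans by label, keeping those scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for the list returned by model.predict(...);
# the spans, labels and scores here are made up for illustration
example_entities = [
    {"span": "rats", "label": "ORGANISM", "score": 0.98},
    {"span": "PGP9.5", "label": "GENEPROD", "score": 0.95},
    {"span": "DAPI", "label": "SMALL_MOLECULE", "score": 0.91},
    {"span": "Nuclei", "label": "SUBCELLULAR", "score": 0.42},
]
print(group_entities(example_entities))
```

With real predictions, pass the `entities` list from the inference example in place of `example_entities`.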

### Downstream Use
You can fine-tune this model on your own dataset.

<details><summary>Click to expand</summary>

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-bert-base-uncased-sourcedata-finetuned")
```
</details>

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Set Metrics
| Training set          | Min | Mean    | Max  |
|:----------------------|:----|:--------|:-----|
| Sentence length       | 4   | 71.0253 | 2609 |
| Entities per sentence | 0   | 8.3186  | 162  |

### Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
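
To show what `lr_scheduler_type: linear` with a warmup ratio of 0.1 means in practice, here is a minimal sketch of a linear warmup followed by linear decay. It mirrors the usual shape of such a schedule and is not code from the training run. The total of 17187 steps is an estimate derived from the Training Results table (step 3000 at epoch 0.5237, over 3 epochs), and the function name is hypothetical.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup from 0 to base_lr, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


# Assumption: roughly 3000 / 0.5237 steps per epoch, times 3 epochs
total_steps = 17187
for step in (0, 1000, 3000, 9000, 15000, total_steps):
    print(step, linear_schedule_lr(step, total_steps))
```

Under this estimate the peak rate of 5e-05 is reached after about 1719 steps, so every checkpoint listed in the results table falls in the decay phase.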

### Training Results
| Epoch  | Step  | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.5237 | 3000  | 0.0162          | 0.7972               | 0.8162            | 0.8065        | 0.9520              |
| 1.0473 | 6000  | 0.0155          | 0.8188               | 0.8251            | 0.8219        | 0.9560              |
| 1.5710 | 9000  | 0.0155          | 0.8213               | 0.8324            | 0.8268        | 0.9563              |
| 2.0946 | 12000 | 0.0163          | 0.8315               | 0.8347            | 0.8331        | 0.9581              |
| 2.6183 | 15000 | 0.0167          | 0.8303               | 0.8378            | 0.8340        | 0.9582              |

### Framework Versions

- Python: 3.9.16
- SpanMarker: 1.3.1.dev
- Transformers: 4.33.0
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.0
- Tokenizers: 0.13.2

## Citation

### BibTeX
```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->