---
library_name: transformers
license: apache-2.0
base_model: distilbert/distilbert-base-uncased
tags:
- generated_from_trainer
datasets:
- conll2003
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: base-NER
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: conll2003
      type: conll2003
      config: conll2003
      split: test
      args: conll2003
    metrics:
    - name: Precision
      type: precision
      value: 0.8845085098992705
    - name: Recall
      type: recall
      value: 0.9017351274787535
    - name: F1
      type: f1
      value: 0.8930387515342801
    - name: Accuracy
      type: accuracy
      value: 0.9782491655001615
---


# base-NER: A Named Entity Recognition (NER) Model

`base-NER` is a version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) fine-tuned on the CoNLL2003 dataset for **Named Entity Recognition (NER)**. It identifies people, organizations, locations, and miscellaneous entities in English text. The snippet below shows a quick start with the `transformers` pipeline:


```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned NER model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("eddiegulay/base-NER")
tokenizer = AutoTokenizer.from_pretrained("eddiegulay/base-NER")

# Each returned dict contains a token, its predicted entity tag, and a confidence score.
classifier = pipeline("ner", model=model, tokenizer=tokenizer)
result = classifier("My name is Edgar and I stay in Dar es Salaam")
print(result)
```


## Model Performance

The model achieved the following results on the CoNLL2003 test set:
- **Precision**: 0.8845
- **Recall**: 0.9017
- **F1-Score**: 0.8930
- **Accuracy**: 0.9782

The final validation loss at the end of training was 0.1129.
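
These are entity-level scores of the kind produced by the `seqeval` library, which is commonly used for CoNLL-style evaluation (the exact evaluation script is not part of this card, so treat this as an assumption). The sketch below shows how such scores are computed from gold and predicted tag sequences, using made-up examples rather than CoNLL2003 data:

```python
# pip install seqeval
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy IOB2 tag sequences (illustrative only, not drawn from CoNLL2003).
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "B-MISC"]]

# An entity counts as correct only if both its span and its type match the gold annotation.
print(precision_score(y_true, y_pred))  # entity-level precision
print(recall_score(y_true, y_pred))     # entity-level recall
print(f1_score(y_true, y_pred))         # entity-level F1
```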

## Model Description

This model leverages the DistilBERT architecture, a smaller and faster distillation of BERT that retains most of its accuracy. It is fine-tuned specifically for NER, making it well suited to entity extraction in general-domain text; for specialized domains such as finance or healthcare, see the limitations below.

## Intended Uses & Limitations

**Intended Uses**:  
- Recognizing names of people, organizations, locations, and miscellaneous entities in English text.
- Suitable for use in production applications where lightweight models are preferred due to memory or speed constraints.

**Limitations**:  
- The model is limited to English texts, as it was trained on the CoNLL2003 dataset.
- Performance may degrade when used on domain-specific entities not present in the CoNLL2003 dataset (e.g., technical or biomedical domains).
- May struggle with ambiguous or context-dependent entity classifications.

## Training and Evaluation Data

The model was trained on the **CoNLL2003** dataset, which contains named-entity annotations for English text. It is a widely used benchmark for NER and covers four entity types: **person**, **organization**, **location**, and **miscellaneous**.

### Dataset Configuration
- **Dataset**: CoNLL2003
- **Splits**: train for fine-tuning, validation for per-epoch evaluation, test for the reported metrics
- **Entity Types**: Person, Organization, Location, Miscellaneous
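
For reference, the dataset can be inspected directly from the Hugging Face Hub. This is a minimal sketch, assuming the public `conll2003` dataset; depending on your `datasets` version it may resolve to a Parquet mirror or require `trust_remote_code=True`.

```python
from datasets import load_dataset

# Illustrative inspection of the training data (not the original training script).
dataset = load_dataset("conll2003")
print(dataset)  # DatasetDict with train / validation / test splits

# The IOB2 tag set behind the four entity types (PER, ORG, LOC, MISC).
print(dataset["train"].features["ner_tags"].feature.names)
```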

## Training Procedure

The model was fine-tuned for 2 epochs with the Adam optimizer and a linear learning-rate scheduler; a minimal sketch of this setup with the `transformers` `Trainer` API follows the hyperparameter list below.

### Training Hyperparameters

The following hyperparameters were used during training:
- **Learning Rate**: 2e-5
- **Batch Size**: 16 (train and eval)
- **Seed**: 42
- **Optimizer**: Adam (betas=(0.9,0.999), epsilon=1e-8)
- **Scheduler**: Linear
- **Epochs**: 2
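
The original training script is not included in this card; the following is a minimal, self-contained sketch of how the hyperparameters above map onto the `Trainer` API. The label-alignment helper and the output directory are illustrative assumptions, not details taken from the actual run.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def tokenize_and_align(batch):
    # Tokenize pre-split words; label only the first sub-token of each word and
    # mark special tokens / continuation pieces with -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(row)
    return enc

tokenized = dataset.map(
    tokenize_and_align, batched=True, remove_columns=dataset["train"].column_names
)

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=len(label_names)
)

args = TrainingArguments(
    output_dir="base-NER",            # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    seed=42,
    eval_strategy="epoch",            # evaluate after each epoch, as in the results table
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```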

### Training Results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0595        | 1.0   | 878  | 0.1046          | 0.8676    | 0.8909 | 0.8791 | 0.9762   |
| 0.0319        | 2.0   | 1756 | 0.1129          | 0.8845    | 0.9017 | 0.8930 | 0.9782   |

## Usage Example

Beyond the quick-start snippet at the top of this card, the `ner` pipeline can also merge sub-word pieces into whole entity spans by passing an aggregation strategy:
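
```python
from transformers import pipeline

# aggregation_strategy="simple" groups sub-word tokens into complete entity spans.
classifier = pipeline("ner", model="eddiegulay/base-NER", aggregation_strategy="simple")

result = classifier("My name is Edgar and I stay in Dar es Salaam")
print(result)  # list of dicts with entity_group, word, score, and character offsets
```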


## Framework Versions

- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1

## Future Improvements

- Fine-tuning the model on additional, domain-specific datasets to broaden coverage beyond newswire text.
- Implementing entity recognition for additional entity types, including products, dates, and technical terms.
  