---
language: 
  - az
  - tr
tags:
  - NER
  - token-classification
  - Azerbaijani
  - Turkish
  - transformers
license: "mit"
datasets:
  - LocalDoc/azerbaijani-ner-dataset
metrics:
  - precision
  - recall
  - f1
base_model: "akdeniz27/bert-base-turkish-cased-ner"
pipeline_tag: "token-classification"
---

# Azeri-Turkish-BERT-NER

## Model Description

The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of the `bert-base-turkish-cased-ner` model for Named Entity Recognition (NER) tasks in the Azerbaijani and Turkish languages. This model builds upon a pre-trained Turkish BERT model and adapts it to perform NER tasks specifically for Azerbaijani data while preserving compatibility with Turkish entities.

The model can identify and classify named entities into a variety of categories, such as persons, organizations, locations, dates, and more, making it suitable for applications such as text extraction, entity recognition, and data processing in Azerbaijani and Turkish texts.

## Model Details

- **Base Model**: `akdeniz27/bert-base-turkish-cased-ner` (from the Hugging Face Hub)
- **Task**: Named Entity Recognition (NER)
- **Languages**: Azerbaijani, Turkish
- **Fine-Tuned On**: Custom Azerbaijani NER dataset
- **Input Text Format**: Plain text with tokenized words
- **Model Type**: BERT-based transformer for token classification

## Training Details

The model was fine-tuned with the Hugging Face `transformers` and `datasets` libraries. The key fine-tuning settings are summarized below, followed by a sketch of the corresponding `Trainer` setup:

- **Tokenizer**: `AutoTokenizer` from the `bert-base-turkish-cased-ner` model
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 128 (training and evaluation)
- **Learning Rate**: 2e-5
- **Number of Epochs**: 10
- **Weight Decay**: 0.005
- **Optimization Strategy**: Early stopping with a patience of 5 epochs based on the F1 metric
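
These settings map directly onto a standard `Trainer` workflow. The sketch below is a hedged reconstruction, not the published training script; `label_list`, `tokenized_train`, `tokenized_eval`, and `compute_metrics` are hypothetical placeholders for the dataset's BIO tag inventory, the preprocessed splits (see the alignment sketch in the next section), and a seqeval-style metric function.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base = "akdeniz27/bert-base-turkish-cased-ner"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(
    base,
    num_labels=len(label_list),       # label_list: the dataset's BIO tags (placeholder)
    ignore_mismatched_sizes=True,     # re-initialize the classification head for the new label set
)

args = TrainingArguments(
    output_dir="azeri-turkish-bert-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=10,
    weight_decay=0.005,
    evaluation_strategy="epoch",      # `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",       # requires compute_metrics to return an "f1" key
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,    # placeholder: tokenized and label-aligned train split
    eval_dataset=tokenized_eval,      # placeholder: tokenized and label-aligned validation split
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,  # placeholder: seqeval-based precision/recall/F1
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```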

### Training Dataset

The training dataset is a custom Azerbaijani NER dataset sourced from [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed to align tokens and NER tags accurately.
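
The alignment step itself is not published with this card; the snippet below is a minimal sketch of the standard Hugging Face recipe for propagating word-level tags to subword tokens, assuming the dataset exposes `tokens` and `ner_tags` columns (typical for NER datasets on the Hub).

```python
def tokenize_and_align_labels(batch):
    # The dataset provides pre-split words, hence is_split_into_words=True
    encoded = tokenizer(
        batch["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    aligned_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = encoded.word_ids(batch_index=i)
        labels, previous_word = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)                   # special tokens: ignored by the loss
            elif word_id != previous_word:
                labels.append(word_labels[word_id])   # first subword carries the word's tag
            else:
                labels.append(-100)                   # continuation subwords are ignored
            previous_word = word_id
        aligned_labels.append(labels)
    encoded["labels"] = aligned_labels
    return encoded

# Applied with datasets.map, e.g.:
# tokenized_train = raw_datasets["train"].map(tokenize_and_align_labels, batched=True)
```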

### Label Categories

The model supports the following entity categories (the complete BIO tag set can be read from the model configuration, as shown after the list):
- **Person (B-PERSON, I-PERSON)**
- **Location (B-LOCATION, I-LOCATION)**
- **Organization (B-ORGANISATION, I-ORGANISATION)**
- **Date (B-DATE, I-DATE)**
- **Time (B-TIME, I-TIME)**
- **Money (B-MONEY, I-MONEY)**
- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
- **Facility (B-FACILITY, I-FACILITY)**
- **Product (B-PRODUCT, I-PRODUCT)**
- **Art (B-ART, I-ART)**
- **Event (B-EVENT, I-EVENT)**
- **Law (B-LAW, I-LAW)**
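
The authoritative tag inventory ships with the model configuration and can be listed directly:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
print(config.id2label)  # index-to-tag mapping: "O" plus the B-/I- tags listed above
```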

### Training Metrics

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1    |
|-------|---------------|-----------------|-----------|--------|-------|
| 1     | 0.433100      | 0.306711        | 0.739000  | 0.693282 | 0.715412 |
| 2     | 0.292700      | 0.275796        | 0.781565  | 0.688937 | 0.732334 |
| 3     | 0.250600      | 0.275115        | 0.758261  | 0.709425 | 0.733031 |
| 4     | 0.233700      | 0.273087        | 0.756184  | 0.716277 | 0.735689 |
| 5     | 0.214800      | 0.278477        | 0.756051  | 0.710996 | 0.732832 |
| 6     | 0.199200      | 0.286102        | 0.755068  | 0.717012 | 0.735548 |
| 7     | 0.192800      | 0.297157        | 0.742326  | 0.725802 | 0.733971 |
| 8     | 0.178900      | 0.304510        | 0.743206  | 0.723930 | 0.733442 |
| 9     | 0.171700      | 0.313845        | 0.743145  | 0.725535 | 0.734234 |

### Category-Wise Evaluation Metrics

| Category      | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| ART           | 0.49      | 0.14   | 0.21     | 1988    |
| DATE          | 0.49      | 0.48   | 0.49     | 844     |
| EVENT         | 0.88      | 0.36   | 0.51     | 84      |
| FACILITY      | 0.72      | 0.68   | 0.70     | 1146    |
| LAW           | 0.57      | 0.64   | 0.60     | 1103    |
| LOCATION      | 0.77      | 0.79   | 0.78     | 8806    |
| MONEY         | 0.62      | 0.57   | 0.59     | 532     |
| ORGANISATION  | 0.64      | 0.65   | 0.64     | 527     |
| PERCENTAGE    | 0.77      | 0.83   | 0.80     | 3679    |
| PERSON        | 0.87      | 0.81   | 0.84     | 6924    |
| PRODUCT       | 0.82      | 0.80   | 0.81     | 2653    |
| TIME          | 0.55      | 0.50   | 0.52     | 1634    |

- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72

## Usage

### Loading the Model

To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")

# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Run NER
results = ner_pipeline(text)
print(results)
```

### Inputs and Outputs

- **Input**: Plain text in Azerbaijani or Turkish.
- **Output**: List of detected entities with entity types and character offsets.

Example output (illustrative; with `aggregation_strategy="simple"` the `B-`/`I-` prefixes are merged into a single entity group, and scores will vary):
```
[
  {'entity_group': 'PERSON', 'word': 'Shahla Khuduyeva', 'start': 0, 'end': 16, 'score': 0.98},
  {'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]
```

### Evaluation Metrics

The model was evaluated using precision, recall, and F1-score metrics as detailed in the training metrics section.
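
Entity-level scores of this kind can be reproduced with the `seqeval` package; the tag sequences below are purely illustrative.

```python
from seqeval.metrics import classification_report, f1_score

# One illustrative sentence: gold tags vs. predicted tags
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "I-ORGANISATION", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "O", "O"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```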

## Limitations

- The model may have limited performance on texts that diverge significantly from the training data distribution.
- Handling of rare or unseen entities in Turkish and Azerbaijani may result in lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.

## Model Card

A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).

## Citation

If you use this model, please consider citing:
```
@misc{azeri-turkish-bert-ner,
  author = {Ismat Samadov},
  title = {Azeri-Turkish-BERT-NER},
  year = {2024},
  howpublished = {Hugging Face repository},
}
```