---
language:
- az
- tr
tags:
- NER
- token-classification
- Azerbaijani
- Turkish
- transformers
license: mit
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
base_model: "akdeniz27/bert-base-turkish-cased-ner"
pipeline_tag: "token-classification"
---
# Azeri-Turkish-BERT-NER
## Model Description
The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of the `bert-base-turkish-cased-ner` model for Named Entity Recognition (NER) tasks in the Azerbaijani and Turkish languages. This model builds upon a pre-trained Turkish BERT model and adapts it to perform NER tasks specifically for Azerbaijani data while preserving compatibility with Turkish entities.
The model can identify and classify named entities into a variety of categories, such as persons, organizations, locations, dates, and more, making it suitable for applications such as text extraction, entity recognition, and data processing in Azerbaijani and Turkish texts.
## Model Details
- **Base Model**: `bert-base-turkish-cased-ner` (adapted from Hugging Face)
- **Task**: Named Entity Recognition (NER)
- **Languages**: Azerbaijani, Turkish
- **Fine-Tuned On**: Custom Azerbaijani NER dataset
- **Input Text Format**: Plain text with tokenized words
- **Model Type**: BERT-based transformer for token classification
## Training Details
The model was fine-tuned with the Hugging Face `transformers` and `datasets` libraries. A brief summary of the fine-tuning configuration (a sketch of the corresponding training setup follows this list):
- **Tokenizer**: `AutoTokenizer` from the `bert-base-turkish-cased-ner` model
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 128 (training and evaluation)
- **Learning Rate**: 2e-5
- **Number of Epochs**: 10
- **Weight Decay**: 0.005
- **Optimization Strategy**: Early stopping with a patience of 5 epochs based on the F1 metric
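The exact training script is not reproduced here; the snippet below is a minimal sketch of a `Trainer` setup using the hyperparameters listed above. Names such as `label_list`, `tokenized_dataset`, and `compute_metrics` are illustrative placeholders assumed to be defined by the preprocessing and evaluation code.
```python
from transformers import (AutoModelForTokenClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

# Hyperparameters taken from the list above; everything else is illustrative.
training_args = TrainingArguments(
    output_dir="azeri-turkish-bert-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=10,
    weight_decay=0.005,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # early stopping tracks the F1 metric
)

model = AutoModelForTokenClassification.from_pretrained(
    "akdeniz27/bert-base-turkish-cased-ner",
    num_labels=len(label_list),        # label_list: assumed BIO tag inventory
    ignore_mismatched_sizes=True,      # classification head is re-initialized
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],       # assumed pre-tokenized splits
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,                # assumed seqeval-based metrics
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```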
### Training Dataset
The training dataset is a custom Azerbaijani NER dataset sourced from [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed to align tokens and NER tags accurately.
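Aligning word-level NER tags with BERT subword tokens is typically done through the fast tokenizer's `word_ids()`. The helper below is a minimal sketch of that preprocessing step, not the exact script used; it assumes examples with `tokens` and `ner_tags` columns as in the LocalDoc dataset.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("akdeniz27/bert-base-turkish-cased-ner")

def tokenize_and_align_labels(examples, max_length=128):
    """Tokenize pre-split words and align NER tags to subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        labels = []
        for word_id in word_ids:
            if word_id is None:              # special tokens ([CLS], [SEP], padding)
                labels.append(-100)
            elif word_id != previous_word:   # first subword keeps the word's tag
                labels.append(tags[word_id])
            else:                            # later subwords are ignored by the loss
                labels.append(-100)
            previous_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

# Usage (assuming a datasets.DatasetDict loaded from the NER dataset):
# tokenized_dataset = raw_dataset.map(tokenize_and_align_labels, batched=True)
```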
### Label Categories
The model supports the following entity categories:
- **Person (B-PERSON, I-PERSON)**
- **Location (B-LOCATION, I-LOCATION)**
- **Organization (B-ORGANISATION, I-ORGANISATION)**
- **Date (B-DATE, I-DATE)**
- **Time (B-TIME, I-TIME)**
- **Money (B-MONEY, I-MONEY)**
- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
- **Facility (B-FACILITY, I-FACILITY)**
- **Product (B-PRODUCT, I-PRODUCT)**
- ... (additional categories as specified in the training label list)
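The complete tag inventory is stored in the published checkpoint's configuration, so it can be read from `id2label` rather than hard-coded (a minimal sketch):
```python
from transformers import AutoConfig

# Print the full BIO label inventory from the fine-tuned model's config.
config = AutoConfig.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```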
### Training Metrics
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|--------|-------|
| 1 | 0.433100 | 0.306711 | 0.739000 | 0.693282 | 0.715412 |
| 2 | 0.292700 | 0.275796 | 0.781565 | 0.688937 | 0.732334 |
| 3 | 0.250600 | 0.275115 | 0.758261 | 0.709425 | 0.733031 |
| 4 | 0.233700 | 0.273087 | 0.756184 | 0.716277 | 0.735689 |
| 5 | 0.214800 | 0.278477 | 0.756051 | 0.710996 | 0.732832 |
| 6 | 0.199200 | 0.286102 | 0.755068 | 0.717012 | 0.735548 |
| 7 | 0.192800 | 0.297157 | 0.742326 | 0.725802 | 0.733971 |
| 8 | 0.178900 | 0.304510 | 0.743206 | 0.723930 | 0.733442 |
| 9 | 0.171700 | 0.313845 | 0.743145 | 0.725535 | 0.734234 |
### Category-Wise Evaluation Metrics
| Category | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| ART | 0.49 | 0.14 | 0.21 | 1988 |
| DATE | 0.49 | 0.48 | 0.49 | 844 |
| EVENT | 0.88 | 0.36 | 0.51 | 84 |
| FACILITY | 0.72 | 0.68 | 0.70 | 1146 |
| LAW | 0.57 | 0.64 | 0.60 | 1103 |
| LOCATION | 0.77 | 0.79 | 0.78 | 8806 |
| MONEY | 0.62 | 0.57 | 0.59 | 532 |
| ORGANISATION | 0.64 | 0.65 | 0.64 | 527 |
| PERCENTAGE | 0.77 | 0.83 | 0.80 | 3679 |
| PERSON | 0.87 | 0.81 | 0.84 | 6924 |
| PRODUCT | 0.82 | 0.80 | 0.81 | 2653 |
| TIME | 0.55 | 0.50 | 0.52 | 1634 |
- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
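Per-category reports like the table above are conventionally produced with the `seqeval` library. The snippet below is an illustrative sketch of how such a report is generated from BIO-tagged predictions, not the exact evaluation script.
```python
from seqeval.metrics import classification_report, f1_score

# Toy BIO-tagged sequences; the real evaluation uses validation-split predictions.
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORGANISATION"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "O"]]

print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))
```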
## Usage
### Loading the Model
To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
# Run NER
results = ner_pipeline(text)
print(results)
```
### Inputs and Outputs
- **Input**: Plain text in Azerbaijani or Turkish.
- **Output**: List of detected entities with entity types and character offsets.
Example output:
```
[
{'entity_group': 'B-PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
{'entity_group': 'B-ORGANISATION', 'word': 'Pasha Sığorta', 'start': 11, 'end': 24, 'score': 0.95}
]
```
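Because each entity carries character offsets into the original string, the surface form can be recovered by slicing. A small sketch, reusing `text` and `results` from the loading example above:
```python
# Recover entity spans from the pipeline output via the character offsets.
for entity in results:
    span = text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']:15s} {span!r}  (score={entity['score']:.2f})")
```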
### Evaluation Metrics
The model was evaluated using precision, recall, and F1-score metrics as detailed in the training metrics section.
## Limitations
- The model may have limited performance on texts that diverge significantly from the training data distribution.
- Handling of rare or unseen entities in Turkish and Azerbaijani may result in lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.
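A simple mitigation for the lower-confidence predictions mentioned above is to filter pipeline output by score. A minimal sketch; the 0.80 threshold is an arbitrary illustration, not a recommendation derived from the evaluation:
```python
# Keep only predictions above an (illustrative) confidence threshold.
CONFIDENCE_THRESHOLD = 0.80
confident_entities = [e for e in ner_pipeline(text) if e["score"] >= CONFIDENCE_THRESHOLD]
```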
## Model Card
A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).
## Citation
If you use this model, please consider citing:
```
@misc{azeri-turkish-bert-ner,
author = {Ismat Samadov},
title = {Azeri-Turkish-BERT-NER},
year = {2024},
howpublished = {Hugging Face repository},
}
```