---
language:
- az
- tr
tags:
- NER
- token-classification
- Azerbaijani
- Turkish
- transformers
license: mit
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
base_model: "akdeniz27/bert-base-turkish-cased-ner"
pipeline_tag: "token-classification"
---
# Azeri-Turkish-BERT-NER
## Model Description
The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of the `bert-base-turkish-cased-ner` model for Named Entity Recognition (NER) tasks in the Azerbaijani and Turkish languages. This model builds upon a pre-trained Turkish BERT model and adapts it to perform NER tasks specifically for Azerbaijani data while preserving compatibility with Turkish entities.
The model can identify and classify named entities into a variety of categories, such as persons, organizations, locations, dates, and more, making it suitable for applications such as text extraction, entity recognition, and data processing in Azerbaijani and Turkish texts.
## Model Details
- **Base Model**: `bert-base-turkish-cased-ner` (adapted from Hugging Face)
- **Task**: Named Entity Recognition (NER)
- **Languages**: Azerbaijani, Turkish
- **Fine-Tuned On**: Custom Azerbaijani NER dataset
- **Input Text Format**: Plain text with tokenized words
- **Model Type**: BERT-based transformer for token classification
## Training Details
The model was fine-tuned with the Hugging Face `transformers` and `datasets` libraries. A brief summary of the fine-tuning configuration (a sketch of the corresponding training setup follows this list):
- **Tokenizer**: `AutoTokenizer` from the `bert-base-turkish-cased-ner` model
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 128 (training and evaluation)
- **Learning Rate**: 2e-5
- **Number of Epochs**: 10
- **Weight Decay**: 0.005
- **Optimization Strategy**: Early stopping with a patience of 5 epochs based on the F1 metric
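The exact training script is not reproduced here; the snippet below is a minimal sketch of a `Trainer` setup using the hyperparameters listed above. Names such as `label_list`, `tokenized_dataset`, and `compute_metrics` are illustrative placeholders assumed to be defined by the preprocessing and evaluation code.
```python
from transformers import (AutoModelForTokenClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

# Hyperparameters taken from the list above; everything else is illustrative.
training_args = TrainingArguments(
    output_dir="azeri-turkish-bert-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=10,
    weight_decay=0.005,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # early stopping tracks the F1 metric
)

model = AutoModelForTokenClassification.from_pretrained(
    "akdeniz27/bert-base-turkish-cased-ner",
    num_labels=len(label_list),        # label_list: assumed BIO tag inventory
    ignore_mismatched_sizes=True,      # classification head is re-initialized
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],       # assumed pre-tokenized splits
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,                # assumed seqeval-based metrics
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```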
### Training Dataset
The training dataset is a custom Azerbaijani NER dataset sourced from [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed to align tokens and NER tags accurately.
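Aligning word-level NER tags with BERT subword tokens is typically done through the fast tokenizer's `word_ids()`. The helper below is a minimal sketch of that preprocessing step, not the exact script used; it assumes examples with `tokens` and `ner_tags` columns as in the LocalDoc dataset.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("akdeniz27/bert-base-turkish-cased-ner")

def tokenize_and_align_labels(examples, max_length=128):
    """Tokenize pre-split words and align NER tags to subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        labels = []
        for word_id in word_ids:
            if word_id is None:              # special tokens ([CLS], [SEP], padding)
                labels.append(-100)
            elif word_id != previous_word:   # first subword keeps the word's tag
                labels.append(tags[word_id])
            else:                            # later subwords are ignored by the loss
                labels.append(-100)
            previous_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

# Usage (assuming a datasets.DatasetDict loaded from the NER dataset):
# tokenized_dataset = raw_dataset.map(tokenize_and_align_labels, batched=True)
```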
### Label Categories
The model supports the following entity categories:
- **Person (B-PERSON, I-PERSON)**
- **Location (B-LOCATION, I-LOCATION)**
- **Organization (B-ORGANISATION, I-ORGANISATION)**
- **Date (B-DATE, I-DATE)**
- **Time (B-TIME, I-TIME)**
- **Money (B-MONEY, I-MONEY)**
- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
- **Facility (B-FACILITY, I-FACILITY)**
- **Product (B-PRODUCT, I-PRODUCT)**
- ... (additional categories as specified in the training label list)
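The complete tag inventory is stored in the published checkpoint's configuration, so it can be read from `id2label` rather than hard-coded (a minimal sketch):
```python
from transformers import AutoConfig

# Print the full BIO label inventory from the fine-tuned model's config.
config = AutoConfig.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```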
### Training Metrics
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|--------|-------|
| 1 | 0.433100 | 0.306711 | 0.739000 | 0.693282 | 0.715412 |
| 2 | 0.292700 | 0.275796 | 0.781565 | 0.688937 | 0.732334 |
| 3 | 0.250600 | 0.275115 | 0.758261 | 0.709425 | 0.733031 |
| 4 | 0.233700 | 0.273087 | 0.756184 | 0.716277 | 0.735689 |
| 5 | 0.214800 | 0.278477 | 0.756051 | 0.710996 | 0.732832 |
| 6 | 0.199200 | 0.286102 | 0.755068 | 0.717012 | 0.735548 |
| 7 | 0.192800 | 0.297157 | 0.742326 | 0.725802 | 0.733971 |
| 8 | 0.178900 | 0.304510 | 0.743206 | 0.723930 | 0.733442 |
| 9 | 0.171700 | 0.313845 | 0.743145 | 0.725535 | 0.734234 |
### Category-Wise Evaluation Metrics
| Category | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| ART | 0.49 | 0.14 | 0.21 | 1988 |
| DATE | 0.49 | 0.48 | 0.49 | 844 |
| EVENT | 0.88 | 0.36 | 0.51 | 84 |
| FACILITY | 0.72 | 0.68 | 0.70 | 1146 |
| LAW | 0.57 | 0.64 | 0.60 | 1103 |
| LOCATION | 0.77 | 0.79 | 0.78 | 8806 |
| MONEY | 0.62 | 0.57 | 0.59 | 532 |
| ORGANISATION | 0.64 | 0.65 | 0.64 | 527 |
| PERCENTAGE | 0.77 | 0.83 | 0.80 | 3679 |
| PERSON | 0.87 | 0.81 | 0.84 | 6924 |
| PRODUCT | 0.82 | 0.80 | 0.81 | 2653 |
| TIME | 0.55 | 0.50 | 0.52 | 1634 |
- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
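Per-category reports like the table above are conventionally produced with the `seqeval` library. The snippet below is an illustrative sketch of how such a report is generated from BIO-tagged predictions, not the exact evaluation script.
```python
from seqeval.metrics import classification_report, f1_score

# Toy BIO-tagged sequences; the real evaluation uses validation-split predictions.
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORGANISATION"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "O"]]

print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))
```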
## Usage
### Loading the Model
To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
# Run NER
results = ner_pipeline(text)
print(results)
```
### Inputs and Outputs
- **Input**: Plain text in Azerbaijani or Turkish.
- **Output**: List of detected entities with entity types and character offsets.
Example output:
```
[
{'entity_group': 'B-PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
{'entity_group': 'B-ORGANISATION', 'word': 'Pasha Sığorta', 'start': 11, 'end': 24, 'score': 0.95}
]
```
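Because each entity carries character offsets into the original string, the surface form can be recovered by slicing. A small sketch, reusing `text` and `results` from the loading example above:
```python
# Recover entity spans from the pipeline output via the character offsets.
for entity in results:
    span = text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']:15s} {span!r}  (score={entity['score']:.2f})")
```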
### Evaluation Metrics
The model was evaluated using precision, recall, and F1-score metrics as detailed in the training metrics section.
## Limitations
- The model may have limited performance on texts that diverge significantly from the training data distribution.
- Handling of rare or unseen entities in Turkish and Azerbaijani may result in lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.
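A simple mitigation for the lower-confidence predictions mentioned above is to filter pipeline output by score. A minimal sketch; the 0.80 threshold is an arbitrary illustration, not a recommendation derived from the evaluation:
```python
# Keep only predictions above an (illustrative) confidence threshold.
CONFIDENCE_THRESHOLD = 0.80
confident_entities = [e for e in ner_pipeline(text) if e["score"] >= CONFIDENCE_THRESHOLD]
```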
## Model Card
A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).
## Citation
If you use this model, please consider citing:
```
@misc{azeri-turkish-bert-ner,
author = {Ismat Samadov},
title = {Azeri-Turkish-BERT-NER},
year = {2024},
howpublished = {Hugging Face repository},
}
```