Azeri-Turkish-BERT-NER

Model Description

Azeri-Turkish-BERT-NER is a fine-tuned version of bert-base-turkish-cased-ner for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a Turkish BERT NER checkpoint and adapts it to Azerbaijani data, with the aim of preserving compatibility with Turkish entities.

The model labels named entities such as persons, organizations, locations, and dates (see Label Categories below), which makes it suitable for information extraction and data processing over Azerbaijani and Turkish text.

Model Details

Base Model: bert-base-turkish-cased-ner (from the Hugging Face Hub)
Task: Named Entity Recognition (NER)
Languages: Azerbaijani, Turkish
Fine-Tuned On: Custom Azerbaijani NER dataset (sourced from LocalDoc/azerbaijani-ner-dataset)
Input Text Format: Plain text; the tokenizer splits it into subword tokens
Model Type: BERT-based transformer for token classification

Training Details

The model was fine-tuned with the Hugging Face transformers and datasets libraries. The fine-tuning configuration was:

Tokenizer: AutoTokenizer from the bert-base-turkish-cased-ner model
Max Sequence Length: 128 tokens
Batch Size: 128 (training and evaluation)
Learning Rate: 2e-5
Number of Epochs: 10 (maximum; the metrics table below ends at epoch 9, which is consistent with early stopping)
Weight Decay: 0.005
Early Stopping: patience of 5 epochs, monitored on the validation F1 score

Training Dataset

The training data is a custom Azerbaijani NER dataset sourced from LocalDoc/azerbaijani-ner-dataset. It was preprocessed so that every token is aligned with its NER tag.
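The exact preprocessing code is not given here. As a minimal sketch of one common way to align word-level tags with subword tokens, the function below assumes a list of word indices per token, such as the one returned by the word_ids() method of a fast tokenizer's encoding. The function name, the example ids, and the choice to label only the first subword of each word are assumptions for illustration, not the author's confirmed method.

```python
from typing import List, Optional

# -100 is the label id that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword tokens.

    word_ids has one entry per subword token: the index of the source word,
    or None for special tokens such as [CLS], [SEP], and padding.
    Only the first subword of each word keeps the word's label.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous_word = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [1, 2, 0]
aligned_example = align_labels(example_word_ids, example_labels)
print(aligned_example)  # [-100, 1, 2, -100, 0, -100]
```

Masking continuation subwords keeps the loss and the reported metrics at word level; labeling every subword is an equally common alternative.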

Label Categories

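The model uses BIO tagging: a B- tag marks the first token of an entity, and an I- tag marks each following token of the same entity. It supports the following entity categories: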

Person (B-PERSON, I-PERSON)
Location (B-LOCATION, I-LOCATION)
Organization (B-ORGANISATION, I-ORGANISATION)
Date (B-DATE, I-DATE)
Time (B-TIME, I-TIME)
Money (B-MONEY, I-MONEY)
Percentage (B-PERCENTAGE, I-PERCENTAGE)
Facility (B-FACILITY, I-FACILITY)
Product (B-PRODUCT, I-PRODUCT)
... (additional categories as specified in the training label list)
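To illustrate how B- and I- tags combine into entities, here is a small sketch that groups a tag sequence into spans. The function name and the example tags are hypothetical; the tags are written by hand and are not model output.

```python
def bio_to_spans(tags):
    """Group BIO tags into (category, start, end) spans; end is exclusive.

    An I- tag that does not continue an open entity of the same category
    is dropped (a strict reading of the BIO scheme).
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current == tag[2:]:
            continue  # still inside the current entity
        if current is not None:
            spans.append((current, start, i))  # close the open entity
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]        # open a new entity
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hand-written tags for six tokens (illustrative only).
example_tags = ["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "I-ORGANISATION", "O"]
example_spans = bio_to_spans(example_tags)
print(example_spans)  # [('PERSON', 0, 2), ('ORGANISATION', 3, 5)]
```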

Training Metrics

Epoch	Training Loss	Validation Loss	Precision	Recall	F1
1	0.433100	0.306711	0.739000	0.693282	0.715412
2	0.292700	0.275796	0.781565	0.688937	0.732334
3	0.250600	0.275115	0.758261	0.709425	0.733031
4	0.233700	0.273087	0.756184	0.716277	0.735689
5	0.214800	0.278477	0.756051	0.710996	0.732832
6	0.199200	0.286102	0.755068	0.717012	0.735548
7	0.192800	0.297157	0.742326	0.725802	0.733971
8	0.178900	0.304510	0.743206	0.723930	0.733442
9	0.171700	0.313845	0.743145	0.725535	0.734234
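The table stops at epoch 9 although up to 10 epochs were configured. The sketch below replays an early-stopping rule with a patience of 5 on the F1 column above. It is a simplified stand-in for whatever callback was actually used, and the function name is hypothetical.

```python
# Validation F1 per epoch, copied from the table above.
f1_by_epoch = [0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
               0.735548, 0.733971, 0.733442, 0.734234]


def replay_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_score = float("-inf")
    best_epoch = None
    waited = 0  # consecutive epochs without a new best score
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = replay_early_stopping(f1_by_epoch, patience=5)
print(best_epoch, stop_epoch)  # 4 9
```

Under this rule the best F1 (0.735689) is reached at epoch 4, and five epochs without improvement end training after epoch 9, which matches the length of the table.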

Category-Wise Evaluation Metrics

Category	Precision	Recall	F1-Score	Support
ART	0.49	0.14	0.21	1988
DATE	0.49	0.48	0.49	844
EVENT	0.88	0.36	0.51	84
FACILITY	0.72	0.68	0.70	1146
LAW	0.57	0.64	0.60	1103
LOCATION	0.77	0.79	0.78	8806
MONEY	0.62	0.57	0.59	532
ORGANISATION	0.64	0.65	0.64	527
PERCENTAGE	0.77	0.83	0.80	3679
PERSON	0.87	0.81	0.84	6924
PRODUCT	0.82	0.80	0.81	2653
TIME	0.55	0.50	0.52	1634

Micro Average: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
Macro Average: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
Weighted Average: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
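As a consistency check, the macro and weighted averages can be recomputed from the per-category rows. This is a sketch using the two-decimal values from the table, so a recomputed weighted value may differ from the reported one in the last digit. The micro average cannot be rebuilt this way because it needs the underlying true-positive and prediction counts.

```python
# (precision, recall, f1, support) per category, copied from the table above.
rows = {
    "ART": (0.49, 0.14, 0.21, 1988),
    "DATE": (0.49, 0.48, 0.49, 844),
    "EVENT": (0.88, 0.36, 0.51, 84),
    "FACILITY": (0.72, 0.68, 0.70, 1146),
    "LAW": (0.57, 0.64, 0.60, 1103),
    "LOCATION": (0.77, 0.79, 0.78, 8806),
    "MONEY": (0.62, 0.57, 0.59, 532),
    "ORGANISATION": (0.64, 0.65, 0.64, 527),
    "PERCENTAGE": (0.77, 0.83, 0.80, 3679),
    "PERSON": (0.87, 0.81, 0.84, 6924),
    "PRODUCT": (0.82, 0.80, 0.81, 2653),
    "TIME": (0.55, 0.50, 0.52, 1634),
}


def macro_average(table):
    """Unweighted mean of precision, recall, and F1 across categories."""
    n = len(table)
    return tuple(sum(r[i] for r in table.values()) / n for i in range(3))


def weighted_average(table):
    """Mean of precision, recall, and F1 weighted by each category's support."""
    total = sum(r[3] for r in table.values())
    return tuple(sum(r[i] * r[3] for r in table.values()) / total for i in range(3))


macro = tuple(round(v, 2) for v in macro_average(rows))
weighted = weighted_average(rows)
print(macro)  # (0.68, 0.6, 0.62)
```

The gap between the macro F1 (0.62) and the micro F1 (0.74) reflects that the large categories (LOCATION, PERSON) score well, while several smaller ones (ART, DATE, EVENT, TIME) score much lower.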

Usage

Loading the Model

Load the model and tokenizer with the Hugging Face transformers library, then wrap them in an NER pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")

# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Run NER
results = ner_pipeline(text)
print(results)

Inputs and Outputs

Input: Plain text in Azerbaijani or Turkish.
Output: List of detected entities, each with an entity type, the matched text, character offsets, and a confidence score.

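Example output (illustrative; the actual words and scores depend on the model). With aggregation_strategy="simple", the pipeline merges subword tokens and reports entity_group without the B-/I- prefix; start and end are character offsets into the input text:

[
  {'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
  {'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]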
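Downstream code often filters entities by confidence and groups them by type. A minimal sketch of that post-processing follows; the function name, the 0.5 threshold, and the sample entities are assumptions for illustration, and the sample is hand-written rather than real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity texts by type, dropping entities below min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] < min_score:
            continue
        label = entity["entity_group"]
        # Strip a BIO prefix if one is present in the label.
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        grouped[label].append(entity["word"])
    return dict(grouped)


# Hand-written sample in the pipeline's output format (illustrative only).
sample_results = [
    {"entity_group": "PERSON", "word": "Shahla", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "start": 20, "end": 33, "score": 0.95},
]

grouped_sample = group_entities(sample_results)
print(grouped_sample)
```

Raising min_score trades recall for precision; a suitable value depends on the application and is best chosen on held-out data.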

Evaluation Metrics

The model was evaluated with precision, recall, and F1-score. Per-epoch results are listed under Training Metrics, and per-category results under Category-Wise Evaluation Metrics.

Limitations

Performance may drop on texts that diverge significantly from the training data distribution.
Performance varies widely by category: F1 ranges from 0.21 (ART) to 0.84 (PERSON) in the category-wise evaluation.
Rare or unseen entities in Turkish and Azerbaijani may receive lower confidence scores or be missed.
Further fine-tuning on larger and more diverse datasets may improve generalization.

Model Card

A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the Hugging Face model page.

Citation

If you use this model, please consider citing:

@misc{azeri-turkish-bert-ner,
  author = {Ismat Samadov},
  title = {Azeri-Turkish-BERT-NER},
  year = {2024},
  howpublished = {Hugging Face repository},
}
