---
language: "tr"
tags:
- "bert"
- "turkish"
- "text-classification"
license: "apache-2.0"
datasets:
- "custom"
metrics:
- "precision"
- "recall"
- "f1"
- "accuracy"
---

# BERT-based Organization Detection Model for Turkish Texts

## Model Description

This model fine-tunes `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's effort to analyze organizational presence in social media data.

## Model Architecture

- **Base Model:** BERT (dbmdz/bert-base-turkish-uncased)
- **Training Data:** Twitter data from 3,922 accounts that m3inference scored as likely organizations (organization score above 0.7). A human annotator labeled each account based on its user name, screen name, and description.

## Training Setup

- **Tokenization:** Hugging Face's AutoTokenizer, with sequences padded and truncated to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters:**
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01

## Hyperparameter Tuning

Tuning was performed with Optuna; the best trial used:
- **Learning rate:** 1.2323083424093641e-05
- **Batch size:** 32
- **Epochs:** 2

## Evaluation Metrics

- **Precision (organization class, validation set):** 0.94
- **Recall (organization class, validation set):** 0.95
- **F1-Score (macro average):** 0.95
- **Accuracy:** 0.95
- **Confusion Matrix (validation set):**
```
[[369,  22],
 [ 19, 375]]
```
- **Hand-coded sample of 1,000 accounts:**
  - **Precision:** 0.91
  - **F1-Score (macro average):** 0.947
  - **Confusion Matrix:**
```
[[936,   3],
 [  4,  31]]
```

## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("atsizelti/turkish_org_classifier_hand_coded")
tokenizer = AutoTokenizer.from_pretrained("atsizelti/turkish_org_classifier_hand_coded")

text = "Örnek metin buraya girilir."  # "Example text goes here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
```
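The snippet above returns a class index (0 or 1). The index-to-label mapping lives in the model's config and is not documented in this card; a higher-level `pipeline` call gives the same prediction with less boilerplate. A minimal sketch:

```python
from transformers import pipeline

# Convenience wrapper around the same model; the returned label string
# depends on the model's id2label config, which this card does not document.
classifier = pipeline(
    "text-classification",
    model="atsizelti/turkish_org_classifier_hand_coded",
)
print(classifier("Örnek metin buraya girilir."))  # e.g. [{'label': ..., 'score': ...}]
```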
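## Training Procedure (Illustrative Sketch)

The card describes the training pipeline but does not ship the training code or data. The sketch below shows how the setup described above (AutoTokenizer with a 128-token maximum, 80/20 split, and the listed training parameters) could be wired together with the Hugging Face `Trainer`; `texts` and `labels` are hypothetical stand-ins for the annotated Politus data, which is not public.

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")

class OrgDataset(torch.utils.data.Dataset):
    """Tokenizes texts to 128 tokens and exposes them to the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True,
                                   padding="max_length", max_length=128)
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

# 80/20 split, as described above; `texts`/`labels` are hypothetical inputs.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
train_dataset = OrgDataset(train_texts, train_labels)
val_dataset = OrgDataset(val_texts, val_labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased", num_labels=2)

# Parameters as listed under Training Setup; the Optuna search described
# next revised the learning rate, batch size, and epoch count.
args = TrainingArguments(
    output_dir="org-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()
```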
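The reported best hyperparameters (learning rate ≈ 1.23e-05, batch size 32, 2 epochs) came from an Optuna search. The card does not state the search space, so the ranges below are illustrative; `Trainer.hyperparameter_search` with the Optuna backend is one standard way to run such a search:

```python
def model_init():
    # A fresh model per trial, as required by hyperparameter_search.
    return AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-uncased", num_labels=2)

def hp_space(trial):
    # Illustrative ranges; the card only reports the best values found.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

search_trainer = Trainer(model_init=model_init, args=args,
                         train_dataset=train_dataset, eval_dataset=val_dataset)
best_run = search_trainer.hyperparameter_search(
    hp_space=hp_space, backend="optuna", direction="minimize", n_trials=20)
print(best_run.hyperparameters)
```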
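Finally, the validation metrics reported above (precision, recall, macro F1, accuracy, and the confusion matrix) can be recomputed from a trained model with scikit-learn; label 1 is assumed here to be the organization class:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

pred = trainer.predict(val_dataset)            # logits for the 20% split
y_pred = np.argmax(pred.predictions, axis=-1)

# Macro-averaged scores, comparable to the card's reported macro F1 of 0.95.
precision, recall, f1, _ = precision_recall_fscore_support(
    val_labels, y_pred, average="macro")
print("accuracy    :", accuracy_score(val_labels, y_pred))
print("macro P/R/F1:", precision, recall, f1)
print(confusion_matrix(val_labels, y_pred))
```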