---
language:
- ar
- az
- bg
- de
- el
- en
- es
- fr
- hi
- it
- ja
- nl
- pl
- pt
- ru
- sw
- th
- tr
- ur
- vi
- zh
license: cc-by-nc-4.0
tags:
- language detect
pipeline_tag: text-classification
---

# Multilingual Language Detection Model

## Model Description

This repository contains a multilingual language detection model based on the XLM-RoBERTa base architecture. The model distinguishes between 21 languages: Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.

## How to Use

You can use this model with the `transformers` text-classification pipeline, or load the tokenizer and model directly for more control, as shown in the example below.

### Quick Start

First, install the `transformers` library and PyTorch if you haven't already:

```bash
pip install transformers torch
```

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")

# Prepare text
text = "Əlqasım oğulları vorzakondu"
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

# Prediction (eval mode, no gradient tracking)
model.eval()
with torch.no_grad():
    outputs = model(**encoded_input)

# Process the outputs: softmax over the logits, then take the most probable class
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class_index = probabilities.argmax().item()

# The position of each label corresponds to the model's output index
labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
predicted_label = labels[predicted_class_index]
print(f"Predicted Language: {predicted_label}")
```

## Training Performance

The model was trained for three epochs. Validation loss fell and accuracy and F1 score rose at every epoch:

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.0127        | 0.0174          | 0.9966   | 0.9966   |
| 2     | 0.0149        | 0.0141          | 0.9973   | 0.9973   |
| 3     | 0.0001        | 0.0109          | 0.9984   | 0.9984   |

## Test Results

The model achieved the following results on the test set:

- Loss: 0.0133
- Accuracy: 0.9975
- F1 Score: 0.9975
- Precision: 0.9975
- Recall: 0.9975
- Evaluation Time: 17.5 seconds
- Samples per Second: 599.685
- Steps per Second: 9.424

## License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license allows you to share and adapt the material with attribution to the source, but prohibits commercial use.

## Contact information

If you have any questions or suggestions, please contact us at [v.resad.89@gmail.com].
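
## Additional Example: Ranking Predictions

The Quick Start snippet reports only the single most probable language. As a supplement, here is a minimal sketch of the same post-processing step (softmax over the logits, then ranking) written in plain Python so that it runs without downloading the model. The names `softmax` and `top_languages` and the sample logits are hypothetical and not part of this repository. The label order is copied from the Quick Start snippet and is assumed to match the model's output indices.

```python
import math

# Label order copied from the Quick Start snippet; assumed to match
# the model's output indices.
LABELS = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
          "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_languages(logits, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical logits for illustration only.
    example_logits = [0.0] * len(LABELS)
    example_logits[0] = 8.0    # index 0 is "az"
    example_logits[17] = 3.0   # index 17 is "tr"
    for label, prob in top_languages(example_logits):
        print(f"{label}: {prob:.4f}")
```

With the real model, the input would presumably be `outputs.logits[0].tolist()` from the Quick Start example. Showing the runner-up languages and their probabilities can help flag short or mixed-language inputs where the top prediction is not clear-cut.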