---
license: cc-by-nc-4.0
language:
- ar
- az
- bg
- de
- el
- en
- es
- fr
- hi
- it
- ja
- nl
- pl
- pt
- ru
- sw
- th
- tr
- ur
- vi
- zh
pipeline_tag: text-classification
tags:
- language detect
---

# Multilingual Language Detection Model

## Model Description
This repository contains a multilingual language detection model based on the XLM-RoBERTa base architecture. The model distinguishes between 21 languages: Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.

## How to Use
You can run this model through a `transformers` text-classification pipeline, or load the tokenizer and model yourself for more control, as shown in the example below.

### Quick Start
First, install the `transformers` library and PyTorch (the example below imports `torch`) if you haven't already:
```bash
pip install transformers torch
```

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")

# Prepare text
text = "Əlqasım oğulları vorzakondu"
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

# Prediction
model.eval()
with torch.no_grad():
    outputs = model(**encoded_input)

# Process the outputs
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class_index = probabilities.argmax(dim=-1).item()
# Position i in this list is the language code for class index i
labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
predicted_label = labels[predicted_class_index]
print(f"Predicted Language: {predicted_label}")
```
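
The pipeline route mentioned above can be sketched as follows. This is a minimal sketch, not the author's reference code. It assumes the model's configuration may expose generic class names such as `LABEL_5` rather than language codes, so it maps them through the label list from the quick-start example. The helper names `to_language_code`, `detect_language`, and `build_classifier` are hypothetical.

```python
# Label order copied from the quick-start example above.
LABELS = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
          "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]


def to_language_code(label):
    """Map a generic 'LABEL_<n>' name to a language code; pass other names through."""
    prefix = "LABEL_"
    if label.startswith(prefix) and label[len(prefix):].isdigit():
        index = int(label[len(prefix):])
        if index < len(LABELS):
            return LABELS[index]
    return label


def detect_language(text, classifier):
    """Return (language_code, score) for the top prediction.

    `classifier` is any callable that returns a list of
    {"label": ..., "score": ...} dicts for a single string, which is
    the shape a transformers text-classification pipeline returns.
    """
    top = classifier(text)[0]
    return to_language_code(top["label"]), top["score"]


def build_classifier():
    """Create the pipeline; this downloads the model on first use."""
    from transformers import pipeline
    return pipeline("text-classification", model="LocalDoc/language_detection")
```

To try it, call `classifier = build_classifier()` once and then `detect_language("Bonjour tout le monde", classifier)`. If the pipeline already returns language codes, `to_language_code` leaves them unchanged.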



## Training Performance

The model was trained for three epochs. Validation loss, accuracy, and F1 score improved in every epoch, while training loss rose slightly in epoch 2 before dropping in epoch 3:

- **Epoch 1:** Training Loss: 0.0127, Validation Loss: 0.0174, Accuracy: 0.9966, F1 Score: 0.9966
- **Epoch 2:** Training Loss: 0.0149, Validation Loss: 0.0141, Accuracy: 0.9973, F1 Score: 0.9973
- **Epoch 3:** Training Loss: 0.0001, Validation Loss: 0.0109, Accuracy: 0.9984, F1 Score: 0.9984

## Test Results

The model achieved the following results on the test set:

- Loss: 0.0133
- Accuracy: 0.9975
- F1 Score: 0.9975
- Precision: 0.9975
- Recall: 0.9975
- Evaluation Time: 17.5 seconds
- Samples per Second: 599.685
- Steps per Second: 9.424
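
The throughput figures above give a rough idea of the size of the evaluation run. The short calculation below multiplies the reported rates by the reported evaluation time. Because that time is rounded to 17.5 seconds, the results are estimates only (on the order of 10,500 samples), not figures stated by the author.

```python
# Figures reported in the test results above.
eval_seconds = 17.5
samples_per_second = 599.685
steps_per_second = 9.424

# Rate multiplied by duration approximates the totals for the evaluation run.
approx_samples = eval_seconds * samples_per_second
approx_steps = eval_seconds * steps_per_second

# Ratio of the two rates approximates the average number of samples per step.
approx_samples_per_step = samples_per_second / steps_per_second

print(f"~{approx_samples:.0f} samples, ~{approx_steps:.0f} steps, "
      f"~{approx_samples_per_step:.1f} samples per step")
```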

## Licensing

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You are free to use, modify, and distribute this model for non-commercial purposes, provided you give appropriate credit to the original creators.