	---
	language:
	- ar
	- az
	- bg
	- de
	- el
	- en
	- es
	- fr
	- hi
	- it
	- ja
	- nl
	- pl
	- pt
	- ru
	- sw
	- th
	- tr
	- ur
	- vi
	- zh
	license: cc-by-nc-4.0
	tags:
	- language detect
	pipeline_tag: text-classification
	---

	# Multilingual Language Detection Model

	## Model Description
	This repository contains a multilingual language detection model built on the XLM-RoBERTa base architecture. It classifies text into one of 21 languages: Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.

	## How to Use
	You can run this model through the `transformers` text-classification pipeline, or load the tokenizer and model yourself for more control, as shown in the example below.

	### Quick Start
	First, install the `transformers` library and PyTorch (the example imports `torch`) if you haven't already:
	```bash
	pip install transformers torch
	```

	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	import torch

	# Load tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained("LocalDoc/language_detection")
	model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")

	# Prepare text
	text = "Əlqasım oğulları vorzakondu"
	encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

	# Prediction
	model.eval()
	with torch.no_grad():
	    outputs = model(**encoded_input)

	# Process the outputs
	logits = outputs.logits
	probabilities = torch.nn.functional.softmax(logits, dim=-1)
	predicted_class_index = probabilities.argmax().item()
	labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
	predicted_label = labels[predicted_class_index]
	print(f"Predicted Language: {predicted_label}")
	```
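
The pipeline route mentioned under How to Use can be sketched as follows. This is a minimal sketch, not the authors' documented usage: `detect_languages` and `to_language_code` are hypothetical helper names, and the `LABELS` order is copied from the Quick Start example on the assumption that it matches the model's class indices. The helper covers the case where the pipeline reports generic names such as `LABEL_3`; if the model config already defines language codes, they pass through unchanged.

```python
# Label order copied from the Quick Start example; assumed to match the
# model's class indices.
LABELS = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
          "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]


def to_language_code(label: str) -> str:
    """Map a pipeline label to a language code.

    Generic names such as "LABEL_3" are looked up by index in LABELS;
    anything else is returned unchanged.
    """
    if label.startswith("LABEL_"):
        return LABELS[int(label.split("_", 1)[1])]
    return label


def detect_languages(texts):
    """Return a (language_code, score) pair for each input string."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model="LocalDoc/language_detection")
    results = classifier(texts, truncation=True)
    return [(to_language_code(r["label"]), r["score"]) for r in results]
```

Calling `detect_languages(["Bonjour tout le monde", "Guten Morgen"])` downloads the model on first use and returns one pair per input. For repeated calls, creating the pipeline once and reusing it avoids reloading the model.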



	## Training Performance

	The model was trained for three epochs. Validation loss, accuracy, and F1 score improved at every epoch, while training loss rose slightly in epoch 2 before dropping sharply in epoch 3:

	| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
	|-------|---------------|-----------------|----------|----------|
	| 1     | 0.0127        | 0.0174          | 0.9966   | 0.9966   |
	| 2     | 0.0149        | 0.0141          | 0.9973   | 0.9973   |
	| 3     | 0.0001        | 0.0109          | 0.9984   | 0.9984   |

	## Test Results

	The model achieved the following results on the test set:

	- Loss: 0.0133
	- Accuracy: 0.9975
	- F1 Score: 0.9975
	- Precision: 0.9975
	- Recall: 0.9975
	- Evaluation Time: 17.5 seconds
	- Samples per Second: 599.685
	- Steps per Second: 9.424
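
The throughput figures also give a rough sense of the evaluation setup. The following is a back-of-the-envelope sketch, assuming the reported time and rates describe the same evaluation run and that the 17.5 seconds figure is rounded. The variable names are illustrative, and the derived numbers are estimates, not values stated by the authors.

```python
# Figures reported in the test results above.
eval_seconds = 17.5
samples_per_second = 599.685
steps_per_second = 9.424

# Throughput multiplied by wall-clock time approximates the totals.
approx_samples = samples_per_second * eval_seconds
approx_steps = steps_per_second * eval_seconds

# Samples processed per step approximates the evaluation batch size.
approx_batch_size = samples_per_second / steps_per_second

print(f"~{approx_samples:.0f} samples, ~{approx_steps:.0f} steps, "
      f"~{approx_batch_size:.1f} samples per step")
```

Under these assumptions the test set holds roughly 10,500 samples, evaluated in batches of about 64. Neither number is stated in the model card.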


	## License

	The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license allows you to share and adapt the material with attribution to the source, but prohibits commercial use.



	## Contact Information

	If you have any questions or suggestions, please contact us at v.resad.89@gmail.com.