Update README.md #1
by ShaidaMuhammad - opened

README.md CHANGED

---
license: mit
language:
- ur
---

# ayeshasameer/xlm-roberta-roman-urdu-sentiment

## Model Description

The `ayeshasameer/xlm-roberta-roman-urdu-sentiment` model is a fine-tuned version of [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base), adapted for sentiment analysis of Roman Urdu text. XLM-RoBERTa is a multilingual variant of RoBERTa, pre-trained on text in a wide range of languages, which makes it well suited to NLP tasks across many languages.

The model classifies Roman Urdu text into three sentiment categories:
- Positive
- Neutral
- Negative

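Roman Urdu is Urdu written in Latin script, so the SentencePiece tokenizer that XLM-RoBERTa ships with can process it without any transliteration step. A quick illustrative check (the exact subword pieces may vary by tokenizer version):

```python
from transformers import AutoTokenizer

# The base multilingual tokenizer handles Latin-script (Roman) Urdu directly.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tokenizer.tokenize("Mein ek bahut acha insaan hon."))
```
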
## Model Architecture

- **Model Type:** XLM-RoBERTa
- **Number of Layers:** 12
- **Hidden Size:** 768
- **Number of Attention Heads:** 12
- **Intermediate Size:** 3072
- **Max Position Embeddings:** 514
- **Vocabulary Size:** 250002
- **Hidden Activation Function:** GELU
- **Hidden Dropout Probability:** 0.1
- **Attention Dropout Probability:** 0.1
- **Layer Norm Epsilon:** 1e-5

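These hyperparameters match the `xlm-roberta-base` defaults that this checkpoint inherits. For reference only, a sketch of the equivalent `transformers` configuration object (the checkpoint's own `config.json` remains the authoritative source):

```python
from transformers import XLMRobertaConfig

# Illustrative sketch mirroring the architecture listed above
# (the xlm-roberta-base defaults); not loaded from the checkpoint.
config = XLMRobertaConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    vocab_size=250002,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-5,
)
print(config)
```
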
## Training Data

The model was fine-tuned on a dataset of Roman Urdu text labeled for sentiment analysis. The dataset includes text from social media, news comments, and other sources where Roman Urdu is commonly used. The labels for the dataset were:
- Positive
- Neutral
- Negative

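The index order of these labels in the classifier head is not documented in this card. The examples below assume the mapping sketched here; the checkpoint's config (`model.config.id2label`) is the authoritative source:

```python
# Assumed (hypothetical) label mapping used by the examples in this card;
# verify against model.config.id2label before relying on it.
label2id = {"Negative": 0, "Neutral": 1, "Positive": 2}
id2label = {v: k for k, v in label2id.items()}
```
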
## Intended Use

The model is intended for sentiment analysis of Roman Urdu text, which is commonly used in informal settings such as social media, chat applications, and user-generated content platforms. It can be used to gauge the sentiment of user comments, reviews, and other forms of text communication.

## Example Usage

Here is an example of how to use this model with the Hugging Face Transformers library in Python:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax

# Load the model and tokenizer
model_name = "ayeshasameer/xlm-roberta-roman-urdu-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Preprocess the input text
text = "Mein ek bahut acha insaan hon."
inputs = tokenizer(text, return_tensors="pt")

# Get model predictions (inference only, so gradients are not needed)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits[0].numpy()
scores = softmax(logits)

# Map scores to sentiment labels
# (assumed index order: 0=Negative, 1=Neutral, 2=Positive)
sentiment = {
    "Negative": scores[0],
    "Neutral": scores[1],
    "Positive": scores[2],
}
print(sentiment)
```

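For quick experiments, the higher-level `pipeline` API should also work; this is a minimal sketch, assuming the checkpoint's config defines human-readable label names (`id2label`):

```python
from transformers import pipeline

# If the config lacks id2label, labels come back as LABEL_0 / LABEL_1 / LABEL_2.
classifier = pipeline(
    "text-classification",
    model="ayeshasameer/xlm-roberta-roman-urdu-sentiment",
)
print(classifier("Mein ek bahut acha insaan hon."))
```
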
## Evaluation

The model was evaluated on a held-out test set of Roman Urdu text and achieved the following performance metrics:
- **Accuracy:** 0.XX
- **Precision:** 0.XX
- **Recall:** 0.XX
- **F1 Score:** 0.XX

These metrics indicate the model's ability to correctly classify the sentiment of Roman Urdu text.

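A minimal sketch of how such metrics could be computed on a labeled test set, using hypothetical `y_true`/`y_pred` integer-label lists (this is not the authors' evaluation script):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels; 0=Negative, 1=Neutral, 2=Positive (assumed order)
y_true = [0, 2, 1, 2, 0]
y_pred = [0, 2, 2, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2f}  Precision: {precision:.2f}  "
      f"Recall: {recall:.2f}  F1: {f1:.2f}")
```
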
## Limitations

While the model performs well on the provided dataset, there are some limitations:
- The model may not generalize well to domains or types of text that were not represented in the training data.
- Misclassifications can occur, especially with text containing sarcasm, slang, or context-specific language the model was not trained on.
- The model's performance depends on the quality and representativeness of the training data.

## Ethical Considerations

When using the model, it is essential to consider the ethical implications:
- Ensure that the text being analyzed does not contain sensitive or private information.
- Be mindful of potential biases in the training data, which could affect the model's predictions.
- Use the model responsibly, especially in applications that may affect individuals or communities.

---