---
language:
- ro
metrics:
- accuracy
base_model: google-bert/bert-base-multilingual-cased
library_name: transformers.js
tags:
- romanian
- fake news
- BERT
---

# Model Card for BERT

This model classifies Romanian news text into one of five labels: "fake_news", "misinformation", "propaganda", "real_news", or "satire".

## Model Details

### Model Description

This model is a BERT-based model fine-tuned to detect fake news in Romanian text. It classifies input text into one of five predefined labels indicating the type of content.

- **Developed by:** Bogdan Mihalca
- **Model type:** BERT
- **License:** MIT
- **Finetuned from model:** bert-base-multilingual-cased

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]

## Uses

### Direct Use

This model can be used directly to classify Romanian text into five categories: fake news, misinformation, propaganda, real news, and satire. It is useful for media monitoring platforms, news verification services, and misinformation research.

### Downstream Use

Potential applications include integration into larger systems for real-time news validation, or inclusion in a research toolkit for studying misinformation.

### Out-of-Scope Use

This model is not intended for classifying non-Romanian text; its accuracy is likely to drop significantly on other languages. It is also not suitable for sentiment analysis or for content unrelated to news.

## Bias, Risks, and Limitations

This model may exhibit bias due to the nature of the training data, which could lead to overfitting on certain types of news or propaganda specific to Romanian contexts. It might not generalize well to new, unseen forms of misinformation or satire.

### Recommendations

Users should evaluate the model's performance on their specific use case before deployment, particularly when it is applied in contexts different from those in which the training data was collected.

## How to Get Started with the Model

Load the model and its tokenizer with the Hugging Face Transformers library; a minimal usage sketch follows.
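
The snippet below is a minimal sketch of loading and running the classifier in Python (the card's metadata also lists `transformers.js` for in-browser use); the repository id is a placeholder, not the model's actual location.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder repository id -- substitute the actual model repo.
model_id = "username/romanian-fake-news-bert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Classify a Romanian news snippet into one of the five labels.
text = "Guvernul a anunțat noi măsuri economice pentru anul viitor."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
label = model.config.id2label[int(probs.argmax())]
print(f"{label} ({probs.max():.2%})")
```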

## Training Details

### Training Data

The model was trained on a combined dataset of Romanian fake news, including data scraped from Veridica.ro and the Fakerom dataset. The dataset covers five categories: fake news, misinformation, propaganda, real news, and satire.

### Training Procedure

#### Preprocessing

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** fp32 precision
- **Number of epochs:** 3
- **Batch size:** 16
- **Warmup steps:** 500
- **Weight decay:** 0.01
- **Learning rate scheduler:** warmup with cosine decay

These settings are expressed as a `TrainingArguments` configuration in the sketch below.
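
As a hedged sketch, the hyperparameters above map onto a Hugging Face `TrainingArguments` object roughly as follows; the output directory is a hypothetical value, and the learning rate is left at the library default since the card does not state it.

```python
from transformers import TrainingArguments

# Hyperparameters taken from the list above; output_dir is hypothetical.
training_args = TrainingArguments(
    output_dir="bert-ro-fake-news",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    lr_scheduler_type="cosine",  # warmup (warmup_steps) then cosine decay
    fp16=False,                  # fp32 training regime
)
```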

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was tested on a held-out split of the combined dataset, with 20% of the data reserved for evaluation (see the split sketch below).
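
A minimal sketch of such an 80/20 split with scikit-learn; the `texts` and `labels` stand-ins and the stratification choice are assumptions, not details from the card.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the combined dataset (assumed structure).
texts = [f"știre {i}" for i in range(50)]
labels = [i % 5 for i in range(50)]  # five classes

# Reserve 20% of the data for evaluation, as described above.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
```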

#### Factors

The evaluation was performed across all five categories, with results disaggregated by class.

#### Metrics

The primary evaluation metrics were accuracy, precision, recall, F1 score, and log loss; ROC AUC was also computed for each class. A sketch of computing these metrics is shown below.
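
A minimal sketch of these metrics with scikit-learn; the weighted averaging and the function interface are assumptions, as the card does not specify how the aggregate scores were produced.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, log_loss,
                             precision_recall_fscore_support, roc_auc_score)

def evaluate(y_true, y_proba, n_classes=5):
    """Score predictions given true labels and an (n_samples, n_classes)
    array of predicted class probabilities."""
    y_true = np.asarray(y_true)
    y_pred = np.argmax(y_proba, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    # One-vs-rest ROC AUC for each class.
    per_class_auc = [roc_auc_score((y_true == k).astype(int), y_proba[:, k])
                     for k in range(n_classes)]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision, "recall": recall, "f1": f1,
        "log_loss": log_loss(y_true, y_proba, labels=list(range(n_classes))),
        "roc_auc_per_class": per_class_auc,
    }
```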

### Results

- Accuracy: 93.51%
- Precision: 93.80%
- Recall: 93.51%
- F1 Score: 93.56%
- Log Loss: 0.225

ROC AUC per class:

- Fake News: 98.50%
- Misinformation: 98.89%
- Propaganda: 98.98%
- Real News: 99.55%
- Satire: 99.99%

#### Summary

The model demonstrates strong performance across all classes, with particularly high ROC AUC scores indicating good separability between classes.