Upload 10 files
- README.md +98 -1
- config.json +35 -0
- .gitattributes +32 -0
- .gitignore +1 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +17 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -1,3 +1,100 @@
 ---
-
+language: en
+license: afl-3.0
+tags:
+- generated_from_trainer
+metrics:
+- accuracy
+- precision
+- recall
+- f1
+model-index:
+- name: covid-twitter-bert-v2-struth
+  results: []
+widget:
+- text: "COVID vaccines can prevent serious illness and death from COVID-19"
+  example_title: "Real Tweet"
+- text: "COVID vaccines are not effective at protecting you from serious illness and death from COVID-19"
+  example_title: "Fake Tweet"
 ---
+
+# covid-twitter-bert-v2-struth
+
+This model is a fine-tuned version of [digitalepidemiologylab/covid-twitter-bert-v2](https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2) on the [COVID-19 Fake News Dataset NLP by Elvin Aghammadzada](https://www.kaggle.com/datasets/elvinagammed/covid19-fake-news-dataset-nlp?select=Constraint_Val.csv).
+It achieves the following results on the evaluation set:
+- Loss: 0.1171
+- Accuracy: 0.9662
+- Precision: 0.9813
+- Recall: 0.9493
+- F1: 0.9650
+
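As a quick consistency check on the metrics listed above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported (rounded) values; the helper name `f1_score` is illustrative and is not taken from the project's training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for the evaluation set in the model card.
reported_precision = 0.9813
reported_recall = 0.9493
reported_f1 = 0.9650

recomputed_f1 = f1_score(reported_precision, reported_recall)

# The inputs are already rounded to four decimals, so agreement is only
# expected up to rounding error.
print(f"reported={reported_f1:.4f} recomputed={recomputed_f1:.4f}")
```

The recomputed value agrees with the reported F1 to within rounding, so the three numbers are mutually consistent.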
+## Model description
+
+This model builds on the work of the Digital Epidemiology Lab and their COVID-Twitter-BERT model. We extended their model by fine-tuning it for sequence classification. It is part of a wider project on true/fake news detection by the [Struth Social Team](https://github.com/Struth-Social-UNSW/ITProject2).
+
+## Intended uses & limitations
+
+This model is intended to classify Tweets as either true (0) or fake (1). It can also be used for relatively complex statements regarding COVID-19.
+
+A known limitation is that the model handles basic statements (e.g. "COVID is a hoax") poorly, as the Tweets used to train it are not simplistic in nature.
+
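The 0/1 convention in this section corresponds to the `id2label` mapping in this commit's `config.json` (0 is `real`, 1 is `fake`). A minimal sketch of turning a pair of classifier logits into a label and a probability follows. The logits are made-up values for illustration, and `softmax` and `classify` are hypothetical helpers, not code shipped with the model.

```python
import math

# Label mapping copied from config.json in this commit.
ID2LABEL = {0: "real", 1: "fake"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits, not real model output.
label, prob = classify([-1.2, 2.3])
print(label, round(prob, 3))
```

With real model output, the two logits would come from the sequence-classification head for a single Tweet.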
+## Training and evaluation data
+The training and testing data were split 80:20 for the results listed above.
+
+Training/Testing Set:
+- Samples Total: 8437
+- Samples Train: 6749
+- Samples Test: 1687
+
+Evaluation Set:
+- Samples Total: 100
+
+## Training procedure
+1. Data is preprocessed through custom scripts
+2. Data is passed to the model training script
+3. Training is conducted
+4. Best model is retrieved at end of training and uploaded to the Hub
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 2e-05
+- train_batch_size: 16
+- eval_batch_size: 16
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- num_epochs: 20
+
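The step counts in the training results (422 steps per epoch, 8440 in total) follow from these settings and the 6749 training samples. The sketch below reproduces that arithmetic and a linear learning-rate decay. It assumes a single device, no gradient accumulation, and zero warmup steps, none of which the card states; `lr_at` is an illustrative helper.

```python
import math

train_samples = 6749      # "Samples Train" from the model card
train_batch_size = 16
num_epochs = 20
learning_rate = 2e-05

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / train_batch_size)
total_steps = steps_per_epoch * num_epochs

def lr_at(step: int) -> float:
    """Linear decay from learning_rate to 0 (assumes zero warmup steps)."""
    return learning_rate * max(0.0, (total_steps - step) / total_steps)

print(steps_per_epoch, total_steps)
print(lr_at(0), lr_at(total_steps // 2), lr_at(total_steps))
```

Under these assumptions the learning rate is halved by the end of epoch 10 and reaches zero at step 8440.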
+### Training results
+
+| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
+|:-------------:|:-----:|:----:|:---------------:|:--------:|:---------:|:------:|:------:|
+| 0.1719 | 1.0 | 422 | 0.1171 | 0.9662 | 0.9813 | 0.9493 | 0.9650 |
+| 0.0565 | 2.0 | 844 | 0.1595 | 0.9621 | 0.9421 | 0.9831 | 0.9622 |
+| 0.0221 | 3.0 | 1266 | 0.2059 | 0.9585 | 0.9859 | 0.9287 | 0.9565 |
+| 0.009 | 4.0 | 1688 | 0.1378 | 0.9722 | 0.9600 | 0.9843 | 0.9720 |
+| 0.0021 | 5.0 | 2110 | 0.2013 | 0.9722 | 0.9863 | 0.9565 | 0.9712 |
+| 0.0069 | 6.0 | 2532 | 0.2894 | 0.9615 | 0.9948 | 0.9263 | 0.9593 |
+| 0.0054 | 7.0 | 2954 | 0.2692 | 0.9650 | 0.9949 | 0.9336 | 0.9632 |
+| 0.0058 | 8.0 | 3376 | 0.2406 | 0.9639 | 0.9776 | 0.9481 | 0.9626 |
+| 0.0017 | 9.0 | 3798 | 0.1877 | 0.9722 | 0.9654 | 0.9783 | 0.9718 |
+| 0.0019 | 10.0 | 4220 | 0.2761 | 0.9686 | 0.9850 | 0.9505 | 0.9674 |
+| 0.007 | 11.0 | 4642 | 0.1889 | 0.9722 | 0.9875 | 0.9553 | 0.9711 |
+| 0.0007 | 12.0 | 5064 | 0.2774 | 0.9662 | 0.9837 | 0.9469 | 0.9649 |
+| 0.0008 | 13.0 | 5486 | 0.2344 | 0.9722 | 0.9791 | 0.9638 | 0.9714 |
+| 0.0 | 14.0 | 5908 | 0.2768 | 0.9662 | 0.9789 | 0.9517 | 0.9651 |
+| 0.0 | 15.0 | 6330 | 0.2798 | 0.9662 | 0.9789 | 0.9517 | 0.9651 |
+| 0.0 | 16.0 | 6752 | 0.2790 | 0.9668 | 0.9789 | 0.9529 | 0.9657 |
+| 0.0 | 17.0 | 7174 | 0.2850 | 0.9668 | 0.9789 | 0.9529 | 0.9657 |
+| 0.0 | 18.0 | 7596 | 0.2837 | 0.9668 | 0.9789 | 0.9529 | 0.9657 |
+| 0.0 | 19.0 | 8018 | 0.2835 | 0.9674 | 0.9789 | 0.9541 | 0.9664 |
+| 0.0 | 20.0 | 8440 | 0.2842 | 0.9674 | 0.9789 | 0.9541 | 0.9664 |
+
+
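The headline metrics at the top of the card match the epoch 1 row, which has the lowest validation loss. The card does not say which criterion selected the "best model", so the sketch below assumes lowest validation loss and contrasts it with highest accuracy. The tuples are (epoch, validation loss, accuracy, F1) copied from the table.

```python
# (epoch, validation_loss, accuracy, f1) copied from the training results table.
rows = [
    (1, 0.1171, 0.9662, 0.9650), (2, 0.1595, 0.9621, 0.9622),
    (3, 0.2059, 0.9585, 0.9565), (4, 0.1378, 0.9722, 0.9720),
    (5, 0.2013, 0.9722, 0.9712), (6, 0.2894, 0.9615, 0.9593),
    (7, 0.2692, 0.9650, 0.9632), (8, 0.2406, 0.9639, 0.9626),
    (9, 0.1877, 0.9722, 0.9718), (10, 0.2761, 0.9686, 0.9674),
    (11, 0.1889, 0.9722, 0.9711), (12, 0.2774, 0.9662, 0.9649),
    (13, 0.2344, 0.9722, 0.9714), (14, 0.2768, 0.9662, 0.9651),
    (15, 0.2798, 0.9662, 0.9651), (16, 0.2790, 0.9668, 0.9657),
    (17, 0.2850, 0.9668, 0.9657), (18, 0.2837, 0.9668, 0.9657),
    (19, 0.2835, 0.9674, 0.9664), (20, 0.2842, 0.9674, 0.9664),
]

# Assumed selection criterion: lowest validation loss.
best_by_loss = min(rows, key=lambda r: r[1])

# Alternative criterion for comparison: highest accuracy (ties are possible).
top_accuracy = max(r[2] for r in rows)
epochs_at_top_accuracy = [r[0] for r in rows if r[2] == top_accuracy]

print("lowest validation loss:", best_by_loss)
print("epochs at top accuracy:", epochs_at_top_accuracy)
```

Training loss reaches 0.0 from epoch 14 while validation loss stays well above its epoch 1 value, which is consistent with the earliest checkpoint being the one selected.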
+### Framework versions
+
+- Transformers 4.22.2
+- Pytorch 1.12.1+cu113
+- Datasets 2.5.1
+- Tokenizers 0.12.1
config.json
ADDED
@@ -0,0 +1,35 @@
+{
+  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "real",
+    "1": "fake"
+  },
+  "label2id": {
+    "real": 0,
+    "fake": 1
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.22.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
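The dimensions in this config can be cross-checked against the size of `pytorch_model.bin` recorded in its LFS pointer (1340711725 bytes). The sketch below counts parameters assuming the standard BERT layout (embeddings, 24 encoder layers, a pooler, and a two-label linear classifier) stored as float32. That layout is an assumption about the checkpoint, not something the diff itself shows.

```python
# Values copied from config.json in this commit.
hidden = 1024
intermediate = 4096
layers = 24
vocab = 30522
max_positions = 512
type_vocab = 2
num_labels = 2  # two entries in id2label

def linear(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer: weights plus biases."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and bias

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward projection
    + layer_norm
)

pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total_params = embeddings + layers * encoder_layer + pooler + classifier
approx_bytes = total_params * 4  # float32 uses 4 bytes per parameter

lfs_size = 1340711725  # size field of the pytorch_model.bin pointer
print(f"{total_params:,} parameters")
print(f"{approx_bytes:,} bytes estimated vs {lfs_size:,} bytes on disk")
```

The estimate falls slightly below the recorded file size; the small remainder is plausibly serialization overhead and non-parameter buffers, though the diff alone cannot confirm that.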
.gitattributes
ADDED
@@ -0,0 +1,32 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
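These patterns decide which of the ten uploaded files are stored through Git LFS rather than as regular blobs. The sketch below uses Python's `fnmatch` as an approximation of gitattributes matching for the simple basename patterns; real Git semantics differ for patterns containing slashes, such as `saved_model/**/*`, which is left out here. Only a subset of the patterns is listed.

```python
from fnmatch import fnmatchcase

# A subset of the basename patterns from the .gitattributes diff.
lfs_patterns = ["*.bin", "*.h5", "*.onnx", "*.pt", "*.pth", "*tfevents*"]

# The ten files listed in this commit.
uploaded_files = [
    "README.md", "config.json", ".gitattributes", ".gitignore",
    "pytorch_model.bin", "special_tokens_map.json", "tokenizer.json",
    "tokenizer_config.json", "training_args.bin", "vocab.txt",
]

def tracked_by_lfs(filename: str) -> bool:
    """True if any listed pattern matches the file name (approximation)."""
    return any(fnmatchcase(filename, pattern) for pattern in lfs_patterns)

lfs_files = [name for name in uploaded_files if tracked_by_lfs(name)]
print(lfs_files)
```

The result lines up with the rest of the diff, where only the two `.bin` files appear as LFS pointer files.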
.gitignore
ADDED
@@ -0,0 +1 @@
+checkpoint-*/
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f4c554f2870212633d6be09415d8e832bf3cffcf468beccc158106236c5731b5
+size 1340711725
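The three added lines are a Git LFS pointer: the repository stores this small text file, and the actual weights are fetched by object ID. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for this illustration, not part of Git LFS or the Hub tooling.

```python
# Pointer contents copied from the pytorch_model.bin diff.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f4c554f2870212633d6be09415d8e832bf3cffcf468beccc158106236c5731b5
size 1340711725
"""

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], len(pointer["digest"]))
print(f"{pointer['size'] / 2**30:.2f} GiB")
```

The `size` field describes the real weights file, which is why a three-line diff can stand in for a checkpoint of well over a gigabyte.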
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,17 @@
+{
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "full_tokenizer_file": null,
+  "mask_token": "[MASK]",
+  "model_max_length": 128,
+  "name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "special_tokens_map_file": null,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:93e47cece7d97b6ad7b6d1c4d367c9283316641019143a13746823d55f2f7692
+size 3375
vocab.txt
ADDED
The diff for this file is too large to render.