ZodiUOA committed on
Commit
fc2d1d8
1 Parent(s): 039b490

Upload 10 files

README.md CHANGED
@@ -1,3 +1,100 @@
  ---
- license: unknown
  ---
  ---
+ language: en
+ license: afl-3.0
+ tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ model-index:
+ - name: covid-twitter-bert-v2-struth
+   results: []
+ widget:
+ - text: "COVID vaccines can prevent serious illness and death from COVID-19"
+   example_title: "Real Tweet"
+ - text: "COVID vaccines are not effective at protecting you from serious illness and death from COVID-19"
+   example_title: "Fake Tweet"
  ---
+
+ # covid-twitter-bert-v2-struth
+
+ This model is a fine-tuned version of [digitalepidemiologylab/covid-twitter-bert-v2](https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2) on the [COVID-19 Fake News Dataset NLP by Elvin Aghammadzada](https://www.kaggle.com/datasets/elvinagammed/covid19-fake-news-dataset-nlp?select=Constraint_Val.csv).
+ It achieves the following results on the evaluation set:
+ - Loss: 0.1171
+ - Accuracy: 0.9662
+ - Precision: 0.9813
+ - Recall: 0.9493
+ - F1: 0.9650
+
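As a quick sanity check on the metrics listed above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the helper name `f1_score` is illustrative and is not taken from the project's training scripts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the results list in the model card.
reported_f1 = 0.9650
computed_f1 = f1_score(precision=0.9813, recall=0.9493)

# The recomputed value should agree with the reported one to four decimals.
print(f"computed F1 = {computed_f1:.4f}, reported F1 = {reported_f1:.4f}")
```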
+ ## Model description
+
+ This model builds on the work of the Digital Epidemiology Lab and their COVID Twitter BERT model. We have extended their model by fine-tuning it for sequence classification. It is part of a wider project on true/fake news detection by the [Struth Social Team](https://github.com/Struth-Social-UNSW/ITProject2).
+
+ ## Intended uses & limitations
+
+ This model is intended for classifying Tweets as either real (0) or fake (1). It can also be used on relatively complex statements regarding COVID-19.
+
+ A known limitation is that the model struggles with basic statements (e.g. "COVID is a hoax"), as the Tweets used to train it are not simplistic in nature.
+
+ ## Training and evaluation data
+ The training and testing data were split 80:20 for the results listed above.
+
+ Training/Testing Set:
+ - Samples Total: 8437
+ - Samples Train: 6749
+ - Samples Test: 1687
+
+ Evaluation Set:
+ - Samples Total: 100
+
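The split sizes above can be checked with a little arithmetic. The sketch assumes each subset size was rounded down from its share of the total; that is a guess about the preprocessing scripts, not something the card states. Under that assumption the two subsets reproduce the listed counts and add up to one sample fewer than the listed total.

```python
# Counts copied from the model card.
total_samples = 8437

# Assumption: each share is truncated to a whole number of samples.
train_samples = total_samples * 80 // 100
test_samples = total_samples * 20 // 100

# Samples not assigned to either subset under this assumption.
unassigned = total_samples - train_samples - test_samples

print(train_samples, test_samples, unassigned)
```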
+ ## Training procedure
+ 1. Data is preprocessed through custom scripts
+ 2. Data is passed to the model training script
+ 3. Training is conducted
+ 4. Best model is retrieved at end of training and uploaded to the Hub
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 20
+
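The hyperparameters above determine how many optimizer steps the run takes. The sketch below derives the step counts from the listed training-set size and batch size, and shows what a linear schedule would look like. It assumes a single device, no gradient accumulation, a kept final partial batch, and no warmup; none of these are stated in the card, and `linear_lr` is an illustrative helper, not the scheduler from the training script.

```python
import math

# Values copied from the model card.
train_samples = 6749
train_batch_size = 16
num_epochs = 20
learning_rate = 2e-05

# Assumption: the last, smaller batch of each epoch is kept.
steps_per_epoch = math.ceil(train_samples / train_batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step: int) -> float:
    """Learning rate under linear decay to zero, assuming no warmup."""
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / total_steps


print(steps_per_epoch, total_steps, linear_lr(total_steps // 2))
```

Under these assumptions the step counts come out at 422 per epoch and 8440 in total, which line up with the Step column of the training results.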
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
+ |:-------------:|:-----:|:----:|:---------------:|:--------:|:---------:|:------:|:------:|
+ | 0.1719 | 1.0 | 422 | 0.1171 | 0.9662 | 0.9813 | 0.9493 | 0.9650 |
+ | 0.0565 | 2.0 | 844 | 0.1595 | 0.9621 | 0.9421 | 0.9831 | 0.9622 |
+ | 0.0221 | 3.0 | 1266 | 0.2059 | 0.9585 | 0.9859 | 0.9287 | 0.9565 |
+ | 0.009 | 4.0 | 1688 | 0.1378 | 0.9722 | 0.9600 | 0.9843 | 0.9720 |
+ | 0.0021 | 5.0 | 2110 | 0.2013 | 0.9722 | 0.9863 | 0.9565 | 0.9712 |
+ | 0.0069 | 6.0 | 2532 | 0.2894 | 0.9615 | 0.9948 | 0.9263 | 0.9593 |
+ | 0.0054 | 7.0 | 2954 | 0.2692 | 0.9650 | 0.9949 | 0.9336 | 0.9632 |
+ | 0.0058 | 8.0 | 3376 | 0.2406 | 0.9639 | 0.9776 | 0.9481 | 0.9626 |
+ | 0.0017 | 9.0 | 3798 | 0.1877 | 0.9722 | 0.9654 | 0.9783 | 0.9718 |
+ | 0.0019 | 10.0 | 4220 | 0.2761 | 0.9686 | 0.9850 | 0.9505 | 0.9674 |
+ | 0.007 | 11.0 | 4642 | 0.1889 | 0.9722 | 0.9875 | 0.9553 | 0.9711 |
+ | 0.0007 | 12.0 | 5064 | 0.2774 | 0.9662 | 0.9837 | 0.9469 | 0.9649 |
+ | 0.0008 | 13.0 | 5486 | 0.2344 | 0.9722 | 0.9791 | 0.9638 | 0.9714 |
+ | 0.0 | 14.0 | 5908 | 0.2768 | 0.9662 | 0.9789 | 0.9517 | 0.9651 |
+ | 0.0 | 15.0 | 6330 | 0.2798 | 0.9662 | 0.9789 | 0.9517 | 0.9651 |
+ | 0.0 | 16.0 | 6752 | 0.2790 | 0.9668 | 0.9789 | 0.9529 | 0.9657 |
+ | 0.0 | 17.0 | 7174 | 0.2850 | 0.9668 | 0.9789 | 0.9529 | 0.9657 |
+ | 0.0 | 18.0 | 7596 | 0.2837 | 0.9668 | 0.9789 | 0.9529 | 0.9657 |
+ | 0.0 | 19.0 | 8018 | 0.2835 | 0.9674 | 0.9789 | 0.9541 | 0.9664 |
+ | 0.0 | 20.0 | 8440 | 0.2842 | 0.9674 | 0.9789 | 0.9541 | 0.9664 |
+
+
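The headline metrics in the card match the epoch 1 row of the table above, which is also the row with the lowest validation loss. The sketch below shows that selection, assuming the "best model" was chosen by validation loss; the card does not name the criterion, so this is an inference. For contrast it also picks the first epoch with the highest accuracy, which gives a different answer.

```python
# Validation loss and accuracy per epoch, copied from the table (epochs 1 to 20).
val_loss = [
    0.1171, 0.1595, 0.2059, 0.1378, 0.2013, 0.2894, 0.2692, 0.2406, 0.1877, 0.2761,
    0.1889, 0.2774, 0.2344, 0.2768, 0.2798, 0.2790, 0.2850, 0.2837, 0.2835, 0.2842,
]
accuracy = [
    0.9662, 0.9621, 0.9585, 0.9722, 0.9722, 0.9615, 0.9650, 0.9639, 0.9722, 0.9686,
    0.9722, 0.9662, 0.9722, 0.9662, 0.9662, 0.9668, 0.9668, 0.9668, 0.9674, 0.9674,
]

# Epochs are 1-based; min/max return the first index on ties.
best_by_loss = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_by_accuracy = max(range(len(accuracy)), key=accuracy.__getitem__) + 1

print(best_by_loss, best_by_accuracy)
```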
+ ### Framework versions
+
+ - Transformers 4.22.2
+ - PyTorch 1.12.1+cu113
+ - Datasets 2.5.1
+ - Tokenizers 0.12.1
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "real",
+     "1": "fake"
+   },
+   "label2id": {
+     "real": 0,
+     "fake": 1
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.22.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
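The `id2label` mapping in this config is what turns the classifier's two output logits into the `real` and `fake` labels. The sketch below illustrates that decoding step in plain Python, without loading the model; the logits are made up for the example, and `classify` is a hypothetical helper, not part of the repository.

```python
import math

# Mapping copied from config.json (keys converted to integers).
ID2LABEL = {0: "real", 1: "fake"}


def classify(logits):
    """Apply a softmax to the logits and return (label, probability)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for illustration only; index 1 is larger, so the label is "fake".
label, score = classify([-1.2, 2.3])
print(label, round(score, 4))
```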
.gitattributes ADDED
@@ -0,0 +1,32 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
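These patterns decide which files in the commit are stored through Git LFS. The sketch below approximates the matching with Python's `fnmatch` for a subset of the patterns and the file names visible in this commit. Git's own attribute matching has extra rules (for example for `**` and directory patterns), so this is only a rough illustration for simple `*.ext` patterns.

```python
from fnmatch import fnmatchcase

# A subset of the patterns from the attributes file above.
LFS_PATTERNS = ["*.bin", "*.h5", "*.onnx", "*.pt", "*.zip", "*tfevents*"]

# File names that appear in this commit's diff.
uploaded = [
    "README.md",
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.txt",
]

# Keep the files whose name matches at least one LFS pattern.
lfs_tracked = [
    name for name in uploaded
    if any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)
]

print(lfs_tracked)
```

This is consistent with the diff, where the two `.bin` files appear as LFS pointer files and the JSON and text files do not.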
.gitignore ADDED
@@ -0,0 +1 @@
+ checkpoint-*/
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4c554f2870212633d6be09415d8e832bf3cffcf468beccc158106236c5731b5
+ size 1340711725
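The size recorded in this pointer can be cross-checked against the architecture in `config.json`. The sketch below counts the parameters of a standard BERT encoder with a pooler and a two-label classification head, then multiplies by four bytes per `float32` value. The layer layout is the usual BERT one and is assumed here, not read from the checkpoint; the small remaining gap would be serialization overhead and non-parameter buffers.

```python
# Values copied from config.json.
vocab_size = 30522
hidden = 1024
intermediate = 4096
num_layers = 24
max_positions = 512
type_vocab = 2
num_labels = 2  # "real" and "fake"


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total_params = embeddings + num_layers * per_layer + pooler + classifier
estimated_bytes = total_params * 4  # float32

pointer_size = 1340711725  # from the LFS pointer above
relative_gap = abs(pointer_size - estimated_bytes) / pointer_size

print(total_params, estimated_bytes, f"{relative_gap:.6f}")
```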
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "model_max_length": 128,
+   "name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "special_tokens_map_file": null,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93e47cece7d97b6ad7b6d1c4d367c9283316641019143a13746823d55f2f7692
+ size 3375
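The three lines above are a Git LFS pointer file: the repository stores this small text stub, and the real binary is fetched by its hash. The sketch below parses such a pointer into its fields. `parse_lfs_pointer` is an illustrative helper written for this example, not part of Git LFS or the Hub tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from the training_args.bin diff above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:93e47cece7d97b6ad7b6d1c4d367c9283316641019143a13746823d55f2f7692
size 3375
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], pointer["size"])
```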
vocab.txt ADDED
The diff for this file is too large to render. See raw diff