KadriMufti committed
Commit: 965c7e3
Parent(s): 477a4c1
Upload 9 files
- README.md +41 -0
- fingerprint.pb +3 -0
- keras_metadata.pb +3 -0
- saved_model.pb +3 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +16 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,41 @@
## Purpose:

This model is a query classifier for the Arabic language. It returns 0 for a keyword query and 1 for a fully-formed question.
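
Below is a minimal inference sketch. It assumes the checkpoint loads through the `transformers` TensorFlow auto classes (the uploaded files are a TensorFlow SavedModel plus the BERT tokenizer files); the repo id is a placeholder and the example queries are only illustrative.

```python
# Minimal inference sketch. Assumptions: the placeholder repo id below, and that
# the weights load through the TF auto classes; adjust to the actual checkpoint.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_id = "<this-model-repo-id>"  # placeholder for the actual repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForSequenceClassification.from_pretrained(model_id)

queries = [
    "ما هي عاصمة المغرب؟",  # fully-formed question -> expected label 1
    "عاصمة المغرب",          # keyword query -> expected label 0
]
inputs = tokenizer(queries, padding=True, truncation=True, return_tensors="tf")
logits = model(**inputs).logits
print(tf.argmax(logits, axis=-1).numpy())  # 1 = question, 0 = keyword query
```
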
It was built in three steps.

1. Take the same Kaggle training data that Shahrukh used, keeping only the 'dev.csv' data, which is more than sufficient, and later split it into new train, validation, and test sets. Translate it into Arabic using the Seq2Seq translation model "facebook/m2m100_1.2B". The priority was syntactically correct translations, not necessarily semantically correct ones: the words of a keyword query were translated individually and recombined into one string, while questions were translated as-is, which sometimes produced a mix of Arabic and English (I think because of the m2m model's vocabulary size and tokenizer). About 28% of the training data had question marks written explicitly. A translation sketch is shown after this list.
2. Use the model [ARBERT](https://huggingface.co/UBC-NLP/ARBERT) as the base and fine-tune it on the above data (a fine-tuning sketch follows the list).
3. Distill the above model into a smaller one (a distillation sketch follows the list). I was not very successful in reducing the size significantly, although I did reduce the number of hidden layers from 12 to 4.
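
A minimal sketch of the translation step described in (1), using the standard `transformers` M2M100 classes; the helper functions are illustrative, not the original pipeline code:

```python
# Sketch of the English -> Arabic translation with facebook/m2m100_1.2B.
# Keyword queries are translated word by word and recombined into one string;
# questions are translated as full sentences. Helper names are illustrative.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B", src_lang="en")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")

def translate_en_to_ar(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("ar"))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def translate_keyword_query(query: str) -> str:
    # Translate each word separately, then recombine into one string.
    return " ".join(translate_en_to_ar(word) for word in query.split())

print(translate_en_to_ar("What is the capital of Morocco?"))  # full question
print(translate_keyword_query("capital morocco"))             # keyword query
```
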
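
A minimal sketch of step (2), fine-tuning ARBERT as a binary classifier with the Hugging Face `Trainer`; the file names, column names, and hyperparameters are assumptions, not the settings actually used:

```python
# Sketch of fine-tuning UBC-NLP/ARBERT for binary query classification.
# File names, the "text"/"label" column names, and hyperparameters are
# illustrative, not the settings actually used for this model.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/ARBERT")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/ARBERT", num_labels=2)

# Expects CSVs with a "text" column (the Arabic query) and a "label" column (0 or 1).
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(
    output_dir="arbert-arabic-query-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=32,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["validation"])
trainer.train()
```
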
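
A minimal sketch of step (3), loosely following the knowledge-distillation recipe from the Phil Schmid article cited below: a `Trainer` subclass that blends cross-entropy on the hard labels with KL divergence against the teacher's temperature-scaled logits. The 4-layer student config, temperature, and loss weight are illustrative.

```python
# Sketch of the distillation step, loosely following the Phil Schmid article
# cited below: the student is trained on a mix of cross-entropy against the
# hard labels and KL divergence against the teacher's temperature-scaled
# logits. The 4-layer student config, temperature, and alpha are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoConfig, AutoModelForSequenceClassification, Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model.to(self.args.device).eval()
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        student_loss = outputs.loss  # cross-entropy against the hard labels
        with torch.no_grad():
            teacher_logits = self.teacher(**inputs).logits
        kd_loss = F.kl_div(
            F.log_softmax(outputs.logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean",
        ) * (self.temperature ** 2)
        loss = self.alpha * student_loss + (1.0 - self.alpha) * kd_loss
        return (loss, outputs) if return_outputs else loss

# Student: same architecture family as the teacher, but 4 hidden layers instead of 12.
student_config = AutoConfig.from_pretrained("UBC-NLP/ARBERT", num_labels=2, num_hidden_layers=4)
student = AutoModelForSequenceClassification.from_config(student_config)
# trainer = DistillationTrainer(model=student, teacher_model=finetuned_teacher, args=..., ...)
```
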
Results of testing on the distilled model:

{'accuracy': 0.9812329107631121,
'precision': 0.9833664349553128,
'recall': 0.9792336217552534,
'roc_auc': 0.98124390410432,
'f1': 0.9812956769478509,
'matthews': 0.9624741598127332,
'mse': 0.018767089236887895,
'brier': 0.018767089236887895}
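
For reference, a sketch of how these metrics can be computed with scikit-learn from test-set labels and predictions. The arrays below are toy stand-ins, and hard 0/1 predictions are assumed throughout (the original evaluation setup is not stated here).

```python
# Toy stand-ins; in practice y_true / y_pred come from running the distilled
# model on the held-out test set. Hard 0/1 predictions are assumed throughout.
import numpy as np
from sklearn.metrics import (accuracy_score, brier_score_loss, f1_score,
                             matthews_corrcoef, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0])

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "matthews": matthews_corrcoef(y_true, y_pred),
    "mse": mean_squared_error(y_true, y_pred),
    "brier": brier_score_loss(y_true, y_pred),
}
print(metrics)
```
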
## Thanks:
This model was inspired by this GitHub [thread](https://github.com/deepset-ai/haystack/issues/611), in which building a query classifier model is discussed, and by [Shahrukh Khan's](https://github.com/shahrukhx01) resulting English model based on DistilBERT.
Regarding the model distillation, I owe thanks to the following sources:

[Knowledge Distillation article by Phil Schmid](https://www.philschmid.de/knowledge-distillation-bert-transformers)

Articles by Remi Reboul:

https://towardsdatascience.com/distillation-of-bert-like-models-the-theory-32e19a02641f

https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a
fingerprint.pb
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:06d7d1dc9501deb3e07d530bd9df67f51cf4f44836a78d6d72f6f7a1e7801936
size 54
keras_metadata.pb
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bd017f0815a10788f3b9632699b0183138f3be61744f1a9870dc40af5774be58
size 65206
saved_model.pb
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c1e0f94554c9dfd0338dfd5227642814901c7c1dea9ee162a6fc9302942d0f55
size 2943124
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1,16 @@
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "name_or_path": "UBC-NLP/ARBERT",
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:67d320e32bfafbad2636883057fb568bda6bc2a821ba0370eee48c678655bff7
size 3707
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff