---
language:
- ar
metrics:
- accuracy
- f1
- recall
- precision
- brier_score
- matthews_correlation
- roc_auc
- mse
pipeline_tag: text-classification
tags:
- haystack
- arabic
- question classification
- query classification
- query classifier
---
## Purpose:

This model is a query classifier for the Arabic language, which can be used on its own or within a Haystack pipeline. It returns 0 for a keyword query and 1 for a fully-formed question.
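
Used on its own, the model can be called through the `transformers` pipeline API. This is a minimal sketch: the repo id is a placeholder (the card does not state it), and the `LABEL_n` naming assumes the default label config.

```python
def load_classifier(model_id):
    """Build a text-classification pipeline for the given Hub repo id
    (the id is supplied by the caller; none is assumed here)."""
    from transformers import pipeline  # lazy import keeps the sketch light
    return pipeline("text-classification", model=model_id)

def label_to_int(label):
    """Map default labels such as 'LABEL_1' to 0/1 (1 = fully-formed question)."""
    return int(label.rsplit("_", 1)[-1])

# Usage (downloads the model weights):
# clf = load_classifier("<this-model's-hub-repo-id>")
# label_to_int(clf("ما هي عاصمة مصر؟")[0]["label"])
```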

It was built in three steps.

1. Take the same useful Kaggle training data that Sharukh used, keeping only the 'dev.csv' file, which is more than sufficient, and later split it into new train, validation, and test sets. Translate it into Arabic with the Seq2Seq translation model "facebook/m2m100_1.2B". The priority was syntactically correct translations, not necessarily semantically correct ones. To that end, the words in keyword queries were translated individually and recombined into one string, while the questions were translated as-is; the results were sometimes a mix of Arabic and English (likely due to the m2m model's vocabulary size and tokenizer). About 28% of the training data had question marks written explicitly.

2. Use the model [ARBERT](https://huggingface.co/UBC-NLP/ARBERT) as the base and fine-tune it on the above data.

3. Distill the above model into a smaller one. I was not very successful in reducing the size significantly, although I did reduce the number of hidden layers from 12 to 4.
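
The two translation strategies from step 1 can be sketched as below; `translate` stands in for a call to "facebook/m2m100_1.2B" generation and is passed in as a plain function so the sketch stays lightweight:

```python
def translate_keyword_query(query, translate):
    """Keyword queries: translate each word on its own, then recombine.
    This favours per-word syntactic correctness over sentence semantics."""
    return " ".join(translate(word) for word in query.split())

def translate_question(question, translate):
    """Fully-formed questions: translate the whole string in one pass."""
    return translate(question)
```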

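Step 2 can be sketched with the Hugging Face `Trainer`; the CSV file names, the `text` column, and the hyperparameters below are illustrative assumptions, not the values actually used:

```python
def finetune(train_file="train.csv", val_file="val.csv",
             output_dir="arbert-query-classifier"):
    # Imports are local so the sketch can be read without the libraries installed.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/ARBERT")
    model = AutoModelForSequenceClassification.from_pretrained(
        "UBC-NLP/ARBERT", num_labels=2)  # 0 = keyword query, 1 = question

    data = load_dataset("csv", data_files={"train": train_file,
                                           "validation": val_file})
    data = data.map(lambda batch: tokenizer(batch["text"], truncation=True),
                    batched=True)

    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=data["train"],
            eval_dataset=data["validation"], tokenizer=tokenizer).train()
```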

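The distillation in step 3 follows the recipe from the articles credited below: initialise the student from a subset of teacher layers, then train it against a blend of hard labels and the teacher's softened logits. A sketch assuming PyTorch, with an illustrative temperature and weighting:

```python
import torch
import torch.nn.functional as F

def pick_student_layers(n_teacher=12, n_student=4):
    """Evenly spaced teacher layer indices used to initialise the student
    (one common heuristic; the cited articles copy alternating layers)."""
    step = n_teacher / n_student
    return [round(i * step) for i in range(n_student)]

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL term (scaled by T^2) blended with ordinary cross-entropy."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```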
Results of testing the distilled model:

| Measure | Score |
| :------ | :---- |
| accuracy | 0.981 |
| precision | 0.983 |
| recall | 0.979 |
| roc_auc | 0.981 |
| f1 | 0.981 |
| matthews | 0.962 |
| mse | 0.01876 |
| brier | 0.01876 |

(In this case the Brier score equals the MSE because there are only two labels.)
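
That equivalence is easy to check: with two labels the Brier score reduces to the mean squared error between the predicted positive-class probability and the 0/1 label (with more classes it sums squared errors over every class probability). A quick sketch:

```python
def brier_score(probs, labels):
    """Mean squared difference between P(label = 1) and the 0/1 label;
    with only two labels this is identical to the MSE."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
```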


## Thanks:

This model was inspired by this GitHub [thread](https://github.com/deepset-ai/haystack/issues/611), in which building a query classifier model is discussed, and by [Sharukh Khan's](https://github.com/shahrukhx01) resulting English model based on DistilBERT.

Regarding the model distillation, I owe thanks to the following sources:

[Knowledge Distillation article by Phil Schmid](https://www.philschmid.de/knowledge-distillation-bert-transformers)

Articles by Remi Reboul:

[Distillation Part 1](https://towardsdatascience.com/distillation-of-bert-like-models-the-theory-32e19a02641f)

[Distillation Part 2](https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a)