---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
    - sentence-transformers
    - feature-extraction
    - sentence-similarity
    - transformers
datasets:
    - stsb_multi_mt
metrics:
    - pearsonr
base_model: cmarkea/distilcamembert-base
model-index:
  - name: sts-distilcamembert-base
    results:
      - task:
          name: Sentence Similarity
          type: sentence-similarity
        dataset:
          name: STSb French
          type: stsb_multi_mt
          args: fr
        metrics:
          - name: Pearson Correlation - stsb_multi_mt fr
            type: pearsonr
            value: 0.8165
---

## Description

This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
[`cmarkea/distilcamembert-base`](https://huggingface.co/cmarkea/distilcamembert-base) with the
[sentence-transformers](https://www.SBERT.net) library.

It encodes a sentence or paragraph (512 tokens maximum) into a 768-dimensional vector.

The underlying [DistilCamemBERT](https://huggingface.co/papers/2205.11111) model is a distillation of
[CamemBERT](https://arxiv.org/abs/1911.03894) that halves the number of parameters while improving
inference time.

## Usage with the `sentence-transformers` library

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
embeddings = model.encode(sentences)
print(embeddings)
```
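
Since the model is trained for sentence similarity, the embeddings can be compared directly. A minimal sketch using the library's `util.cos_sim` helper:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("h4c5/sts-distilcamembert-base")

# Encode two French sentences and compare them with cosine similarity
embeddings = model.encode(["Ceci est un exemple", "deuxième exemple"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # 1x1 tensor holding the similarity score
```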


## Usage with the `transformers` library

```
pip install -U transformers
```

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

# Sentences to encode
sentences = ["Ceci est un exemple", "deuxième exemple"]

# Tokenize and compute the token embeddings (no gradients needed at inference)
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print(sentence_embeddings)
```
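
To obtain cosine similarities from these pooled embeddings, they can be L2-normalized so that dot products equal cosine similarities. A small sketch continuing from the code above:

```python
import torch.nn.functional as F

# L2-normalize the sentence embeddings; dot products then equal cosine similarities
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)  # pairwise cosine similarity matrix
```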


## Evaluation

The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:

```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, evaluation

model = SentenceTransformer("h4c5/sts-distilcamembert-base")


def dataset_to_input_examples(dataset):
    # Convert dataset rows into InputExample pairs, rescaling scores from [0, 5] to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

sts_test_evaluator(model, ".")
```
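
Given an output path (here `"."`), `EmbeddingSimilarityEvaluator` also writes a CSV of Pearson and Spearman correlations for several similarity measures (cosine, Euclidean, Manhattan, dot product), in addition to returning its main score.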

### Results

Below are the evaluation results of the model on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
(`fr` subset, `test` split):

| Model                                                                                                                                             | Pearson Correlation | Parameters |
| :------------------------------------------------------------------------------------------------------------------------------------------------ | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base)                                                                       |      **0.837**      |       110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base)                                                 |        0.835        |       110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)                                         |        0.828        |       137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base)                                                           |        0.817        |        68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) |        0.786        |       135M |



## Training
The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
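
(A DataLoader of length 180 at batch size 32 is consistent with the 5,749 sentence pairs of the `stsb_multi_mt` `fr` train split, with a partial final batch.)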

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` 

Parameters of the `fit()` method:
```
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
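
Putting these pieces together, a minimal sketch of how such a run might be reproduced with the classic `fit()` API. The train/dev data loading and the reuse of the `dataset_to_input_examples` helper from the evaluation section are assumptions, not the author's published training script:

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, evaluation, losses

# Assumption: fr train/dev splits of the dataset used for evaluation,
# converted with the dataset_to_input_examples helper defined above
train_examples = dataset_to_input_examples(
    load_dataset("stsb_multi_mt", name="fr", split="train")
)
dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    dataset_to_input_examples(load_dataset("stsb_multi_mt", name="fr", split="dev")),
    name="sts-dev",
)

# Loading a plain transformers checkpoint adds a mean-pooling module automatically
model = SentenceTransformer("cmarkea/distilcamembert-base")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    evaluation_steps=1000,
    epochs=10,
    warmup_steps=500,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```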


## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
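
The `Pooling` module with `pooling_mode_mean_tokens: True` performs the same masked averaging as the `mean_pooling` function shown in the `transformers` section above.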

## Citing

    @inproceedings{reimers-2019-sentence-bert,
        title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
        author = "Reimers, Nils and Gurevych, Iryna",
        booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
        month = "11",
        year = "2019",
        publisher = "Association for Computational Linguistics",
        journal={"https://arxiv.org/abs/1908.10084"},
    }

    @inproceedings{sanh2019distilbert,
        title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
        author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
        booktitle={NeurIPS EMC^2 Workshop},
        url={https://arxiv.org/abs/1910.01108},
        year={2019}
    }

    @inproceedings{martin2020camembert,
        title={CamemBERT: a Tasty French Language Model},
        author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
        booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
        url={https://arxiv.org/abs/1911.03894},
        year={2020}
    }

    @inproceedings{delestre:hal-03674695,
        TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
        AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
        URL = {https://hal.archives-ouvertes.fr/hal-03674695},
        BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
        ADDRESS = {Vannes, France},
        YEAR = {2022},
        MONTH = Jul,
        KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
        PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
        HAL_ID = {hal-03674695},
        HAL_VERSION = {v1},
        ARXIV = {https://arxiv.org/abs/2205.11111},
    }