---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- stsb_multi_mt
metrics:
- pearsonr
base_model: cmarkea/distilcamembert-base
model-index:
- name: sts-distilcamembert-base
  results:
  - task:
      name: Sentence Similarity
      type: sentence-similarity
    dataset:
      name: STSb French
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Pearson Correlation - stsb_multi_mt fr
      type: pearsonr
      value: 0.8165
---
## Description
This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
[`cmarkea/distilcamembert-base`](https://huggingface.co/cmarkea/distilcamembert-base) with the
[sentence-transformers](https://www.SBERT.net) library.
It encodes a sentence or a paragraph (512 tokens maximum) into a 768-dimensional vector.
The underlying [DistilCamemBERT](https://huggingface.co/papers/2205.11111) model is a distillation of
[CamemBERT](https://arxiv.org/abs/1911.03894) that roughly halves the number of parameters
and improves inference time.
## Usage with the `sentence-transformers` library
```
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
sentences = ["Ceci est un exemple", "deuxième exemple"]
model = SentenceTransformer('h4c5/sts-distilcamembert-base')
embeddings = model.encode(sentences)
print(embeddings)
```
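Since the model is trained for semantic similarity, the embeddings are typically compared with cosine similarity. The sketch below shows that comparison in plain Python. The three-dimensional vectors are made-up stand-ins for the 768-dimensional output of `model.encode`, and `cosine_similarity` is an illustrative helper, not part of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (real ones have 768 dimensions).
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.4, 0.2, 1.4]   # same direction as emb_a, twice the length
emb_c = [0.7, 0.0, -0.2]  # orthogonal to emb_a

score_ab = cosine_similarity(emb_a, emb_b)  # close to 1: same direction
score_ac = cosine_similarity(emb_a, emb_c)  # close to 0: unrelated direction
print(score_ab, score_ac)
```

With real embeddings, `sentence_transformers.util.cos_sim(embeddings, embeddings)` computes the full pairwise similarity matrix in one call.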
## Usage with the `transformers` library
```
pip install -U transformers
```
```python
from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["Ceci est un exemple", "deuxième exemple"]

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


# Mean pooling: average the token embeddings, ignoring padding positions
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Tokenize the sentences and compute the token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    model_output = model(**encoded_input)

# Mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print(sentence_embeddings)
```
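To make the role of the attention mask concrete, here is a minimal pure-Python sketch of the same masked average on a single toy sequence. The function name `masked_mean` and the numbers are hypothetical; the `max(..., 1)` guard plays the same role as the `torch.clamp` call above.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is skipped."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    count = max(len(kept), 1)  # avoid dividing by zero on an all-padding input
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]: the padding vector does not affect the result
```

Without the mask, the padding vector would drag the average toward `[100.0, 100.0]`, which is why the sentence embedding would otherwise depend on how much padding a batch happens to contain.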
## Evaluation
The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:
```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, evaluation

model = SentenceTransformer("h4c5/sts-distilcamembert-base")


def dataset_to_input_examples(dataset):
    # Rescale the gold similarity scores from [0, 5] to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

# The second argument is the output path for the evaluation results
sts_test_evaluator(model, ".")
```
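The reported metric is the Pearson correlation between predicted similarities and gold scores. As a rough illustration, the sketch below computes it by hand in plain Python. The `pearson` helper and all numbers are hypothetical and are not taken from the actual evaluation run.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


predicted = [0.91, 0.15, 0.62, 0.40]       # hypothetical cosine similarities
gold_raw = [4.8, 0.5, 3.0, 2.5]            # hypothetical annotations on the 0-5 scale
gold_scaled = [g / 5.0 for g in gold_raw]  # same rescaling as in the snippet above

r_raw = pearson(predicted, gold_raw)
r_scaled = pearson(predicted, gold_scaled)
print(r_raw, r_scaled)
```

The two values agree because Pearson correlation is unchanged by a positive rescaling of either variable, so dividing the labels by 5 does not alter the reported score.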
### Results
The table below reports the evaluation results on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
(`fr` subset, `test` split):

| Model | Pearson correlation | Parameters |
| :--------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 0.786 | 135M |
## Training
The model was trained with the parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:
`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
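As a rough sketch of what this loss optimizes: for each sentence pair, the cosine similarity of the two embeddings is compared with the gold score, by default with a mean squared error. The plain-Python version below uses hypothetical two-dimensional vectors and helper names; it illustrates the idea and is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the gold score of each pair."""
    errors = [(cosine(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Hypothetical embeddings: one identical pair, one orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

loss_matching = cosine_similarity_loss(pairs, [1.0, 0.0])  # labels agree with the geometry
loss_swapped = cosine_similarity_loss(pairs, [0.0, 1.0])   # labels contradict the geometry
print(loss_matching, loss_swapped)  # 0.0 1.0
```

This is why the evaluation labels are rescaled to the `[0, 1]` range: the target has to be comparable with a cosine similarity.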
Parameters of the `fit()` method:
```
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
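To show how these values interact, the sketch below traces a linear warmup followed by a linear decay, which is what the `WarmupLinear` scheduler name suggests. It assumes that, with `steps_per_epoch` left at `null`, each epoch runs over the full DataLoader (180 batches), giving 1,800 optimizer steps in total. The function `learning_rate` is a hypothetical helper, not the library's scheduler.

```python
BASE_LR = 2e-5
WARMUP_STEPS = 500
STEPS_PER_EPOCH = 180  # DataLoader length reported above (assumed to be used as-is)
EPOCHS = 10
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS  # 1800


def learning_rate(step):
    """Linear ramp from 0 to BASE_LR, then linear decay back to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 1150, 1800):
    print(step, learning_rate(step))
```

Under these assumptions the warmup covers a little under three of the ten epochs (500 of 1,800 steps), and the learning rate peaks at `2e-05` at step 500.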
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
## Citing
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  journal={"https://arxiv.org/abs/1908.10084"},
}
@inproceedings{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  booktitle={NeurIPS EMC^2 Workshop},
  journal={https://arxiv.org/abs/1910.01108},
  year={2019}
}
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  journal={https://arxiv.org/abs/1911.03894},
  year={2020}
}
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
  journal={https://arxiv.org/abs/2205.11111},
}