|
---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- stsb_multi_mt
metrics:
- pearsonr
base_model: cmarkea/distilcamembert-base
model-index:
- name: sts-distilcamembert-base
  results:
  - task:
      name: Sentence Similarity
      type: sentence-similarity
    dataset:
      name: STSb French
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Pearson Correlation - stsb_multi_mt fr
      type: pearsonr
      value: 0.8165
---
|
|
|
## Description |
|
|
|
This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning [`cmarkea/distilcamembert-base`](https://huggingface.co/cmarkea/distilcamembert-base) with the [sentence-transformers](https://www.SBERT.net) library.
|
|
|
It encodes a sentence or paragraph (512 tokens maximum) into a 768-dimensional vector.
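
These two properties can be checked directly on the loaded model; a minimal sketch using the `sentence-transformers` API:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)  # 512
```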
|
|
|
The underlying [DistilCamemBERT](https://huggingface.co/papers/2205.11111) model is a distillation of [CamemBERT](https://arxiv.org/abs/1911.03894) that halves the number of parameters and speeds up inference.
|
|
|
## Usage with the `sentence-transformers` library
|
|
|
```
pip install -U sentence-transformers
```
|
|
|
```python
from sentence_transformers import SentenceTransformer

sentences = ["Ceci est un exemple", "deuxième exemple"]

# Load the model and encode the sentences into 768-dimensional vectors
model = SentenceTransformer("h4c5/sts-distilcamembert-base")
embeddings = model.encode(sentences)
print(embeddings)
```
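
Since the model targets sentence similarity, the resulting embeddings are typically compared with cosine similarity; a minimal sketch using the `sentence_transformers.util.cos_sim` helper:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
embeddings = model.encode(["Ceci est un exemple", "deuxième exemple"])

# Cosine similarity between the two sentence embeddings (1x1 tensor)
print(util.cos_sim(embeddings[0], embeddings[1]))
```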
|
|
|
|
|
## Usage with the `transformers` library
|
|
|
```
pip install -U transformers
```
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


# Mean pooling: average the token embeddings, ignoring padded positions
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Tokenize and compute token embeddings
sentences = ["Ceci est un exemple", "deuxième exemple"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling over the token embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print(sentence_embeddings)
```
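
To score similarity with these pooled embeddings, they can be L2-normalized and multiplied, since the dot product of unit vectors equals their cosine similarity; a minimal sketch that assumes the code above has just run:

```python
import torch.nn.functional as F

# L2-normalize so the dot product equals cosine similarity
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized_embeddings @ normalized_embeddings.T)
```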
|
|
|
|
|
## Evaluation |
|
|
|
The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:
|
|
|
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, evaluation

model = SentenceTransformer("h4c5/sts-distilcamembert-base")


def dataset_to_input_examples(dataset):
    # Convert dataset rows into InputExample pairs, scaling gold scores to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

sts_test_evaluator(model, ".")
```
|
|
|
### Results
|
|
|
Below are the model's evaluation results on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset (`fr` subset, `test` split):
|
|
|
| Model | Pearson Correlation | Parameters |
| :---------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.786 | 135M |
|
|
|
|
|
|
|
## Training |
|
The model was trained with the following parameters:
|
|
|
**DataLoader**: |
|
|
|
`torch.utils.data.dataloader.DataLoader` of length 180 with parameters: |
|
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
|
|
|
**Loss**: |
|
|
|
`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` |
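
`CosineSimilarityLoss` minimizes the mean squared error between the cosine similarity of the two sentence embeddings and the gold similarity score rescaled to [0, 1].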
|
|
|
Parameters of the `fit()` method: |
|
```
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
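
Putting these parameters together, a fine-tuning run of this kind could look roughly like the sketch below; the use of the `stsb_multi_mt` `fr` train split and the exact wiring are assumptions based on the parameters above, not the author's published training script:

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: training pairs come from the stsb_multi_mt "fr" train split,
# with gold scores rescaled from [0, 5] to [0, 1]
sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
train_examples = [
    InputExample(
        texts=[example["sentence1"], example["sentence2"]],
        label=example["similarity_score"] / 5.0,
    )
    for example in sts_train_dataset
]

# Loading a plain transformers checkpoint adds a mean-pooling layer by default
model = SentenceTransformer("cmarkea/distilcamembert-base")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

# Hyperparameters taken from the fit() configuration above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    scheduler="WarmupLinear",
    warmup_steps=500,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```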
|
|
|
|
|
## Full Model Architecture |
|
|
|
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
|
|
|
## Citing |
|
|
|
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
    author = {Reimers, Nils and Gurevych, Iryna},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    month = {11},
    year = {2019},
    publisher = {Association for Computational Linguistics},
    url = {https://arxiv.org/abs/1908.10084},
}

@inproceedings{sanh2019distilbert,
    title = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
    author = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
    booktitle = {NeurIPS EMC^2 Workshop},
    url = {https://arxiv.org/abs/1910.01108},
    year = {2019},
}

@inproceedings{martin2020camembert,
    title = {CamemBERT: a Tasty French Language Model},
    author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    url = {https://arxiv.org/abs/1911.03894},
    year = {2020},
}

@inproceedings{delestre:hal-03674695,
    title = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
    author = {Delestre, Cyrile and Amar, Abibatou},
    booktitle = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
    address = {Vannes, France},
    year = {2022},
    month = Jul,
    keywords = {NLP ; Transformers ; CamemBERT ; Distillation},
    url = {https://hal.archives-ouvertes.fr/hal-03674695},
    pdf = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
    hal_id = {hal-03674695},
    hal_version = {v1},
}
```
|
|