---
license: mit
datasets:
- DDSC/dagw_no_twitter
language:
- da
tags:
- SimCSE
---

A version of `chcaa/dfm-encoder-large-v1` trained using SimCSE. It was trained as part of the [Scandinavian Embedding Benchmark](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) (SEB) to establish a naive baseline for SimCSE.

**Note**: We do not recommend this model. Instead, we encourage you to check out the current best model on [SEB](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) or the [recommendation](https://huggingface.co/collections/danish-foundation-models/state-of-the-art-danish-models-65f01d84a10842712e186172) by the Danish Foundation Models team.

## Hyperparameters

Trained using the [SimCSE](https://github.com/princeton-nlp/SimCSE) implementation with:

```
# data/dfm_paragraphs.txt: paragraphs extracted from Danish Gigaword
CUDA_VISIBLE_DEVICES=0 python train.py \
    --train_file data/dfm_paragraphs.txt \
    --model_name_or_path chcaa/dfm-encoder-large-v1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --pooler_type cls \
    --mlp_only_train \
    --do_mlm \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --fp16
```
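To give a feel for what `--temp 0.05` controls, here is a minimal sketch of the contrastive (InfoNCE) objective that unsupervised SimCSE optimizes: each sentence is embedded twice with different dropout masks, and the loss pulls the two views of the same sentence together while pushing the other sentences in the batch away. This is an illustration in plain Python, not the code in `train.py`; the names `cosine` and `simcse_loss` and the toy embeddings are hypothetical, and the auxiliary masked-language-modelling term enabled by `--do_mlm` is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temp=0.05):
    """Mean InfoNCE loss over a batch (illustrative helper, not from the SimCSE repo).

    view_a[i] and view_b[i] are two embeddings of the same sentence i;
    every other row of view_b serves as an in-batch negative for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        # Similarities scaled by the temperature act as logits.
        logits = [cosine(anchor, other) / temp for other in view_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching index i as the correct class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings (made up for illustration).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(view_a, view_b)         # positives line up: small loss
shuffled = simcse_loss(view_a, view_b[::-1])  # positives mismatched: large loss
print(aligned, shuffled)
```

A lower temperature sharpens the softmax, so hard negatives that are close to the anchor contribute more strongly to the loss.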

## Citation

To cite this work, please refer to the following article:

```
Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72
```

or use the following BibTeX:

```
@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
    title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
    shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
    url = {https://openreview.net/forum?id=pJl_i7HIA72},
    language = {en},
    urldate = {2024-04-12},
    author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
    month = feb,
    year = {2024},
}
```