Edit model card

SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 tokens
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'What does this text say about data privacy?',
    'information during GAI training and maintenance. \nHuman-AI Configuration; Obscene, \nDegrading, and/or Abusive \nContent; Value Chain and \nComponent Integration; \nDangerous, Violent, or Hateful \nContent \nMS-2.6-002 \nAssess existence or levels of harmful bias, intellectual property infringement, \ndata privacy violations, obscenity, extremism, violence, or CBRN information in \nsystem training data. \nData Privacy; Intellectual Property; \nObscene, Degrading, and/or \nAbusive Content; Harmful Bias and \nHomogenization; Dangerous, \nViolent, or Hateful Content; CBRN \nInformation or Capabilities \nMS-2.6-003 Re-evaluate safety features of fine-tuned models when the negative risk exceeds \norganizational risk tolerance. \nDangerous, Violent, or Hateful \nContent \nMS-2.6-004 Review GAI system outputs for validity and safety: Review generated code to \nassess risks that may arise from unreliable downstream decision-making. \nValue Chain and Component \nIntegration; Dangerous, Violent, or \nHateful Content',
    'Scheurer, J. et al. (2023) Technical report: Large language models can strategically deceive their users \nwhen put under pressure. arXiv. https://arxiv.org/abs/2311.07590 \nShelby, R. et al. (2023) Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm \nReduction. arXiv. https://arxiv.org/pdf/2210.05791 \nShevlane, T. et al. (2023) Model evaluation for extreme risks. arXiv. https://arxiv.org/pdf/2305.15324 \nShumailov, I. et al. (2023) The curse of recursion: training on generated data makes models forget. arXiv. \nhttps://arxiv.org/pdf/2305.17493v2 \nSmith, A. et al. (2023) Hallucination or Confabulation? Neuroanatomy as metaphor in Large Language \nModels. PLOS Digital Health. \nhttps://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000388 \nSoice, E. et al. (2023) Can large language models democratize access to dual-use biotechnology? arXiv. \nhttps://arxiv.org/abs/2306.03809',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Training Details

Training Dataset

Unnamed Dataset

  • Size: 555 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 555 samples:
    sentence_0 sentence_1
    type string string
    details
    • min: 10 tokens
    • mean: 11.2 tokens
    • max: 12 tokens
    • min: 156 tokens
    • mean: 199.37 tokens
    • max: 256 tokens
  • Samples:
    sentence_0 sentence_1
    What does this text say about trustworthiness? other systems.
    Information Integrity; Value Chain
    and Component Integration
    MP-2.2-002
    Observe and analyze how the GAI system interacts with external networks, and
    identify any potential for negative externalities, particularly where content
    provenance might be compromised.
    Information Integrity
    AI Actor Tasks: End Users

    MAP 2.3: Scientific integrity and TEVV considerations are identified and documented, including those related to experimental
    design, data collection and selection (e.g., availability, representativeness, suitability), system trustworthiness, and construct
    validation
    Action ID
    Suggested Action
    GAI Risks
    MP-2.3-001
    Assess the accuracy, quality, reliability, and authenticity of GAI output by
    comparing it to a set of known ground truth data and by using a variety of
    evaluation methods (e.g., human oversight and automated evaluation, proven
    cryptographic techniques, review of content inputs).
    Information Integrity

    25
    What does this text say about unclassified? training and TEVV data; Filtering of hate speech or content in GAI system
    training data; Prevalence of GAI-generated data in GAI system training data.
    Harmful Bias and Homogenization


    15 Winogender Schemas is a sample set of paired sentences which differ only by gender of the pronouns used,
    which can be used to evaluate gender bias in natural language processing coreference resolution systems.

    37
    MS-2.11-005
    Assess the proportion of synthetic to non-synthetic training data and verify
    training data is not overly homogenous or GAI-produced to mitigate concerns of
    model collapse.
    Harmful Bias and Homogenization
    AI Actor Tasks: AI Deployment, AI Impact Assessment, Affected Individuals and Communities, Domain Experts, End-Users,
    Operation and Monitoring, TEVV

    MEASURE 2.12: Environmental impact and sustainability of AI model training and management activities – as identified in the MAP
    function – are assessed and documented.
    Action ID
    Suggested Action
    GAI Risks
    What does this text say about unclassified? Padmakumar, V. et al. (2024) Does writing with language models reduce content diversity? ICLR.
    https://arxiv.org/pdf/2309.05196
    Park, P. et. al. (2024) AI deception: A survey of examples, risks, and potential solutions. Patterns, 5(5).
    arXiv. https://arxiv.org/pdf/2308.14752
    Partnership on AI (2023) Building a Glossary for Synthetic Media Transparency Methods, Part 1: Indirect
    Disclosure. https://partnershiponai.org/glossary-for-synthetic-media-transparency-methods-part-1-
    indirect-disclosure/
    Qu, Y. et al. (2023) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-
    To-Image Models. arXiv. https://arxiv.org/pdf/2305.13873
    Rafat, K. et al. (2023) Mitigating carbon footprint for knowledge distillation based deep learning model
    compression. PLOS One. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285668
    Said, I. et al. (2022) Nonconsensual Distribution of Intimate Images: Exploring the Role of Legal Attitudes
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Framework Versions

  • Python: 3.11.5
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cpu
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Downloads last month
5
Safetensors
Model size
22.7M params
Tensor type
F32
·
Inference Examples
Inference API (serverless) is not available, repository is disabled.

Model tree for svb01/fine-tuned-embedding-model

Finetuned
this model