SentenceTransformer based on BAAI/bge-m3

This is a sentence-transformers model finetuned from BAAI/bge-m3. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-m3
Maximum Sequence Length: 512 tokens
Output Dimensionality: 1024 tokens
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("comet24082002/ft_bge_newLaw_CachedMultipleNegativeRankingLoss_V1_5epochs")
# Run inference
sentences = [
    'Công ty có quyền giảm lương khi người lao động không đảm bảo hiệu suất công việc?',
    '"Điều 94. Nguyên tắc trả lương\n1. Người sử dụng lao động phải trả lương trực tiếp, đầy đủ, đúng hạn cho người lao động. Trường hợp người lao động không thể nhận lương trực tiếp thì người sử dụng lao động có thể trả lương cho người được người lao động ủy quyền hợp pháp.\n2. Người sử dụng lao động không được hạn chế hoặc can thiệp vào quyền tự quyết chi tiêu lương của người lao động; không được ép buộc người lao động chi tiêu lương vào việc mua hàng hóa, sử dụng dịch vụ của người sử dụng lao động hoặc của đơn vị khác mà người sử dụng lao động chỉ định."',
    'Các biện pháp tăng cường an toàn hoạt động bay\nCục Hàng không Việt Nam áp dụng các biện pháp tăng cường sau:\n1. Phổ biến kinh nghiệm, bài học liên quan trên thế giới và tại Việt Nam cho các tổ chức, cá nhân liên quan trực tiếp đến hoạt động bay bằng các hình thức thích hợp.\n2. Tổ chức thực hiện, giám sát kết quả thực hiện khuyến cáo an toàn của các cuộc điều tra tai nạn tàu bay, sự cố trong lĩnh vực hoạt động bay.\n3. Tổng kết, đánh giá và phân tích định kỳ hàng năm việc thực hiện quản lý an toàn hoạt động bay; tổ chức khắc phục các hạn chế, yêu cầu, đề nghị liên quan nhằm hoàn thiện công tác quản lý an toàn và SMS.\n4. Tổ chức huấn luyện, đào tạo về an toàn hoạt động bay.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Training Details

Training Dataset

Unnamed Dataset

Size: 10,524 training samples
Columns: anchor and positive
Approximate statistics based on the first 1000 samples:
anchor positive
type string string
details
min: 8 tokens
mean: 24.42 tokens
max: 49 tokens

min: 18 tokens
mean: 272.22 tokens
max: 512 tokens

	anchor	positive
type	string	string
details	min: 8 tokens mean: 24.42 tokens max: 49 tokens	min: 18 tokens mean: 272.22 tokens max: 512 tokens

Samples:

anchor	positive
`Người thừa kế theo di chúc gồm những ai?`	`"Điều 613. Người thừa kế Người thừa kế là cá nhân phải là người còn sống vào thời điểm mở thừa kế hoặc sinh ra và còn sống sau thời điểm mở thừa kế nhưng đã thành thai trước khi người để lại di sản chết. Trường hợp người thừa kế theo di chúc không là cá nhân thì phải tồn tại vào thời điểm mở thừa kế."`
`Đầu tư vốn nhà nước vào doanh nghiệp được thực hiện bằng những hình thức nào?`	Hình thức đầu tư vốn nhà nước vào doanh nghiệp 1. Đầu tư vốn nhà nước để thành lập doanh nghiệp do Nhà nước nắm giữ 100% vốn điều lệ. 2. Đầu tư bổ sung vốn điều lệ cho doanh nghiệp do Nhà nước nắm giữ 100% vốn điều lệ đang hoạt động. 3. Đầu tư bổ sung vốn nhà nước để tiếp tục duy trì tỷ lệ cổ phần, vốn góp của Nhà nước tại công ty cổ phần, công ty trách nhiệm hữu hạn hai thành viên trở lên. 4. Đầu tư vốn nhà nước để mua lại một phần hoặc toàn bộ doanh nghiệp.
`Thủ tục thành lập trung tâm hiến máu chữ thập đỏ có quy định như thế nào?`	Thủ tục thành lập cơ sở hiến máu chữ thập đỏ Thủ tục thành lập cơ sở hiến máu chữ thập đỏ thực hiện theo quy định tại Điều 4, Thông tư số 03/2013/TT-BNV ngày 16 tháng 4 năm 2013 của Bộ Nội vụ quy định chi tiết thi hành Nghị định số 45/2010/NĐ-CP ngày 21 tháng 4 năm 2010 của Chính phủ quy định về tổ chức, hoạt động và quản lý hội và Nghị định số 33/2012/NĐ-CP ngày 13 tháng 4 năm 2012 của Chính phủ sửa đổi, bổ sung một số điều của Nghị định số 45/2010/NĐ-CP

Loss: CachedMultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 256
learning_rate: 2e-05
num_train_epochs: 5
warmup_ratio: 0.1

All Hyperparameters

Click to expand

overwrite_output_dir: False
do_predict: False
prediction_loss_only: True
per_device_train_batch_size: 256
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs