---
license: mit
language:
- ko
- vi
metrics:
- bleu
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
library_name: transformers
tags:
- mbart
- mbart-50
- text2text-generation
---

# Model Card for mbart-large-50-mmt-ko-vi

This model is fine-tuned from mBART-large-50 on multilingual translation data of Korean legal documents for Korean-to-Vietnamese translation.

---

## Table of Contents

- [Model Card for mbart-large-50-mmt-ko-vi](#model-card-for-mbart-large-50-mmt-ko-vi)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
  - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)
  - [Testing Data](#testing-data)
  - [Metrics](#metrics)
  - [Results](#results)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Citation](#citation)
- [Model Card Contact](#model-card-contact)

---

## Model Details

### Model Description

- **Developed by:** Jaeyoon Myoung, Heewon Kwak
- **Shared by:** ofu
- **Model type:** Language model (Translation)
- **Language(s) (NLP):** Korean, Vietnamese
- **License:** Apache 2.0
- **Parent Model:** facebook/mbart-large-50-many-to-many-mmt

---

## Uses

### Direct Use

This model is intended for text translation from Korean to Vietnamese.

### Out-of-Scope Use

This model is not suitable for translation tasks involving language pairs other than Korean-to-Vietnamese.

---

## Bias, Risks, and Limitations

The model may contain biases inherited from the training data and may produce inappropriate translations for sensitive topics.

---

## Training Details

### Training Data

The model was trained on multilingual translation data of Korean legal documents provided by AI Hub.

### Training Procedure

#### Preprocessing

- Removed unnecessary whitespace, special characters, and line breaks.

### Speeds, Sizes, Times

- **Training Time:** 1 hour 25 minutes (5,100 seconds) on an NVIDIA RTX 4090
- **Throughput:** ~3.51 samples/second
- **Total Training Samples:** 17,922
- **Model Checkpoint Size:** Approximately 2.3 GB
- **Gradient Accumulation Steps:** 4
- **FP16 Mixed Precision Enabled:** Yes

### Training Hyperparameters

The following hyperparameters were used during training:

- **learning_rate**: `0.0001`
- **train_batch_size**: `8` (per device)
- **eval_batch_size**: `8` (per device)
- **seed**: `42`
- **distributed_type**: `single-node` (since `_n_gpu=1` and no distributed training setup is indicated)
- **num_devices**: `1` (single NVIDIA GPU: RTX 4090)
- **gradient_accumulation_steps**: `4`
- **total_train_batch_size**: `32` (calculated as `train_batch_size * gradient_accumulation_steps`)
- **total_eval_batch_size**: `8` (evaluation does not use gradient accumulation)
- **optimizer**: `AdamW` (indicated by `optim=OptimizerNames.ADAMW_TORCH`)
- **lr_scheduler_type**: `linear` (indicated by `lr_scheduler_type=SchedulerType.LINEAR`)
- **lr_scheduler_warmup_steps**: `100`
- **num_epochs**: `3`

---

## Evaluation

### Testing Data

The evaluation used a dataset partially extracted from Korean labor law precedents.
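A minimal sketch of how this evaluation could be reproduced with the Transformers and sacrebleu libraries is shown below. The repository id `ofu/mbart-large-50-mmt-ko-vi` and the example sentence pair are assumptions for illustration; the actual test split and generation settings are not published here.

```python
import sacrebleu
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed repository id for the fine-tuned checkpoint.
model_name = "ofu/mbart-large-50-mmt-ko-vi"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="ko_KR", tgt_lang="vi_VN")
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(korean_sentences):
    """Translate a batch of Korean sentences into Vietnamese."""
    inputs = tokenizer(korean_sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["vi_VN"],  # force Vietnamese output
        max_length=200,  # placeholder; the value used for evaluation is not reported
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Hypothetical held-out pair standing in for the Korean labor-law test split.
sources = ["근로자는 근로조건을 서면으로 명시받을 권리가 있다."]
references = [["Người lao động có quyền được ghi rõ điều kiện lao động bằng văn bản."]]

hypotheses = translate(sources)
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```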
### Metrics

- BLEU

### Results

- **BLEU Score:** 29.69
- **Accuracy:** 95.65%

---

## Environmental Impact

- **Hardware Type:** NVIDIA RTX 4090
- **Power Consumption:** ~450 W
- **Training Time:** 1 hour 25 minutes (1.42 hours)
- **Electricity Consumption:** ~0.639 kWh
- **Carbon Emission Factor (South Korea):** 0.459 kgCO₂/kWh
- **Estimated Carbon Emissions:** ~0.293 kgCO₂

---

## Technical Specifications

- **Model Architecture:** Based on mBART-large-50, a multilingual sequence-to-sequence transformer model designed for translation tasks. The architecture includes 12 encoder and 12 decoder layers with 1,024 hidden units.
- **Software:**
  - sacrebleu for evaluation
  - Hugging Face Transformers library for fine-tuning
  - Python 3.11.9 and PyTorch 2.4.0
- **Hardware:** An NVIDIA RTX 4090 with 24 GB VRAM was used for training and inference.
- **Tokenization and Preprocessing:** Tokenization was performed using the SentencePiece model pre-trained with mBART-large-50. Text preprocessing included removing special characters and unnecessary whitespace and normalizing line breaks (a minimal sketch of this cleanup appears at the end of this card).

---

## Citation

Currently, there are no papers or blog posts available for this model.

---

## Model Card Contact

- **Contact Email:** audwodbs492@ofu.co.kr | gguldanzi@ofu.co.kr
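The preprocessing described under Technical Specifications is not published in code form. The following is a minimal sketch under the assumption that it amounts to simple regular-expression normalization; the exact rules used for training may differ.

```python
import re

def clean_text(text: str) -> str:
    """Normalize line breaks, strip special characters, and collapse whitespace.

    The exact cleanup rules used during training are not documented;
    the patterns below are illustrative assumptions only.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line breaks
    text = re.sub(r"\n+", " ", text)                       # drop remaining line breaks
    text = re.sub(r"[^\w\s.,?!%()\-]", "", text)           # remove special characters
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return text

print(clean_text("제1조(목적)\n이 법은  근로조건의 기준을 정한다. ★"))
```

Which characters count as "special" is an assumption here; adjust the patterns to match the actual training preprocessing before reuse.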