---
license: mit
language:
- ko
- vi
metrics:
- bleu
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
library_name: transformers
tags:
- mbart
- mbart-50
- text2text-generation
---

# Model Card for mbart-large-50-mmt-ko-vi

This model is fine-tuned from mBART-large-50 on multilingual translation data of Korean legal documents for Korean-to-Vietnamese translation.

---

## Table of Contents

- [Model Card for mbart-large-50-mmt-ko-vi](#model-card-for-mbart-large-50-mmt-ko-vi)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
  - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)
  - [Testing Data](#testing-data)
  - [Metrics](#metrics)
  - [Results](#results)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Citation](#citation)
- [Model Card Contact](#model-card-contact)

---

## Model Details

### Model Description

- **Developed by:** Jaeyoon Myoung, Heewon Kwak
- **Shared by:** ofu
- **Model type:** Language model (Translation)
- **Language(s) (NLP):** Korean, Vietnamese
- **License:** MIT
- **Parent Model:** facebook/mbart-large-50-many-to-many-mmt

---

## Uses

### Direct Use

This model translates text from Korean to Vietnamese. An illustrative inference snippet is provided under [Technical Specifications](#technical-specifications).

### Out-of-Scope Use

This model is not suitable for translation between language pairs other than Korean-to-Vietnamese.

---

## Bias, Risks, and Limitations

The model may inherit biases from its training data and may produce inappropriate translations for sensitive topics.

---

## Training Details

### Training Data

The model was trained on multilingual translation data of Korean legal documents provided by AI Hub.

### Training Procedure

#### Preprocessing

- Removed unnecessary whitespace, special characters, and line breaks.

### Speeds, Sizes, Times

- **Training Time:** 1 hour 25 minutes (5,100 seconds) on an NVIDIA RTX 4090
- **Throughput:** ~3.51 samples/second
- **Total Training Samples:** 17,922
- **Model Checkpoint Size:** ~2.3 GB
- **Gradient Accumulation Steps:** 4
- **FP16 Mixed Precision:** Enabled

---

## Evaluation

### Testing Data

The evaluation used a dataset partially extracted from Korean labor law precedents.

### Metrics

- BLEU

### Results

- **BLEU Score:** 29.69
- **Accuracy:** 95.65%

---

## Environmental Impact

- **Hardware Type:** NVIDIA RTX 4090
- **Power Consumption:** ~450 W
- **Training Time:** 1 hour 25 minutes (1.42 hours)
- **Electricity Consumption:** ~0.639 kWh (450 W × 1.42 h)
- **Carbon Emission Factor (South Korea):** 0.459 kgCO₂/kWh
- **Estimated Carbon Emissions:** ~0.293 kgCO₂ (0.639 kWh × 0.459 kgCO₂/kWh)

---

## Technical Specifications

- **Model Architecture:** Based on mBART-large-50, a multilingual sequence-to-sequence transformer designed for translation tasks. The architecture comprises 12 encoder and 12 decoder layers with a hidden size of 1,024.
- **Software:**
  - sacrebleu for evaluation
  - Hugging Face Transformers library for fine-tuning
  - Python 3.11.9 and PyTorch 2.4.0
- **Hardware:** An NVIDIA RTX 4090 with 24 GB VRAM was used for training and inference.
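- **Example Inference (illustrative):** A minimal sketch of Korean-to-Vietnamese translation with the standard mBART-50 API in Transformers. The repository id below is assumed from this card's title and may differ from the hosted id.

  ```python
  from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

  # Assumed repository id; adjust to the actual hosted checkpoint.
  model_name = "ofu/mbart-large-50-mmt-ko-vi"
  model = MBartForConditionalGeneration.from_pretrained(model_name)
  tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

  # mBART-50 requires the source language to be set before encoding.
  tokenizer.src_lang = "ko_KR"
  # "A worker may not be dismissed without just cause."
  text = "근로자는 정당한 이유 없이 해고될 수 없다."
  inputs = tokenizer(text, return_tensors="pt")

  # Force the decoder to start with the Vietnamese language token.
  generated = model.generate(
      **inputs,
      forced_bos_token_id=tokenizer.lang_code_to_id["vi_VN"],
      max_length=200,
  )
  print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
  ```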
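- **Evaluation Sketch (illustrative):** This card does not state the exact sacrebleu invocation, but corpus-level BLEU (as reported under [Results](#results)) is typically computed as follows; the strings here are hypothetical placeholders.

  ```python
  import sacrebleu

  # Placeholder data: model outputs and one gold reference per sentence.
  hypotheses = ["Bản dịch mẫu thứ nhất.", "Bản dịch mẫu thứ hai."]
  references = [["Tham chiếu mẫu thứ nhất.", "Tham chiếu mẫu thứ hai."]]

  # corpus_bleu takes a list of hypotheses and a list of reference streams.
  bleu = sacrebleu.corpus_bleu(hypotheses, references)
  print(f"BLEU = {bleu.score:.2f}")
  ```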
- **Tokenization and Preprocessing:** Tokenization used the pre-trained SentencePiece model that ships with mBART-large-50. Text preprocessing included removing special characters and unnecessary whitespace and normalizing line breaks.
- **Optimizer and Hyperparameters:** (see the configuration sketch at the end of this card)
  - Optimizer: AdamW
  - Learning Rate: 1e-4
  - Batch Size: 8 (per device)
  - Gradient Accumulation Steps: 4
  - Label Smoothing Factor: 0.1
  - FP16 Mixed Precision: Enabled

---

## Citation

No papers or blog posts are currently available for this model.

---

## Model Card Contact

- **Contact Email:** audwodbs492@ofu.co.kr
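---

## Training Configuration Sketch

The hyperparameters listed under [Technical Specifications](#technical-specifications) map roughly onto Hugging Face `Seq2SeqTrainingArguments` as shown below. This is an illustration of the reported settings, not the authors' actual training script; the output directory and epoch count are placeholder assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters reported in this card. AdamW is the
# default optimizer used by the Hugging Face Trainer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./mbart-large-50-mmt-ko-vi",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32
    label_smoothing_factor=0.1,
    fp16=True,                      # mixed-precision training
    num_train_epochs=1,             # not reported; assumed from throughput
    predict_with_generate=True,     # decode full sequences during evaluation
)
```

With a per-device batch size of 8 and 4 gradient-accumulation steps, the effective batch size is 32.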