Korean Character BERT Model (small)

This repository hosts the Korean Character (syllable-level) BERT Model, a compact transformer-based model built for Korean language processing tasks. Instead of splitting text into words or subword pieces, the model tokenizes text at the syllable level, which matches the syllable-block structure of written Korean.

Features

Vocabulary Size: The vocabulary contains 7,477 tokens and is focused on Korean syllables. Its small size keeps the embedding table compact and processing efficient.
Transformer Encoder Layers: The architecture uses only 3 transformer encoder layers, trading some model capacity for lower compute and memory cost. This makes the model a candidate for deployments ranging from mobile devices to server environments.
License: This model is open-sourced under the Apache License 2.0, which permits academic and commercial use, modification, and redistribution, provided the license and attribution notices are preserved.
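
To make the idea of syllable-level tokenization concrete, here is a minimal sketch in plain Python. It is an illustration only and is not the model's actual tokenizer: the helper names `split_syllables` and `is_hangul_syllable` are hypothetical, and dropping whitespace is an assumption made for simplicity. The only outside fact it relies on is that precomposed Hangul syllables occupy the Unicode range U+AC00 to U+D7A3.

```python
from typing import List


def is_hangul_syllable(ch: str) -> bool:
    """Return True if `ch` is a precomposed Hangul syllable (U+AC00..U+D7A3)."""
    return len(ch) == 1 and "\uac00" <= ch <= "\ud7a3"


def split_syllables(text: str) -> List[str]:
    """Split text into one token per character, skipping whitespace.

    Assumption: whitespace is dropped. The real tokenizer may treat
    spaces, digits, and punctuation differently.
    """
    return [ch for ch in text if not ch.isspace()]


tokens = split_syllables("한국어 처리")
print(tokens)
print([is_hangul_syllable(t) for t in tokens])
```

Each Hangul syllable becomes its own token, so the sequence length equals the number of non-space characters in the input.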

Getting Started

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MrBananaHuman/char_ko_bert_small")
model = AutoModelForMaskedLM.from_pretrained("MrBananaHuman/char_ko_bert_small")
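
Once the model is loaded, a natural first check is masked-syllable prediction. The sketch below is hedged: `mask_syllable` and `predict_masked` are hypothetical helper names, the default `"[MASK]"` string assumes the tokenizer uses the standard BERT mask token (compare it with `tokenizer.mask_token`), and it assumes PyTorch is installed. The heavy imports live inside `predict_masked` so the masking helper can be used without them.

```python
from typing import List


def mask_syllable(text: str, index: int, mask_token: str = "[MASK]") -> str:
    """Replace the character at `index` with the mask token.

    Assumption: "[MASK]" matches the tokenizer's mask token.
    """
    if not 0 <= index < len(text):
        raise IndexError("index is outside the text")
    return text[:index] + mask_token + text[index + 1:]


def predict_masked(
    text: str,
    model_name: str = "MrBananaHuman/char_ko_bert_small",
    top_k: int = 5,
) -> List[str]:
    """Return the top-k candidate tokens for the first masked position."""
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Locate the mask token in the encoded input.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    positions = is_mask.nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        raise ValueError("no mask token found in the input")

    top_ids = logits[0, positions[0]].topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```

Calling `predict_masked(mask_syllable("안녕하세요", 2))` downloads the checkpoint on first use and returns a list of candidate tokens for the masked syllable.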

Fine-tuning example

Named entity recognition
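
Because the model works on syllables, named entity recognition labels are most naturally expressed per character. The following is a minimal data-preparation sketch under stated assumptions, not the author's fine-tuning recipe: the helper `spans_to_bio`, the BIO tagging scheme, and the `PER`/`LOC` label names are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to one BIO tag per character.

    Each span is (start, end, label) in character offsets, end exclusive.
    Assumption: spans do not overlap; a later span overwrites an earlier one.
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"invalid span: {(start, end, label)}")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


text = "김철수는 서울에 산다"
tags = spans_to_bio(text, [(0, 3, "PER"), (5, 7, "LOC")])
for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```

The resulting tags could then be aligned with the tokenizer's output and fed to a token classification head, for example one loaded with `AutoModelForTokenClassification.from_pretrained(..., num_labels=...)`. If the tokenizer drops whitespace or adds special tokens, the tag sequence has to be adjusted to match.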

Contact

For questions or inquiries, please reach out to me at mrbananahuman.kim@gmail.com.
I'm always happy to discuss the model, potential collaborations, or anything else related to this project.