Model Card for RedWhale-tv-10.8B-v1.0

Model Description

RedWhale은 전처리한 한국어 Corpus, 특화된 한국어 Tokenizer, 효과적인 Model initialization, Continuous Multi-Stage Pretraining strategy 등을 갖추고 있습니다. 이러한 접근 방식은 높은 정확도와 이해도를 유지하면서 Computational costs를 줄여 제한된 리소스에서 Pretraining을 가능하게 해줍니다. RedWhale 사용을 원하시면 repo access 요청해주세요.

About the Model

Name: TwinDoc/RedWhale-tv-10.8B-v1.0
Foundation Model: upstage/SOLAR-10.7B-v1.0
Train Corpus: preprocessed AI-Hub datasets
Developed by: 애자일소다 (AGILESODA)
Model type: llama
Language(s) (NLP): 한국어, 영어
License: cc-by-nc-sa-4.0
Paper: RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining

Load the Model

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

YOUR_HF_TOKEN_READ = "hf_..."
model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)

Generate Text

text = "대한민국의 수도는"
encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id] + tokenizer("\n", add_special_tokens=False)["input_ids"]

outputs = model.generate(**encodings, eos_token_id=terminators)
generated_text = tokenizer.batch_decode(outputs)[0]
# '<s> 대한민국의 수도는 서울이다.\n'

License

The content of this project, created by AGILESODA, is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Citation

@misc{vo2024redwhaleadaptedkoreanllm,
      title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining}, 
      author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
      year={2024},
      eprint={2408.11294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11294}, 
}

Built with:

TwinDoc
/

RedWhale-tv-10.8B-v1.0

You need to agree to share your contact information to access this model