SakuraLLM/LN-Thai-14B-v0.1

CjangCjengh committed on Aug 15

Commit 3989e9e • 1 Parent(s): a6053dc

Upload files

Files changed (1)

README.md ADDED +55 -0

	@@ -0,0 +1,55 @@

+---
+license: cc-by-nc-sa-4.0
+language:
+- th
+- zh
+---
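+Based on [Sakura-14B-Qwen2beta-Base-v2](https://huggingface.co/SakuraLLM/Sakura-14B-Qwen2beta-Base-v2), fine-tuned on Thai-Chinese translation data (69 MB of aligned Thai and Chinese translations of Japanese light novels, plus 10 MB of Thai translations of Chinese web novels).
+The model only supports translation from Thai to Simplified Chinese.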
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.generation import GenerationConfig
+model_path = 'CjangCjengh/LN-Thai-14B-v0.1'
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_path, device_map='auto', trust_remote_code=True).eval()
+model.generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
+# Separate paragraphs with \n
+text = '''“อาจารย์คะ ช่วยรับหนูเป็นลูกศิษย์ด้วยนะคะ”
+มิยุจัง เด็กสาวได้รับรู้ความลับของ ชาลี เด็กหนุ่มว่า ตัวจริงของคือท่านอาจารย์ 007H นักเขียนนิยายลามกชื่อดังที่เธอคลั่งไคล้ เด็กสาวผู้อยากจะเขียนนิยายลามกแบบนี้บ้างจึงมาขอฝากตัวเป็นลูกศิษย์ของชาลี พร้อมกับเรื่องวุ่น ๆ ของเด็กหนุ่มที่อยากไล่เธอออกไปก่อนที่ชีวิตส่วนตัวของตัวเองจะพินาศไปในพริบตา ทว่า นานวันเข้าความสัมพันธ์ของอาจารย์หนุ่มกับลูกศิษย์ตัวน้อยก็เริ่มแน่นแฟ้นมากขึ้น
+นิยายลามกเรื่องใหม่ครั้งนี้ชาลีจะเขียนเสร็จก่อนหรือเข้าไปนอนในดาวหมีก่อนกันนะ ?'''
+# Remove zero-width spaces
+text = text.replace('\u200b','')
+# Keep the text under 2048 characters
+assert len(text) < 2048
+messages = [
+    {'role': 'system', 'content': '你是一个轻小说译者，善于将外文轻小说翻译成中文'},
+    {'role': 'user', 'content': f'翻译成中文：\n{text}'}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors='pt').to(model.device)
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=1024
+)
+generated_ids = [
+    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+]
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(response)
+```
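
The README example asserts that the input stays under 2048 characters, so longer passages have to be split before translation. The following is a minimal sketch of one way to do that, not part of the committed README: `split_into_chunks` and its `max_chars` parameter are hypothetical names, and the sketch assumes that cutting only at paragraph boundaries (`\n`) is acceptable.

```python
def split_into_chunks(text: str, max_chars: int = 2047) -> list[str]:
    """Group newline-separated paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: it mirrors the README's preprocessing (zero-width
    spaces removed, paragraphs separated by '\n') and never cuts a paragraph.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in text.replace('\u200b', '').split('\n'):
        if len(paragraph) > max_chars:
            raise ValueError('a single paragraph exceeds max_chars')
        # +1 accounts for the '\n' that joins this paragraph to the chunk
        added = len(paragraph) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # The chunk is full: flush it and start a new one
            chunks.append('\n'.join(current))
            current, current_len = [], 0
            added = len(paragraph)
        current.append(paragraph)
        current_len += added
    if current:
        chunks.append('\n'.join(current))
    return chunks
```

Each returned chunk could then be used as `text` in the example above, translated separately, and the outputs joined with `\n`. A paragraph that is itself longer than the limit raises an error rather than being cut mid-sentence.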