---
datasets:
- TigerResearch/pretrain_zh
base_model:
- Qwen/Qwen2.5-14B
tags:
- character
- generation
license: apache-2.0
---

**Qwen2.5-14B-Character**

**Introduction:**

**Qwen2.5-14B-Character** is the character-level version of [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B). Developed from that base model, it is specifically designed for character-to-character transformation and generation tasks.

**Core Contributions:**

1. **Modified Token Vocabulary:** The original model's token vocabulary has been revised to remove tokens representing phrases and multiple characters, sharpening the model's focus on individual character processing (see the sketch after this list).

2. **Continued Pre-training:** Based on the modified vocabulary, the model has undergone further pre-training to optimize its performance and adaptability for character-level tasks.
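
Neither the filtering code nor the exact rule is published in this card; the sketch below illustrates one plausible approach, assuming a token is flagged whenever its decoded surface form spans more than one character. Both the rule and the embedding-resize step noted at the end are assumptions, not the released procedure.

```python
from transformers import AutoTokenizer

# Load the original tokenizer (the filtering rule below is an assumption).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B")

vocab = tokenizer.get_vocab()  # maps token string -> token id
multi_char_ids = []
for token, token_id in vocab.items():
    # Decode each token in isolation. Byte-level BPE pieces that are not
    # valid UTF-8 on their own decode to the replacement character, so this
    # check is only a heuristic.
    surface = tokenizer.decode([token_id])
    if len(surface.strip()) > 1:
        multi_char_ids.append(token_id)

print(f"{len(multi_char_ids)} of {len(vocab)} tokens span multiple characters")

# After building a reduced tokenizer, the model's embedding matrix would have
# to be shrunk to match, e.g. model.resize_token_embeddings(len(tokenizer)).
```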

**Training Dataset:**

The model has been trained using the `TigerResearch/pretrain_zh` dataset, a comprehensive Chinese pre-training dataset provided by **TigerResearch**. For more information about the dataset, please visit: [TigerResearch/pretrain_zh](https://huggingface.co/datasets/TigerResearch/pretrain_zh).
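
As an illustration only, the corpus can be inspected with the `datasets` library; the snippet below assumes a `train` split exists and uses streaming to avoid downloading the full corpus up front.

```python
from datasets import load_dataset

# Stream records instead of downloading the whole corpus; the "train" split
# name is an assumption about the dataset configuration.
ds = load_dataset("TigerResearch/pretrain_zh", split="train", streaming=True)

# Print one record to inspect the schema (field names are not documented here).
first = next(iter(ds))
print(first)
```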

**Training Code:**

The training process was carried out with **LLaMA-Factory**, an open-source project that provides tools and frameworks for training language models. The LLaMA-Factory codebase is available at [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).

**Results**

To assess the efficacy of Qwen2.5-14B-Character, we evaluated its performance on three widely used benchmarks: C-Eval, CMMLU, and MMLU. The results are tabulated below:

| Model | C-Eval | CMMLU | MMLU |
| :---: | :---: | :---: | :---: |
| Qwen2.5-14B | 85.29 | 85.84 | 79.86 |
| Qwen2.5-14B-filter | 83.43 | 83.72 | 79.75 |
| Qwen2.5-14B-Character | 84.99 | 84.60 | 79.61 |

To make the effect of each step easier to discern, the table reports results for both the original Qwen2.5-14B (Qwen2.5-14B) and the token-modified Qwen2.5-14B (Qwen2.5-14B-filter) alongside the final model.

**Quickstart**

We recommend the latest version of `transformers` (at least 4.37.0). The following code snippet shows how to use the chat model with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'Henry94/Qwen2.5-14B-Character'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# "Please give a brief introduction to large language models."
prompt = "请简单介绍一下大型语言模型."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Keep only the newly generated tokens, dropping the echoed prompt.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
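
For incremental output, `transformers` also provides `TextStreamer`. A minimal sketch, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above, that prints tokens to stdout as they are generated:

```python
from transformers import TextStreamer

# skip_prompt hides the echoed input; skip_special_tokens is forwarded to
# the decoder when each chunk is converted back to text.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```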