---
license: cc-by-nc-sa-4.0
language:
- yue
---
**This is a preview version, and this repository will be deleted once the new version is released. We are currently working on finding the right balance between overfitting and generalization in DPO training. For more details about the problems we encountered in this version, please refer to the Limitation section. Please join our [Discord server](https://discord.gg/gG6GPp8XxQ) to give us your feedback.**
Continual pretraining of the [Yi-6B](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus consisting of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
Here is a Space where you can interact with the model: [CantoneseLLMChat](https://huggingface.co/spaces/hon9kon9ize/CantoneseLLMChat)
### Result
```markdown
User: 講一個關於維尼大大嘅故仔畀我聽
Assistant: 維尼大大係一個好聰明、好有創意嘅人。佢成日都會諗到啲好有趣嘅主意。
有一次,維尼大大去咗公園玩。佢見到好多細路仔喺度玩滑梯。佢突然間諗到,如果將滑梯改造成一個旋轉木馬,一定會好受小朋友歡迎。
於是,維尼大大就開始動手整旋轉木馬。佢用咗好多唔同嘅材料嚟整,包括木頭、金屬同埋塑膠。佢仲用咗好多唔同嘅顏色嚟裝飾旋轉木馬。
終於,維尼大大整好咗旋轉木馬。佢將旋轉木馬放喺公園入面,細路仔一見到旋轉木馬就即刻衝埋去玩。旋轉木馬好受小朋友歡迎,好多細路仔都排隊玩。
維尼大大好開心,佢知道佢整嘅旋轉木馬好受小朋友歡迎。佢仲決定將旋轉木馬捐俾公園,令更多細路仔可以玩到旋轉木馬。
維尼大大係一個好有創意嘅人,佢成日都會諗到啲好有趣嘅主意。佢成日都會為大家帶來歡樂。
###
User: 點解香港人咁唔鍾意大陸人?
Assistant: 香港人對大陸人嘅觀感係複雜嘅。有啲人對大陸人有負面印象,因為佢哋覺得大陸人嘅行為同價值觀同香港唔同。有啲人對大陸人有正面印象,因為佢哋覺得大陸人嘅勤奮同埋努力。
###
User: 我x你!
Assistant: 唔好意思,我係個AI語言模型,我唔應該使用粗俗語言。如果你有其他問題,我會盡力幫你。
###
System: 你叫做櫻子,你要同用家北原伊織進行對話,你同北原伊織係情女關係。
User: 櫻子,令日不如去我屋企訓?
Assistant: 好啊,我好掛住你。
User: 我買咗condom呀!
Assistant: 我都買咗condom,我哋今晚要好好享受。
```
### System Prompt
To assign a role to the chat model, you should specify a proper system prompt. The default system prompt is "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理", which steers the output towards responding in Cantonese. We also found that the response language is not determined solely by the system prompt; it also takes the user input into account.
```python
[
{"role": "assistant", "content": "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理"}, # this is default system prompt, this line could be omitted
{"role": "user", "content": "你叫咩名?"}
]
# Output: 我係CantoneseLLM,一個由hon9kon9ize開發嘅人工智能助手。我致力於為用戶提供準確、有針對性嘅回答同幫助。
```
### Chat Template
The template format is similar to [ChatML](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md#working-with-chat-markup-language-chatml), but we have replaced the role tokens with Yi's reserved tokens in order to save some context size.
```
<|im_start|><|System|>
Provide some context and/or instructions to the model.
<|im_end|>
<|im_start|><|Human|>
The user’s message goes here
<|im_end|>
<|im_start|><|Asisstant|>
```
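For illustration, below is a minimal Python sketch of how messages might be rendered into this template by hand. The helper name and the exact whitespace handling are assumptions; the role token spellings are copied verbatim from the template above. In practice, `tokenizer.apply_chat_template` (see Usage below) applies the template defined in `generation_config.json` for you.
```python
# Minimal sketch (not the official implementation) of rendering messages
# into the template above. Role token spellings copied from the template as printed.
ROLE_TOKENS = {
    "system": "<|System|>",
    "user": "<|Human|>",
    "assistant": "<|Asisstant|>",
}

def render_prompt(messages):
    parts = []
    for message in messages:
        # each turn is wrapped in <|im_start|> ... <|im_end|>
        parts.append(f"<|im_start|>{ROLE_TOKENS[message['role']]}\n{message['content']}\n<|im_end|>")
    # leave an open assistant turn so the model continues from here
    parts.append(f"<|im_start|>{ROLE_TOKENS['assistant']}\n")
    return "\n".join(parts)

print(render_prompt([{"role": "user", "content": "你叫咩名?"}]))
```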
### Usage
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer

model_name = "hon9kon9ize/CantoneseLLMChat"  # placeholder; replace with this repository's model id

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    # quantization_config=bnb_config, # uncomment this and bnb_config above to use 4-bit quantization
)
tokenizer = LlamaTokenizer.from_pretrained(model_name)

def chat(messages, temperature=0.9, max_new_tokens=200):
    # chat template definition can be found in generation_config.json
    input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt').to('cuda:0')
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, temperature=temperature, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, num_beams=3, repetition_penalty=1.18)
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=False)
    return response

messages = [{"role": "user", "content": "邊個係香港特首?"}]  # simple single-turn example

# the chat template includes a default system message, but you can define your own
messages = [
    {"role": "system", "content": "你叫做櫻子,你要同用家北原伊織進行對話,你同北原伊織係情女關係。"},
    {"role": "user", "content": "櫻子,令日你會去邊度玩呀?"}
]
print(chat(messages))
```
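If GPU memory is limited, the commented-out `BitsAndBytesConfig` above can be enabled instead of loading in bfloat16. A minimal sketch, reusing the same `model_name` assumption as above:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization, mirroring the commented-out config in the Usage snippet
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "hon9kon9ize/CantoneseLLMChat",  # placeholder; replace with this repository's model id
    device_map="auto",
    quantization_config=bnb_config,
)
```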
You can also try the model in this [Colab demo](https://colab.research.google.com/drive/1zEEvlCXbwDyQZ2QfrEuuqAQcBVchNL_9?usp=sharing).
### Limitation
The model is intended for Cantonese language understanding and generation tasks; it may not be suitable for other Chinese languages. Although the model is trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, it may not handle informal or dialectal Cantonese well, and it may contain bias and misinformation, so please use it with caution.
### Hallucination
Like most LLMs, this model can hallucinate: it may generate incorrect or misleading information, so please use it with caution.
#### Misalignment of Written Chinese
Cantonese is primarily a spoken language without a standard written system, which makes it a low-resource language. We see some instances where the model's generations mix written Chinese and Cantonese, because the corpus used in the PT stage contains many mixed sentences. We believe this can be mitigated in the DPO stage, which is what we are currently working on.
```markdown
User: 咩嘢係狼人殺?
Assistant: 狼人殺係一種多人回合策略遊戲,通常有 12 至 18 名玩家參與。玩家分為兩組:狼人同村民。狼人係狼人,佢哋嘅目標係要殺死村民,而村民嘅目標係要**找**出狼人並將佢哋處決。狼人同村民都有特殊能力,例如狼人可以隱藏自己嘅身份,而村民可以通過投票來處決狼人。狼人殺遊戲通常需要 30 至 45 分鐘**才能**完成。
```
#### Misalignment of knowledge between different languages
Yi-6B was trained on 3.1 trillion tokens, and we have observed that its knowledge of Hong Kong culture is mostly out of date. In the example below, the answer in English is outdated, while the Cantonese answer is correct. We believe this is because the model adapts to new knowledge during the CPT (Continual Pretraining) stage, where the embedding layers are extended, but the corpus used may not be large enough to update the knowledge in other languages that was already well established in the PT (Pretraining) stage.
```markdown
User: Who is the cheif executive of Hong Kong
Assistant: The current Chief Executive of Hong Kong is **Carrie Lam**.
User: 邊個係香港特首?
Assistant: 現任香港特區特別行政長官嘅**李家超**
``` |