Commit 5b980d8 by indiejoseph (1 parent: 52f94c1)
Update README.md
README.md
CHANGED
@@ -4,9 +4,15 @@ language:
 - yue
 ---

-
+---
+license: cc-by-nc-sa-4.0
+language:
+- yue
+---
+
+**This is a preview version, and this repository will be deleted once the new version is released. We are currently working to find the balance between overfitting and generalization in DPO training. For more details about the problems we encountered in this version, please refer to the Limitation section.**

-Continual pretraining model of
+A continually pretrained version of the [Yi-6B](https://huggingface.co/01-ai/Yi-6B) model, trained on a Cantonese corpus consisting of translated Hong Kong news, Wikipedia articles, subtitles, and open-source dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.

 The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with minor misalignment between written Chinese and Cantonese, as well as with knowledge transfer across different languages.

@@ -126,6 +132,10 @@ print(chat(messages))

 The model is intended for Cantonese language understanding and generation tasks; it may not be suitable for other Chinese languages. It was trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, so it may not be suitable for informal or dialectal Cantonese. It may contain bias and misinformation; please use it with caution.

+### Hallucination
+
+Like most LLMs, this model has hallucination issues: it may generate incorrect or misleading information. Please use it with caution.
+
 #### Misalignment of Written Chinese

 Cantonese is a spoken language without a standard written system, which makes it a low-resource language. Some of this model's generations mix Written Chinese and Cantonese because the corpus used in the pretraining (PT) stage contains many mixed sentences. We believe this can be mitigated in the DPO stage, which is what we are currently working on.
@@ -133,7 +143,7 @@ Cantonese is a spoken language without a standard written system, which makes
 ```
 User: 咩嘢係狼人殺?

-Assistant:
+Assistant: 狼人殺係一種多人回合策略遊戲,通常有 12 至 18 名玩家參與。玩家分為兩組:狼人同村民。狼人係狼人,佢哋嘅目標係要殺死村民,而村民嘅目標係要**找**出狼人並將佢哋處決。狼人同村民都有特殊能力,例如狼人可以隱藏自己嘅身份,而村民可以通過投票來處決狼人。狼人殺遊戲通常需要 30 至 45 分鐘才能完成。
 ```

 #### Misalignment of knowledge between different languages