|
--- |
|
datasets: |
|
- HuggingFaceFW/fineweb |
|
- erhwenkuo/c4-chinese-zhtw |
|
- erhwenkuo/wikipedia-zhtw |
|
- p208p2002/wudao |
|
- p208p2002/NDLTD-T10-90-111 |
|
- codeparrot/github-code-clean |
|
language: |
|
- en |
|
- zh |
|
license: llama3 |
|
--- |
|
# Llama 3 zhtw |
|
|
|
An experiment in Chinese continued pretraining (CP) on Llama 3, trained on a total of 800M tokens.
|
|
|
Because the quality of available Chinese pretraining corpora still leaves room for improvement, performance after CP does not surpass the original Llama 3; several community-trained Chinese Llama 3 models we compared show a similar pattern.
|
|
|
On the English side, Llama 3 zhtw uses FineWeb, which keeps its MMLU score above that of the other Chinese CP models and on par with the original Llama 3.
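
As a quick start, the sketch below shows one way to load the model with Hugging Face `transformers` for plain text completion. The repository id matches the entry in the benchmark table; the dtype, device placement, and example prompt are illustrative choices, not part of the original recipe.

```python
# Minimal usage sketch; assumes torch, transformers (and accelerate for device_map) are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "p208p2002/llama-3-zhtw-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; pick a dtype your hardware supports
    device_map="auto",
)

# This is a base (non-chat) model, so plain text completion is the natural interface.
prompt = "台灣最高的山是"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```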
|
|
|
## Benchmarks |
|
| Model                         | Size | ↑ TMMLU+ (ACC) | CMMLU (ACC)   | MMLU (ACC)    |
| ----------------------------- | ---- | -------------- | ------------- | ------------- |
|                               |      | TC, Knowledge  | CN, Knowledge | EN, Knowledge |
|                               |      | 5-shot         | 5-shot        | 5-shot        |
| Yi-6B                         | 6B   | 49.63          | 75.53         | 65.35         |
| Qwen-7B                       | 7B   | 42.84          | 73.1          | 61.00         |
| Meta-Llama-3-8B               | 8B   | 41.97          | 50.8          | 65.17         |
| **p208p2002/llama-3-zhtw-8B** | 8B   | 41.84          | 50.6          | 65.31         |
| Breeze-7B-Base-v0_1           | 7B   | 40.35          | 44.05         | 61.63         |
| hfl/llama-3-chinese-8b        | 8B   | 39.64          | 50.9          | 61.1          |
|
|
|
## Recipe |
|
|
|
### Datasets |
|
| Dataset        | Lang  | Weight |
| -------------- | ----- | ------ |
| FineWeb        | en    | 0.35   |
| Wudao          | zh-cn | 0.1    |
| C4Tw           | zh-tw | 0.1    |
| WikiZhTw       | zh-tw | 0.15   |
| NdltdT10       | zh-tw | 0.1    |
| GitHubMarkDown | code  | 0.1    |
| GitHubPython   | code  | 0.1    |
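
The sketch below shows one way such a weighted mixture could be assembled with the Hugging Face `datasets` library. The repository ids come from the metadata above; the split names, streaming mode, GitHub `language` filter field, and seed are illustrative assumptions rather than the exact preprocessing used for this run.

```python
# Sketch of a weighted pretraining mixture with Hugging Face `datasets`.
# The probabilities mirror the Weight column above; split/filter choices are assumptions.
from datasets import load_dataset, interleave_datasets

fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
wudao = load_dataset("p208p2002/wudao", split="train", streaming=True)
c4_tw = load_dataset("erhwenkuo/c4-chinese-zhtw", split="train", streaming=True)
wiki_tw = load_dataset("erhwenkuo/wikipedia-zhtw", split="train", streaming=True)
ndltd = load_dataset("p208p2002/NDLTD-T10-90-111", split="train", streaming=True)
github = load_dataset("codeparrot/github-code-clean", split="train", streaming=True)

# Split the GitHub corpus into Markdown and Python streams
# (the `language` column name is an assumption about that dataset's schema).
github_md = github.filter(lambda x: x["language"] == "Markdown")
github_py = github.filter(lambda x: x["language"] == "Python")

# Sample sources in proportion to the mixture weights (they sum to 1.0).
mixture = interleave_datasets(
    [fineweb, wudao, c4_tw, wiki_tw, ndltd, github_md, github_py],
    probabilities=[0.35, 0.1, 0.1, 0.15, 0.1, 0.1, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)
```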
|
|
|
### Hyperparameters
|
|
|
- Learning Rate: 1e-7 |
|
- Global Batch Size: 60 |
|
- Sequence Length: 8192 |
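
For reference, a minimal sketch of how these values might map onto Hugging Face `TrainingArguments`. The split of the global batch size of 60 into per-device batch size, gradient-accumulation steps, and world size, as well as the precision and logging settings, are illustrative assumptions.

```python
# Sketch only: maps the hyperparameters above onto TrainingArguments.
# Assumed split: 6 GPUs x 2 sequences/GPU x 5 accumulation steps = global batch size 60.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3-zhtw-8b-cp",
    learning_rate=1e-7,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=5,
    bf16=True,
    logging_steps=10,
    save_steps=500,
)

# The sequence length (8192) is handled at tokenization/packing time rather than in
# TrainingArguments, e.g. by chunking the token stream into 8192-token blocks.
```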