---
title: Tokenizer Arena
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 4.31.4
app_file: app.py
pinned: false
datasets:
  - cc100
---



## Compression Rate


On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we sample 10,000 documents per language and measure the compression rate of each tokenizer.

> Compression rate example:
> llama3 enlarged its vocabulary and achieves a higher compression ratio. For the same 1 TB of Simplified Chinese text, llama produces 0.56 trillion tokens after tokenization, while llama3 needs only 0.31 trillion (1 TB × 0.56 tokens/byte vs. 1 TB × 0.31 tokens/byte, matching the `t_tokens/t_bytes` column below).

| tokenizer                    |   vocab_size |    t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
|:-----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| llama                        |        32000 |               1.8  |               0.56 |               0.7  |
| llama3                       |       128000 |               3.2  |               0.31 |               1.24 |

The results can be reproduced with the following script:
```sh
python utils/compress_rate_util.py 
```
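For reference, the metrics in the tables can be computed along the following lines. This is a minimal sketch, not the actual `utils/compress_rate_util.py`; the tokenizer name and sample texts are purely illustrative.

```python
# Minimal sketch of the compression-rate metrics used in the tables.
# Not the actual utils/compress_rate_util.py; the tokenizer name below
# ("gpt2") is only an example.
from transformers import AutoTokenizer

def compress_rate(tokenizer_name: str, texts: list[str]) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)  # UTF-8 byte count
    n_chars = sum(len(t) for t in texts)                  # character count
    return {
        "vocab_size": tokenizer.vocab_size,
        "t_bytes/t_tokens": n_bytes / n_tokens,  # avg bytes per token (higher = better compression)
        "t_tokens/t_bytes": n_tokens / n_bytes,  # avg tokens per byte
        "n_chars/n_tokens": n_chars / n_tokens,  # avg characters per token
    }

if __name__ == "__main__":
    sample = ["压缩率因语言和分词器而异。",
              "Compression rates vary by language and tokenizer."]
    print(compress_rate("gpt2", sample))
```

Note that bytes-per-token and tokens-per-byte are reciprocals, so either column alone determines the other; characters-per-token differs because a character may span several UTF-8 bytes (e.g., three for most Chinese characters).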




<details> <summary>English compression rate</summary>
Compression rates computed on the English dataset cc100-en.

| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   b_tokens/g_bytes |   t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| amber                       |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| aya_101                     |       250100 |               3.3  |               0.3  |               3.22 |               0.31 |               3.53 |
| baichuan                    |        64000 |               3.74 |               0.27 |               3.65 |               0.27 |               4    |
| baichuan2                   |       125696 |               3.89 |               0.26 |               3.8  |               0.26 |               4.17 |

</details>


<details> <summary>Simplified Chinese compression rate</summary>
Compression rates computed on the Simplified Chinese dataset cc100-zh-Hans.

| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   b_tokens/g_bytes |   t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| amber                       |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| aya_101                     |       250100 |               3.89 |               0.26 |               3.79 |               0.26 |               1.47 |
| baichuan                    |        64000 |               3.92 |               0.26 |               3.82 |               0.26 |               1.48 |

</details>




## Reference

- paper
  - Getting the most out of your tokenizer for pre-training and domain adaptation
  - Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- blog
  - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
  - https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece
  - https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/1_tokenizer.html
  - https://zhuanlan.zhihu.com/p/652520262
  - https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
  - https://tonybaloney.github.io/posts/cjk-chinese-japanese-korean-llm-ai-best-practices.html
- demo
  - https://huggingface.co/spaces/Xenova/the-tokenizer-playground
  - https://github.com/dqbd/tiktokenizer
  - https://chat.lmsys.org/?leaderboard
  - https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard