---
title: Tokenizer Arena
emoji: ⚡
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 4.31.4
app_file: app.py
pinned: false
datasets:
- cc100
---
## Compression Rate

On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we take 10,000 samples per language and measure the compression rate of different tokenizers.

> Compression rate example:
> llama3 expands the vocabulary and therefore achieves a higher compression ratio. For the same 1 TB of Simplified Chinese text, llama tokenizes it into 0.56 trillion tokens, while llama3 needs only 0.31 trillion.

In the tables, `t_bytes/t_tokens` is TB of text per trillion tokens (roughly bytes per token), `t_tokens/t_bytes` is its reciprocal, and `n_chars/n_tokens` is the average number of characters per token; the `g_bytes`/`b_tokens` columns in the detailed tables use GB and billions of tokens.

| tokenizer | vocab_size | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
|:-----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| llama | 32000 | 1.8 | 0.56 | 0.7 |
| llama3 | 128000 | 3.2 | 0.31 | 1.24 |
These results can be reproduced with the following script:
```sh
python utils/compress_rate_util.py
```
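
The metrics reduce to counting UTF-8 bytes, characters, and tokens over a text sample. The sketch below is a simplified illustration of that calculation, not the exact logic of `utils/compress_rate_util.py`; the `gpt2` checkpoint and the sample strings are placeholders, and any Hugging Face tokenizer name can be substituted.

```python
# Minimal sketch of the compression-rate computation; the real script
# (utils/compress_rate_util.py) runs over cc-100 samples per language.
from transformers import AutoTokenizer


def compress_rate(tokenizer_name: str, texts: list[str]) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)    # raw UTF-8 bytes
    n_chars = sum(len(t) for t in texts)                     # Unicode characters
    n_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return {
        "vocab_size": len(tokenizer),
        "bytes/token": n_bytes / n_tokens,   # analogous to t_bytes/t_tokens
        "tokens/byte": n_tokens / n_bytes,   # analogous to t_tokens/t_bytes
        "chars/token": n_chars / n_tokens,   # analogous to n_chars/n_tokens
    }


if __name__ == "__main__":
    sample = ["Tokenizers compress text at very different rates.", "今天天气不错,适合出去走走。"]
    print(compress_rate("gpt2", sample))
```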
<details> <summary>English compression rate</summary>

Compression rate measured on the English dataset cc100-en:

| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| amber | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
| aya_101 | 250100 | 3.3 | 0.3 | 3.22 | 0.31 | 3.53 |
| baichuan | 64000 | 3.74 | 0.27 | 3.65 | 0.27 | 4 |
| baichuan2 | 125696 | 3.89 | 0.26 | 3.8 | 0.26 | 4.17 |
</details>
<details> <summary>Simplified Chinese compression rate</summary>

Compression rate measured on the Simplified Chinese dataset cc100-zh-Hans:

| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| amber | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
| aya_101 | 250100 | 3.89 | 0.26 | 3.79 | 0.26 | 1.47 |
| baichuan | 64000 | 3.92 | 0.26 | 3.82 | 0.26 | 1.48 |
</details>
## Reference
- paper
  - Getting the most out of your tokenizer for pre-training and domain adaptation
  - Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- blog
  - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
  - https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece
  - https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/1_tokenizer.html
  - https://zhuanlan.zhihu.com/p/652520262
  - https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
  - https://tonybaloney.github.io/posts/cjk-chinese-japanese-korean-llm-ai-best-practices.html
- demo
  - https://huggingface.co/spaces/Xenova/the-tokenizer-playground
  - https://github.com/dqbd/tiktokenizer
  - https://chat.lmsys.org/?leaderboard
  - https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard