Spaces:

xu-song
/

tokenizer-arena

Running

App Files Files Community

xu-song commited on Apr 23

Commit

988921c

•

1 Parent(s): 7d2062e

update compress rate

Browse files

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50) hide show

README.md +174 -81
app.py +2 -2
stats/README.md +0 -0
stats/compress_rate/amber.en.json +1 -0
stats/compress_rate/amber.zh-Hans.json +1 -0
stats/compress_rate/aya_101.en.json +1 -0
stats/compress_rate/aya_101.zh-Hans.json +1 -0
stats/compress_rate/baichuan.en.json +1 -0
stats/compress_rate/baichuan.zh-Hans.json +1 -0
stats/compress_rate/baichuan2.en.json +1 -0
stats/compress_rate/baichuan2.zh-Hans.json +1 -0
stats/compress_rate/bert_base_cased.en.json +1 -0
stats/compress_rate/bert_base_cased.zh-Hans.json +1 -0
stats/compress_rate/bert_base_chinese.en.json +1 -0
stats/compress_rate/bert_base_chinese.zh-Hans.json +1 -0
stats/compress_rate/bert_base_uncased.en.json +1 -0
stats/compress_rate/bert_base_uncased.zh-Hans.json +1 -0
stats/compress_rate/bloom.en.json +1 -0
stats/compress_rate/bloom.zh-Hans.json +1 -0
stats/compress_rate/byt5_small.en.json +1 -0
stats/compress_rate/byt5_small.zh-Hans.json +1 -0
stats/compress_rate/character_glm_6b.en.json +1 -0
stats/compress_rate/character_glm_6b.zh-Hans.json +1 -0
stats/compress_rate/chatglm2_6b.en.json +1 -0
stats/compress_rate/chatglm2_6b.zh-Hans.json +1 -0
stats/compress_rate/chatglm3_6b.en.json +1 -0
stats/compress_rate/chatglm3_6b.zh-Hans.json +1 -0
stats/compress_rate/chatglm_6b.en.json +1 -0
stats/compress_rate/chatglm_6b.zh-Hans.json +1 -0
stats/compress_rate/chatyuan_large_v2.en.json +1 -0
stats/compress_rate/chatyuan_large_v2.zh-Hans.json +1 -0
stats/compress_rate/chinese_llama.en.json +1 -0
stats/compress_rate/chinese_llama.zh-Hans.json +1 -0
stats/compress_rate/chinese_llama2.en.json +1 -0
stats/compress_rate/chinese_llama2.zh-Hans.json +1 -0
stats/compress_rate/code_davinci_002.en.json +1 -0
stats/compress_rate/code_davinci_002.zh-Hans.json +1 -0
stats/compress_rate/crystal_coder.en.json +1 -0
stats/compress_rate/crystal_coder.zh-Hans.json +1 -0
stats/compress_rate/dbrx_instruct.en.json +1 -0
stats/compress_rate/dbrx_instruct.zh-Hans.json +1 -0
stats/compress_rate/deepseek_coder_33b_instruct.en.json +1 -0
stats/compress_rate/deepseek_coder_33b_instruct.zh-Hans.json +1 -0
stats/compress_rate/deepseek_llm_7b_base.en.json +1 -0
stats/compress_rate/deepseek_llm_7b_base.zh-Hans.json +1 -0
stats/compress_rate/falcon_180b.en.json +1 -0
stats/compress_rate/falcon_180b.zh-Hans.json +1 -0
stats/compress_rate/falcon_7b.en.json +1 -0
stats/compress_rate/falcon_7b.zh-Hans.json +1 -0
stats/compress_rate/fastchat_t5_3b.en.json +1 -0

README.md CHANGED Viewed

@@ -14,9 +14,17 @@ pinned: false
 ## 压缩率 Compress Rate
-在 [cc-100](https://huggingface.co/datasets/cc100) 数据集，每个语言取1万条数据，测试不同tokenizer的压缩率。压缩率指标 `g_bytes/b_tokens`
-您可通过以下脚本进行复现
 ```sh
 python utils/compress_rate_util.py
 ```
@@ -24,92 +32,177 @@ python utils/compress_rate_util.py
 <details> <summary>简体中文压缩率</summary>
 在简体中文数据集 cc100-zh-Hans 计算压缩率
 </details>
-| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   t_bytes/t_tokens |   b_tokens/g_bytes |
-|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
-| amber                       |        32000 |               1.84 |               1.8  |               0.54 |
-| aya_101                     |       250100 |               3.89 |               3.79 |               0.26 |
-| baichuan                    |        64000 |               3.92 |               3.82 |               0.26 |
-| baichuan2                   |       125696 |               4.53 |               4.42 |               0.22 |
-| bert_base_cased             |        28996 |               2.73 |               2.66 |               0.37 |
-| bert_base_chinese           |        21128 |               2.74 |               2.67 |               0.37 |
-| bert_base_uncased           |        30522 |               2.73 |               2.67 |               0.37 |
-| bloom                       |       250680 |               4.28 |               4.18 |               0.23 |
-| byt5_small                  |          256 |               0.93 |               0.91 |               1.08 |
-| character_glm_6b            |        64794 |               4.2  |               4.1  |               0.24 |
-| chatglm2_6b                 |        64794 |               4.2  |               4.1  |               0.24 |
-| chatglm3_6b                 |        64798 |               4.2  |               4.1  |               0.24 |
-| chatglm_6b                  |       150344 |               4.65 |               4.54 |               0.22 |
-| chatyuan_large_v2           |        32128 |               4.34 |               4.24 |               0.23 |
-| chinese_llama               |        49953 |               3.93 |               3.84 |               0.25 |
-| chinese_llama2              |        55296 |               3.92 |               3.83 |               0.26 |
-| code_davinci_002            |        50281 |               1.31 |               1.28 |               0.77 |
-| crystal_coder               |        32000 |               1.86 |               1.81 |               0.54 |
-| deepseek_coder_33b_instruct |        32000 |               3.4  |               3.32 |               0.29 |
-| deepseek_llm_7b_base        |       100000 |               4.05 |               3.96 |               0.25 |
-| falcon_180b                 |        65024 |               2.18 |               2.13 |               0.46 |
-| falcon_7b                   |        65024 |               2.18 |               2.13 |               0.46 |
-| fastchat_t5_3b              |        32000 |              13.7  |              13.38 |               0.07 |
-| flan_t5_base                |        32100 |              14.13 |              13.8  |               0.07 |
-| gemma_7b                    |       256000 |               3.82 |               3.73 |               0.26 |
-| gpt2                        |        50257 |               1.31 |               1.28 |               0.77 |
-| gpt2_chinese                |        21128 |               2.73 |               2.66 |               0.37 |
-| gpt_35_turbo                |       100277 |               2.26 |               2.21 |               0.44 |
-| gpt_4                       |       100277 |               2.26 |               2.21 |               0.44 |
-| gpt_nexo_20b                |        50254 |               2.01 |               1.96 |               0.5  |
-| internlm2_chat_7b           |        92544 |               4.23 |               4.13 |               0.24 |
-| internlm2_math_7b           |        92544 |               4.23 |               4.13 |               0.24 |
-| internlm_chat_7b            |       103168 |               4.23 |               4.14 |               0.24 |
-| internlm_xcomposer_7b       |       103168 |               4.23 |               4.14 |               0.24 |
-| kplug                       |        10261 |               2.72 |               2.65 |               0.37 |
-| llama                       |        32000 |               1.84 |               1.8  |               0.54 |
-| llama2                      |        32000 |               1.84 |               1.8  |               0.54 |
-| mistral_7b                  |        32000 |               2.36 |               2.3  |               0.42 |
-| mixtral_8_7b                |        32000 |               2.36 |               2.3  |               0.42 |
-| mobilebert_uncased          |        30522 |               2.73 |               2.67 |               0.37 |
-| moss                        |       106029 |               4.4  |               4.3  |               0.23 |
-| mt5_large                   |       250100 |               3.89 |               3.79 |               0.26 |
-| olmo_7b                     |        50280 |               2.01 |               1.96 |               0.5  |
-| orion_14b_chat              |        84608 |               4.63 |               4.52 |               0.22 |
-| phi_1                       |        50257 |               1.31 |               1.28 |               0.77 |
-| phi_2                       |        50257 |               1.31 |               1.28 |               0.77 |
-| pko_t5_large                |        50258 |               0.97 |               0.95 |               1.03 |
-| prompt_clue                 |        32128 |               4.34 |               4.24 |               0.23 |
-| qwen1_5_14b_chat            |       151643 |               4.16 |               4.06 |               0.24 |
-| qwen_1_8b_chat              |       151851 |               4.16 |               4.06 |               0.24 |
-| qwen_72b_chat               |       151851 |               4.16 |               4.06 |               0.24 |
-| qwen_7b_chat                |       151851 |               4.16 |               4.06 |               0.24 |
-| roberta_chinese_clue        |         8021 |               2.7  |               2.64 |               0.37 |
-| skywork_13b_base            |        65519 |               3.69 |               3.61 |               0.27 |
-| skywork_13b_math            |        65519 |               3.69 |               3.61 |               0.27 |
-| solar_10_7b                 |        32000 |               2.36 |               2.3  |               0.42 |
-| starchat_alpha              |        49152 |               2.78 |               2.72 |               0.36 |
-| switch_c_2048               |        32100 |              14.13 |              13.8  |               0.07 |
-| t5_base                     |        32100 |              14.13 |              13.8  |               0.07 |
-| t5_large                    |        32100 |              14.13 |              13.8  |               0.07 |
-| t5_small                    |        32100 |              14.13 |              13.8  |               0.07 |
-| text_davinci_003            |        50281 |               1.31 |               1.28 |               0.77 |
-| tigerbot_13b_chat_v2        |        60512 |               4.25 |               4.15 |               0.24 |
-| tigerbot_70b_chat_v4_4k     |        65107 |               4.25 |               4.15 |               0.24 |
-| wizardcoder_15b_v1          |        49152 |               2.78 |               2.72 |               0.36 |
-| wizardcoder_python_7b_v1    |        32000 |               1.84 |               1.8  |               0.54 |
-| wizardlm_7b_v1              |        32000 |               1.84 |               1.8  |               0.54 |
-| wizardmath_70b_v1           |        32000 |               1.84 |               1.8  |               0.54 |
-| xlm_roberta                 |       250002 |               3.96 |               3.86 |               0.25 |
-| yi_34b                      |        64000 |               4.17 |               4.07 |               0.24 |
-| yi_6b                       |        64000 |               4.17 |               4.07 |               0.24 |
-| yi_vl34b                    |        64000 |               4.11 |               4.02 |               0.24 |
-| zephyr_7b_beta              |        32000 |               2.36 |               2.3  |               0.42 |
-**结论**
-larger vocabulary sizes

 ## 压缩率 Compress Rate
+在 [cc-100](https://huggingface.co/datasets/cc100) 数据集，每个语言取1万条数据，测试不同tokenizer的压缩率。
+> 压缩率示例：
+llama3扩充了词典，具有更高的压缩比。同样1T字节的简体中文语料，llama分词后是 0.56万亿个token，llama3只需要0.31万亿个token。
+| tokenizer                    |   vocab_size |    t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
+|:-----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
+| llama                        |        32000 |               1.8  |               0.56 |               0.7  |
+| llama3                       |       128000 |               3.2  |               0.31 |               1.24 |
+可通过以下脚本进行复现
 ```sh
 python utils/compress_rate_util.py
 ```
+<details> <summary>英文压缩率</summary>
+在英文数据集 cc100-en 计算压缩率
+| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   b_tokens/g_bytes |   t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
+|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
+| amber                       |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| aya_101                     |       250100 |               3.3  |               0.3  |               3.22 |               0.31 |               3.53 |
+| baichuan                    |        64000 |               3.74 |               0.27 |               3.65 |               0.27 |               4    |
+| baichuan2                   |       125696 |               3.89 |               0.26 |               3.8  |               0.26 |               4.17 |
+| bert_base_cased             |        28996 |               3.64 |               0.27 |               3.55 |               0.28 |               3.89 |
+| bert_base_chinese           |        21128 |               2.78 |               0.36 |               2.71 |               0.37 |               2.97 |
+| bert_base_uncased           |        30522 |               3.73 |               0.27 |               3.65 |               0.27 |               4    |
+| bloom                       |       250680 |               4.07 |               0.25 |               3.97 |               0.25 |               4.36 |
+| byt5_small                  |          256 |               0.92 |               1.08 |               0.9  |               1.11 |               0.99 |
+| character_glm_6b            |        64794 |               3.62 |               0.28 |               3.54 |               0.28 |               3.88 |
+| chatglm2_6b                 |        64794 |               3.62 |               0.28 |               3.54 |               0.28 |               3.88 |
+| chatglm3_6b                 |        64798 |               3.62 |               0.28 |               3.54 |               0.28 |               3.88 |
+| chatglm_6b                  |       150344 |               3.68 |               0.27 |               3.59 |               0.28 |               3.94 |
+| chatyuan_large_v2           |        32128 |               1.95 |               0.51 |               1.91 |               0.52 |               2.09 |
+| chinese_llama               |        49953 |               3.59 |               0.28 |               3.51 |               0.28 |               3.85 |
+| chinese_llama2              |        55296 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| code_davinci_002            |        50281 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
+| crystal_coder               |        32000 |               3.68 |               0.27 |               3.59 |               0.28 |               3.94 |
+| dbrx_instruct               |       100277 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
+| deepseek_coder_33b_instruct |        32000 |               3.64 |               0.27 |               3.56 |               0.28 |               3.9  |
+| deepseek_llm_7b_base        |       100000 |               3.85 |               0.26 |               3.76 |               0.27 |               4.12 |
+| falcon_180b                 |        65024 |               3.99 |               0.25 |               3.9  |               0.26 |               4.27 |
+| falcon_7b                   |        65024 |               3.99 |               0.25 |               3.9  |               0.26 |               4.27 |
+| fastchat_t5_3b              |        32000 |               2.16 |               0.46 |               2.11 |               0.47 |               2.31 |
+| flan_t5_base                |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
+| gemma_7b                    |       256000 |               3.91 |               0.26 |               3.82 |               0.26 |               4.18 |
+| gpt2                        |        50257 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
+| gpt2_chinese                |        21128 |               2.67 |               0.37 |               2.61 |               0.38 |               2.86 |
+| gpt_35_turbo                |       100277 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
+| gpt_4                       |       100277 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
+| gpt_nexo_20b                |        50254 |               4.04 |               0.25 |               3.94 |               0.25 |               4.32 |
+| grok_1                      |       131072 |               4.06 |               0.25 |               3.96 |               0.25 |               4.35 |
+| internlm2_chat_7b           |        92544 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
+| internlm2_math_7b           |        92544 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
+| internlm_chat_7b            |       103168 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
+| internlm_xcomposer_7b       |       103168 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
+| jamba_v0_1                  |        65536 |               3.82 |               0.26 |               3.73 |               0.27 |               4.09 |
+| kplug                       |        10261 |               2.66 |               0.38 |               2.6  |               0.38 |               2.85 |
+| llama                       |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| llama2                      |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| llama3                      |       128000 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
+| mistral_7b                  |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
+| mixtral_8_7b                |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
+| mobilebert_uncased          |        30522 |               3.73 |               0.27 |               3.65 |               0.27 |               4    |
+| moss                        |       106029 |               4.08 |               0.25 |               3.98 |               0.25 |               4.36 |
+| mt5_large                   |       250100 |               3.3  |               0.3  |               3.22 |               0.31 |               3.53 |
+| olmo_7b                     |        50280 |               4.04 |               0.25 |               3.94 |               0.25 |               4.32 |
+| orion_14b_chat              |        84608 |               3.94 |               0.25 |               3.85 |               0.26 |               4.22 |
+| phi_1                       |        50257 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
+| phi_2                       |        50257 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
+| pko_t5_large                |        50258 |               1.59 |               0.63 |               1.55 |               0.64 |               1.7  |
+| prompt_clue                 |        32128 |               1.95 |               0.51 |               1.91 |               0.52 |               2.09 |
+| qwen1_5_14b_chat            |       151643 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
+| qwen_1_8b_chat              |       151851 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
+| qwen_72b_chat               |       151851 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
+| qwen_7b_chat                |       151851 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
+| roberta_chinese_clue        |         8021 |               1.8  |               0.56 |               1.75 |               0.57 |               1.92 |
+| skywork_13b_base            |        65519 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| skywork_13b_math            |        65519 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| solar_10_7b                 |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
+| starchat_alpha              |        49152 |               3.63 |               0.28 |               3.54 |               0.28 |               3.88 |
+| switch_c_2048               |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
+| t5_base                     |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
+| t5_large                    |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
+| t5_small                    |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
+| text_davinci_003            |        50281 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
+| tigerbot_13b_chat_v2        |        60512 |               3.67 |               0.27 |               3.58 |               0.28 |               3.93 |
+| tigerbot_70b_chat_v4_4k     |        65107 |               3.65 |               0.27 |               3.57 |               0.28 |               3.91 |
+| wizardcoder_15b_v1          |        49152 |               3.63 |               0.28 |               3.54 |               0.28 |               3.88 |
+| wizardcoder_python_7b_v1    |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| wizardlm_7b_v1              |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| wizardmath_70b_v1           |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
+| xlm_roberta                 |       250002 |               3.49 |               0.29 |               3.41 |               0.29 |               3.74 |
+| yi_34b                      |        64000 |               3.87 |               0.26 |               3.78 |               0.26 |               4.15 |
+| yi_6b                       |        64000 |               3.87 |               0.26 |               3.78 |               0.26 |               4.15 |
+| yi_vl34b                    |        64000 |               3.88 |               0.26 |               3.79 |               0.26 |               4.16 |
+| zephyr_7b_beta              |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
+</details>
 <details> <summary>简体中文压缩率</summary>
 在简体中文数据集 cc100-zh-Hans 计算压缩率
+| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   b_tokens/g_bytes |   t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
+|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
+| amber                       |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
+| aya_101                     |       250100 |               3.89 |               0.26 |               3.79 |               0.26 |               1.47 |
+| baichuan                    |        64000 |               3.92 |               0.26 |               3.82 |               0.26 |               1.48 |
+| baichuan2                   |       125696 |               4.53 |               0.22 |               4.42 |               0.23 |               1.71 |
+| bert_base_cased             |        28996 |               2.73 |               0.37 |               2.66 |               0.38 |               1.03 |
+| bert_base_chinese           |        21128 |               2.74 |               0.37 |               2.67 |               0.37 |               1.03 |
+| bert_base_uncased           |        30522 |               2.73 |               0.37 |               2.67 |               0.38 |               1.03 |
+| bloom                       |       250680 |               4.28 |               0.23 |               4.18 |               0.24 |               1.62 |
+| byt5_small                  |          256 |               0.93 |               1.08 |               0.91 |               1.1  |               0.35 |
+| character_glm_6b            |        64794 |               4.2  |               0.24 |               4.1  |               0.24 |               1.59 |
+| chatglm2_6b                 |        64794 |               4.2  |               0.24 |               4.1  |               0.24 |               1.59 |
+| chatglm3_6b                 |        64798 |               4.2  |               0.24 |               4.1  |               0.24 |               1.59 |
+| chatglm_6b                  |       150344 |               4.65 |               0.22 |               4.54 |               0.22 |               1.76 |
+| chatyuan_large_v2           |        32128 |               4.34 |               0.23 |               4.24 |               0.24 |               1.64 |
+| chinese_llama               |        49953 |               3.93 |               0.25 |               3.84 |               0.26 |               1.49 |
+| chinese_llama2              |        55296 |               3.92 |               0.26 |               3.83 |               0.26 |               1.48 |
+| code_davinci_002            |        50281 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
+| crystal_coder               |        32000 |               1.86 |               0.54 |               1.81 |               0.55 |               0.7  |
+| dbrx_instruct               |       100277 |               2.26 |               0.44 |               2.21 |               0.45 |               0.85 |
+| deepseek_coder_33b_instruct |        32000 |               3.4  |               0.29 |               3.32 |               0.3  |               1.29 |
+| deepseek_llm_7b_base        |       100000 |               4.05 |               0.25 |               3.96 |               0.25 |               1.53 |
+| falcon_180b                 |        65024 |               2.18 |               0.46 |               2.13 |               0.47 |               0.82 |
+| falcon_7b                   |        65024 |               2.18 |               0.46 |               2.13 |               0.47 |               0.82 |
+| fastchat_t5_3b              |        32000 |              13.7  |               0.07 |              13.38 |               0.07 |               5.18 |
+| flan_t5_base                |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
+| gemma_7b                    |       256000 |               3.82 |               0.26 |               3.73 |               0.27 |               1.44 |
+| gpt2                        |        50257 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
+| gpt2_chinese                |        21128 |               2.73 |               0.37 |               2.66 |               0.38 |               1.03 |
+| gpt_35_turbo                |       100277 |               2.26 |               0.44 |               2.21 |               0.45 |               0.85 |
+| gpt_4                       |       100277 |               2.26 |               0.44 |               2.21 |               0.45 |               0.85 |
+| gpt_nexo_20b                |        50254 |               2.01 |               0.5  |               1.96 |               0.51 |               0.76 |
+| grok_1                      |       131072 |               1.73 |               0.58 |               1.69 |               0.59 |               0.66 |
+| internlm2_chat_7b           |        92544 |               4.23 |               0.24 |               4.13 |               0.24 |               1.6  |
+| internlm2_math_7b           |        92544 |               4.23 |               0.24 |               4.13 |               0.24 |               1.6  |
+| internlm_chat_7b            |       103168 |               4.23 |               0.24 |               4.14 |               0.24 |               1.6  |
+| internlm_xcomposer_7b       |       103168 |               4.23 |               0.24 |               4.14 |               0.24 |               1.6  |
+| jamba_v0_1                  |        65536 |               2.3  |               0.44 |               2.24 |               0.45 |               0.87 |
+| kplug                       |        10261 |               2.72 |               0.37 |               2.65 |               0.38 |               1.03 |
+| llama                       |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
+| llama2                      |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
+| llama3                      |       128000 |               3.28 |               0.3  |               3.2  |               0.31 |               1.24 |
+| mistral_7b                  |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
+| mixtral_8_7b                |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
+| mobilebert_uncased          |        30522 |               2.73 |               0.37 |               2.67 |               0.38 |               1.03 |
+| moss                        |       106029 |               4.4  |               0.23 |               4.3  |               0.23 |               1.66 |
+| mt5_large                   |       250100 |               3.89 |               0.26 |               3.79 |               0.26 |               1.47 |
+| olmo_7b                     |        50280 |               2.01 |               0.5  |               1.96 |               0.51 |               0.76 |
+| orion_14b_chat              |        84608 |               4.63 |               0.22 |               4.52 |               0.22 |               1.75 |
+| phi_1                       |        50257 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
+| phi_2                       |        50257 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
+| pko_t5_large                |        50258 |               0.97 |               1.03 |               0.95 |               1.06 |               0.37 |
+| prompt_clue                 |        32128 |               4.34 |               0.23 |               4.24 |               0.24 |               1.64 |
+| qwen1_5_14b_chat            |       151643 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
+| qwen_1_8b_chat              |       151851 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
+| qwen_72b_chat               |       151851 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
+| qwen_7b_chat                |       151851 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
+| roberta_chinese_clue        |         8021 |               2.7  |               0.37 |               2.64 |               0.38 |               1.02 |
+| skywork_13b_base            |        65519 |               3.69 |               0.27 |               3.61 |               0.28 |               1.4  |
+| skywork_13b_math            |        65519 |               3.69 |               0.27 |               3.61 |               0.28 |               1.4  |
+| solar_10_7b                 |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
+| starchat_alpha              |        49152 |               2.78 |               0.36 |               2.72 |               0.37 |               1.05 |
+| switch_c_2048               |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
+| t5_base                     |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
+| t5_large                    |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
+| t5_small                    |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
+| text_davinci_003            |        50281 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
+| tigerbot_13b_chat_v2        |        60512 |               4.25 |               0.24 |               4.15 |               0.24 |               1.61 |
+| tigerbot_70b_chat_v4_4k     |        65107 |               4.25 |               0.24 |               4.15 |               0.24 |               1.61 |
+| wizardcoder_15b_v1          |        49152 |               2.78 |               0.36 |               2.72 |               0.37 |               1.05 |
+| wizardcoder_python_7b_v1    |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
+| wizardlm_7b_v1              |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
+| wizardmath_70b_v1           |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
+| xlm_roberta                 |       250002 |               3.96 |               0.25 |               3.86 |               0.26 |               1.5  |
+| yi_34b                      |        64000 |               4.17 |               0.24 |               4.07 |               0.25 |               1.58 |
+| yi_6b                       |        64000 |               4.17 |               0.24 |               4.07 |               0.25 |               1.58 |
+| yi_vl34b                    |        64000 |               4.11 |               0.24 |               4.02 |               0.25 |               1.56 |
+| zephyr_7b_beta              |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
 </details>

app.py CHANGED Viewed

@@ -78,13 +78,13 @@ with gr.Blocks(css="css/style.css", title="Tokenizer Arena") as demo:
         gr.Markdown("Please select corpus and unit of compress rate, get more details at [github](https://github.com/xu-song/tokenizer-arena/). ")
         with gr.Row():
             compress_rate_corpus = gr.CheckboxGroup(
-                ["cc100-en", "cc100-zh-Hans", "cc100-es", "code"],
                 value=["cc100-en", "cc100-zh-Hans"],
                 label="corpus",
                 # info=""
             )
             compress_rate_unit = gr.Radio(
-                ["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes", "t_bytes/t_tokens"],
                 value="b_tokens/g_bytes",
                 label="unit",
             )

         gr.Markdown("Please select corpus and unit of compress rate, get more details at [github](https://github.com/xu-song/tokenizer-arena/). ")
         with gr.Row():
             compress_rate_corpus = gr.CheckboxGroup(
+                ["cc100-en", "cc100-zh-Hans", "cc100-es"],  # , "code"
                 value=["cc100-en", "cc100-zh-Hans"],
                 label="corpus",
                 # info=""
             )
             compress_rate_unit = gr.Radio(
+                ["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes", "t_bytes/t_tokens", "n_chars/n_tokens"],
                 value="b_tokens/g_bytes",
                 label="unit",
             )

stats/README.md ADDED Viewed

File without changes

stats/compress_rate/amber.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 294627, "n_chars": 1121360}

stats/compress_rate/amber.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1330093, "n_chars": 927311}

stats/compress_rate/aya_101.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 250100, "n_bytes": 1124813, "n_tokens": 317881, "n_chars": 1121360}

stats/compress_rate/aya_101.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 250100, "n_bytes": 2633047, "n_tokens": 631182, "n_chars": 927311}

stats/compress_rate/baichuan.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64000, "n_bytes": 1124813, "n_tokens": 280108, "n_chars": 1121360}

stats/compress_rate/baichuan.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64000, "n_bytes": 2633047, "n_tokens": 626117, "n_chars": 927311}

stats/compress_rate/baichuan2.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 125696, "n_bytes": 1124813, "n_tokens": 269011, "n_chars": 1121360}

stats/compress_rate/baichuan2.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 125696, "n_bytes": 2633047, "n_tokens": 541464, "n_chars": 927311}

stats/compress_rate/bert_base_cased.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 28996, "n_bytes": 1124813, "n_tokens": 288022, "n_chars": 1121360}

stats/compress_rate/bert_base_cased.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 28996, "n_bytes": 2633047, "n_tokens": 899709, "n_chars": 927311}

stats/compress_rate/bert_base_chinese.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 21128, "n_bytes": 1124813, "n_tokens": 377068, "n_chars": 1121360}

stats/compress_rate/bert_base_chinese.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 21128, "n_bytes": 2633047, "n_tokens": 896599, "n_chars": 927311}

stats/compress_rate/bert_base_uncased.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 30522, "n_bytes": 1124813, "n_tokens": 280575, "n_chars": 1121360}

stats/compress_rate/bert_base_uncased.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 30522, "n_bytes": 2633047, "n_tokens": 898554, "n_chars": 927311}

stats/compress_rate/bloom.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 250680, "n_bytes": 1124813, "n_tokens": 257405, "n_chars": 1121360}

stats/compress_rate/bloom.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 250680, "n_bytes": 2633047, "n_tokens": 573008, "n_chars": 927311}

stats/compress_rate/byt5_small.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 256, "n_bytes": 1124813, "n_tokens": 1134813, "n_chars": 1121360}

stats/compress_rate/byt5_small.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 256, "n_bytes": 2633047, "n_tokens": 2643047, "n_chars": 927311}

stats/compress_rate/character_glm_6b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64794, "n_bytes": 1124813, "n_tokens": 289347, "n_chars": 1121360}

stats/compress_rate/character_glm_6b.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64794, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}

stats/compress_rate/chatglm2_6b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64794, "n_bytes": 1124813, "n_tokens": 289329, "n_chars": 1121360}

stats/compress_rate/chatglm2_6b.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64794, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}

stats/compress_rate/chatglm3_6b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64798, "n_bytes": 1124813, "n_tokens": 289347, "n_chars": 1121360}

stats/compress_rate/chatglm3_6b.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 64798, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}

stats/compress_rate/chatglm_6b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 150344, "n_bytes": 1124813, "n_tokens": 284761, "n_chars": 1121360}

stats/compress_rate/chatglm_6b.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 150344, "n_bytes": 2633047, "n_tokens": 527384, "n_chars": 927311}

stats/compress_rate/chatyuan_large_v2.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32128, "n_bytes": 1124813, "n_tokens": 536033, "n_chars": 1121360}

stats/compress_rate/chatyuan_large_v2.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32128, "n_bytes": 2633047, "n_tokens": 564905, "n_chars": 927311}

stats/compress_rate/chinese_llama.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 49953, "n_bytes": 1124813, "n_tokens": 291514, "n_chars": 1121360}

stats/compress_rate/chinese_llama.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 49953, "n_bytes": 2633047, "n_tokens": 623219, "n_chars": 927311}

stats/compress_rate/chinese_llama2.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 55296, "n_bytes": 1124813, "n_tokens": 294627, "n_chars": 1121360}

stats/compress_rate/chinese_llama2.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 55296, "n_bytes": 2633047, "n_tokens": 625766, "n_chars": 927311}

stats/compress_rate/code_davinci_002.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 50281, "n_bytes": 1124813, "n_tokens": 258403, "n_chars": 1121360}

stats/compress_rate/code_davinci_002.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 50281, "n_bytes": 2633047, "n_tokens": 1876809, "n_chars": 927311}

stats/compress_rate/crystal_coder.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 284627, "n_chars": 1121360}

stats/compress_rate/crystal_coder.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1320093, "n_chars": 927311}

stats/compress_rate/dbrx_instruct.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 100277, "n_bytes": 1124813, "n_tokens": 254985, "n_chars": 1121360}

stats/compress_rate/dbrx_instruct.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 100277, "n_bytes": 2633047, "n_tokens": 1084939, "n_chars": 927311}

stats/compress_rate/deepseek_coder_33b_instruct.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 287408, "n_chars": 1121360}

stats/compress_rate/deepseek_coder_33b_instruct.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 720577, "n_chars": 927311}

stats/compress_rate/deepseek_llm_7b_base.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 100000, "n_bytes": 1124813, "n_tokens": 272324, "n_chars": 1121360}

stats/compress_rate/deepseek_llm_7b_base.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 100000, "n_bytes": 2633047, "n_tokens": 605081, "n_chars": 927311}

stats/compress_rate/falcon_180b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 65024, "n_bytes": 1124813, "n_tokens": 262509, "n_chars": 1121360}

stats/compress_rate/falcon_180b.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 65024, "n_bytes": 2633047, "n_tokens": 1124681, "n_chars": 927311}

stats/compress_rate/falcon_7b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 65024, "n_bytes": 1124813, "n_tokens": 262509, "n_chars": 1121360}

stats/compress_rate/falcon_7b.zh-Hans.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 65024, "n_bytes": 2633047, "n_tokens": 1124681, "n_chars": 927311}

stats/compress_rate/fastchat_t5_3b.en.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 484941, "n_chars": 1121360}