	---
	pipeline_tag: text-generation
	license: apache-2.0
	language:
	- zh
	---

	# Model Card for Breeze-7B-Instruct-v0.1

	Breeze-7B is a language model that builds upon the foundation of [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), specifically enhanced for Traditional Chinese.

	[Breeze-7B-Base-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v0.1) expands the vocabulary with an additional 30,000 Traditional Chinese tokens and
	is pre-trained on 250GB of Traditional Chinese content.
	With the expanded vocabulary, the base model runs inference on Traditional Chinese text about twice as fast as Mistral-7B. [See [Inference Performance](#inference-performance).]
	This is the first instance of vocabulary expansion in a model tailored for Traditional Chinese.

	[Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) derives from the base model Breeze-7B-Base-v0.1
	and has undergone supervised fine-tuning on over 1 million instances.
	The fine-tuned model performs well on both English and Traditional Chinese benchmarks, surpassing
	Taiwan-LLM-7B-v2.1-chat, Taiwan-LLM-13B-v2.0-chat, and Qwen-7B-Chat in Traditional Chinese assessments. It also outperforms Yi-6B-Chat on some benchmarks.
	In English evaluations, Breeze-7B-Instruct-v0.1 is comparable to Mistral-7B-Instruct-v0.1 on the MMLU and MT-Bench benchmarks. [See [Chat Model Performance](#chat-model-performance).]


	[Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) extends Breeze-7B-Instruct-v0.1
	to a 64k-token context length, equivalent to about 88k Traditional Chinese characters. With minimal loss on the regular benchmarks,
	Breeze-7B-Instruct-64k-v0.1 can handle tasks such as question answering and summarization on document-level inputs. [See [Long-context Performance](#long-context-performance).]


	A project by the members (in alphabetical order): Chan-Jan Hsu 許湛然, Chang-Le Liu 劉昶樂, Feng-Ting Liao 廖峰挺, Po-Chun Hsu 許博竣, Yi-Chang Chen 陳宜昌, and the supervisor Da-Shan Shiu 許大山.

	## Features

	- Breeze-7B-Base-v0.1
	  - Expanding the vocabulary dictionary size from 32k to 62k to better support Traditional Chinese
	  - 8k tokens context length
	- Breeze-7B-Instruct-v0.1
	  - Expanding the vocabulary dictionary size from 32k to 62k to better support Traditional Chinese
	  - 8k tokens context length
	  - Multi-turn dialogue (without special handling for harmfulness)
	- Breeze-7B-Instruct-64k-v0.1
	  - Expanding the vocabulary dictionary size from 32k to 62k to better support Traditional Chinese
	  - 64k tokens context length
	  - Multi-turn dialogue (without special handling for harmfulness)

	## Model Details

	- Breeze-7B-Base-v0.1
	  - Finetuned from: [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
	  - Model type: Causal decoder-only transformer language model
	  - Language: English and Traditional Chinese (zh-tw)
	- Breeze-7B-Instruct-v0.1
	  - Finetuned from: [MediaTek-Research/Breeze-7B-Base-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v0.1)
	  - Model type: Causal decoder-only transformer language model
	  - Language: English and Traditional Chinese (zh-tw)
	- Breeze-7B-Instruct-64k-v0.1
	  - Finetuned from: [MediaTek-Research/Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1)
	  - Model type: Causal decoder-only transformer language model
	  - Language: English and Traditional Chinese (zh-tw)

	## Base Model Performance

	TMMLU+, DRCD, and Table are sourced from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2),
	which derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
	and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). MMLU is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
	We evaluate TMMLU+, DRCD, Table, and MMLU with code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).


	\| Models \| Size \|↑ TMMLU+ (ACC) \| DRCD (EM) \| Table (ACC) \| MMLU (ACC) \|
	\|----------------------------------------------\|--------\|--------------\|-------------\|-------------\|------------\|
	\| \| \|TC, Knowledge \|TC, Reasoning\|TC, Reasoning\|EN, Knowledge\|
	\| \| \| 5 shot \| 3 shot \| 5 shot \| 5 shot \|
	\| [Yi-34B](https://huggingface.co/01-ai/Yi-34B)\| 34B \| 63.10 \| 84.57 \| 49.31 \| 77.42 \|
	\| [Qwen-14B](https://huggingface.co/Qwen/Qwen-14B)\| 14B \| 51.30 \| 16.95 * \| 50.69 \| 68.83 \|
	\| [Yi-6B](https://huggingface.co/01-ai/Yi-6B) \| 6B \| 49.63 \| 76.61 \| 34.72 \| 65.35 \|
	\| [Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)\| 7B \| 42.84 \| 0.0 * \| 39.58 \| 61.00 \|
	\| [Breeze-7B-Base-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v0.1) \| 7B \| 40.35 \| 81.13 \| 28.47 \| 61.63 \|
	\| [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)\| 7B \| 36.93 \| 79.27 \| 27.78 \| 64.89 \|


	\* Few-shot prompting did not effectively guide the model to generate a proper answer.

	Category ACC of TMMLU+ (5 shot)

	\| Models \| STEM \| Social Science \| Humanities \| Other \| ↑ AVG \|
	\|----------------------------------\|--------------\|----------------\|------------\|------------\|-------\|
	\| Yi-34B \| 56.03 \| 73.06 \| 61.12 \| 62.19 \| 63.10 \|
	\| Qwen-14B \| 46.51 \| 58.20 \| 51.12 \| 49.38 \| 51.30 \|
	\| Yi-6B \| 41.14 \| 57.77 \| 50.22 \| 49.39 \| 49.63 \|
	\| Qwen-7B \| 28.25 \| 47.80 \| 43.14 \| 42.17 \| 42.84 \|
	\| Breeze-7B-Base-v0.1 \| 35.74 \| 46.08 \| 40.29 \| 39.27 \| 40.35 \|
	\| Mistral-7B-v0.1 \| 33.01 \| 42.23 \| 35.86 \| 37.63 \| 36.93 \|




	## Chat Model Performance

	TMMLU+, DRCD, Table, and MT-Bench-tw are sourced from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2),
	which derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
	and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). MMLU is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
	MT-Bench is sourced from [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
	We evaluate TMMLU+, DRCD, Table, and MMLU with code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
	We evaluate MT-Bench-tw and MT-Bench with code adapted from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), using GPT-4 as the judge.


	\| Models \| Size \|↑ MT-Bench-tw (Score)\| TMMLU+ (ACC) \| TMMLU+ (ACC) \| DRCD (EM) \| Table (ACC) \| MT-Bench (Score) \| MMLU (ACC) \| MMLU (ACC) \|
	\|---------------------------------------------------------------------------------------------------------\|--------\|--------------------\|--------------\|--------------\|-------------\|-------------\|------------------\|-------------\|-------------\|
	\| \| \|TC, Chat \|TC, Knowledge \|TC, Knowledge \|TC, Reasoning\|TC, Reasoning\|EN, Chat \|EN, Knowledge\|EN, Knowledge\|
	\| \| \|0 shot \| 0 shot \| 5 shot \| 3 shot \| 0 shot \|0 shot \| 0 shot \| 5 shot \|
	\| [gpt-3.5-turbo](https://openai.com) \| \|7.1 \| 41.76 \| \| \| \|7.9 \| 70.00 \| \|
	\| [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) \| 34B \|6.9 \| 54.87 \| \| \| 36.81 \|7.6 \| 71.04 \| \|
	\| [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) \| 14B \|6.4 \| 48.41 \| \| \| 41.67 \|7.2 \| 64.91 \| \|
	\| [Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) \| 7B \|5.7 \| 41.61 \| \| \| 45.83 \|7.1 \| 63.26 \| \|
	\| [Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) \| 7B \|5.5 \| 40.99 \| \| \| 36.11 \|7.1 \| 63.68 \| \|
	\| [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) \| 7B \|5.4 \| 40.02 \| \| \| 33.33 \|6.2 \| 55.94 \| \|
	\| [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) \| 6B \|5.0 \| 44.79 \| \| \| 25.69 \|6.0 \| 59.45 \| \|
	\| [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) \| 13B \|5.0 \| 29.47 \| \| \| 23.61 \|-* \| 50.50 \| \|
	\| [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) \| 7B \|4.2 \| 28.08 \| \| \| 31.25 \| -* \| 42.72 \| \|

	\* Taiwan-LLM models respond to multi-turn English questions in Traditional Chinese.

	Category Score of MT-Bench-tw (0 shot)

	\| Models \| STEM \|Extraction\|Reasoning\| Math \| Coding \| Roleplay\| Writing \|Humanities\|↑ AVG \|
	\|-----------------------------------------------------\|---------\|---------\|---------\|---------\|---------\|---------\|---------\|---------\|---------\|
	\| gpt-3.5-turbo \| \| \| \| \| \| \| \| \| \|
	\| Yi-34B-Chat \| \| \| \| \| \| \| \| \| \|
	\| Qwen-14B-Chat \| \| \| \| \| \| \| \| \| \|
	\| Breeze-7B-Instruct-v0.1 \| \| \| \| \| \| \| \| \| \|
	\| Breeze-7B-Instruct-64k-v0.1 \| \| \| \| \| \| \| \| \| \|
	\| Qwen-7B-Chat \| \| \| \| \| \| \| \| \| \|
	\| Yi-6B-Chat \| \| \| \| \| \| \| \| \| \|
	\| Taiwan-LLM-13B-v2.0-chat \| \| \| \| \| \| \| \| \| \|
	\| Taiwan-LLM-7B-v2.1-chat \| \| \| \| \| \| \| \| \| \|

	Category ACC of TMMLU+ (0 shot)

	\| Model \| STEM \| Social Science \| Humanities \| Other \| ↑ AVG \|
	\|-----------------------------------------------------\|--------------\|----------------\|------------\|------------\|---------\|
	\| Yi-34B-Chat \| 47.65 \| 64.25 \| 52.73 \| 54.91 \| 54.87 \|
	\| Qwen-14B-Chat \| 43.83 \| 55.00 \| 48.55 \| 46.22 \| 48.41 \|
	\| Yi-6B-Chat \| 37.80 \| 51.74 \| 45.36 \| 44.25 \| 44.79 \|
	\| gpt-3.5-turbo \| 41.56 \| 46.72 \| 36.73 \| 42.03 \| 41.76 \|
	\| Breeze-7B-Instruct-v0.1 \| 37.41 \| 46.81 \| 42.06 \| 40.16 \| 41.61 \|
	\| Breeze-7B-Instruct-64k-v0.1 \| 37.88 \| 46.35 \| 40.31 \| 39.40 \| 40.99 \|
	\| Qwen-7B-Chat \| 35.44 \| 46.22 \| 38.35 \| 40.06 \| 40.02 \|
	\| Taiwan-LLM-13B-v2.0-chat \| 27.74 \| 33.69 \| 27.03 \| 29.43 \| 29.47 \|
	\| Taiwan-LLM-7B-v2.1-chat \| 25.58 \| 31.76 \| 27.36 \| 27.61 \| 28.08 \|



	## Inference Performance
	In this test, we use the first 700 characters of a [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
	All inference runs use 2 RTX A6000 GPUs (with `vllm` and a tensor-parallel size of 2).
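The setup described above could be reproduced roughly as follows. This is a minimal sketch, not the script behind the reported numbers: the instruction wording, the `max_tokens` value, and the helper names `make_rewrite_prompt` and `time_generation` are assumptions made for illustration. Only the 700-character cut-off, the use of `vllm`, and the tensor-parallel size of 2 come from the description.

```python
import time


def make_rewrite_prompt(article: str, n_chars: int = 700) -> str:
    """Keep the first n_chars characters and ask for the article to be rewritten."""
    excerpt = article[:n_chars]
    # Hypothetical instruction wording; the exact prompt is not given in this card.
    return f"{excerpt}\n\nPlease write the same article again."


def time_generation(model_name: str, prompt: str, max_tokens: int = 1024) -> float:
    """Return the wall-clock seconds spent generating one completion with vLLM."""
    # Imported lazily so make_rewrite_prompt works without vLLM installed.
    from vllm import LLM, SamplingParams

    # tensor_parallel_size=2 shards the model across two GPUs.
    llm = LLM(model=model_name, tensor_parallel_size=2)
    # Greedy decoding; max_tokens is an assumed cap, not a documented setting.
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)

    start = time.perf_counter()
    llm.generate([prompt], params)
    return time.perf_counter() - start
```

Model loading is kept outside the timed region so that only generation is measured. Calling `time_generation("MediaTek-Research/Breeze-7B-Instruct-v0.1", make_rewrite_prompt(article_text))` requires two GPUs and a local `vllm` installation.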

	\| Models \| ↓ Inference Time (sec)\|Estimated Max Input Length (Char)\|
	\|--------------------------------------------------------------------\|-------------------\|--------------------------\|
	\| Yi-6B \| 10.62 \| 5.2k \|
	\| Breeze-7B-Instruct-v0.1 \| 10.74 \| 11.1k \|
	\| Breeze-7B-Instruct-64k-v0.1 \| 10.74 \| 88.8k \|
	\| Qwen-7B \| 10.86 \| 9.8k \|
	\| Qwen-14B \| 18.89 \| 9.8k \|
	\| Mistral-7B-v0.1 \| 20.48 \| 5.1k \|
	\| Taiwan-LLM-7B-v2.1-base \| 26.26 \| 2.2k \|
	\| Taiwan-LLM-13B-v2.0-base \| 36.80 \| 2.2k \|
	\| Yi-34B \| 43.71 \| 4.5k \|
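As a quick sanity check, the figures in the table can be related to the claims made earlier in this card. The sketch below uses only numbers from the table and the Features section. Reading "8k" and "64k" as 8,000 and 64,000 tokens is an assumption, and the table lists the Instruct variant rather than the base model.

```python
# Inference times (seconds) copied from the table above.
inference_time_sec = {
    "Breeze-7B-Instruct-v0.1": 10.74,
    "Mistral-7B-v0.1": 20.48,
}

# Estimated max input length (characters) copied from the table above.
max_input_chars = {
    "Breeze-7B-Instruct-v0.1": 11_100,
    "Breeze-7B-Instruct-64k-v0.1": 88_800,
}

# Context lengths from the Features section.
# Assumption: "8k" and "64k" are read as 8,000 and 64,000 tokens.
context_tokens = {
    "Breeze-7B-Instruct-v0.1": 8_000,
    "Breeze-7B-Instruct-64k-v0.1": 64_000,
}

# Ratio of Mistral's time to Breeze's time on the same rewriting task.
speedup = inference_time_sec["Mistral-7B-v0.1"] / inference_time_sec["Breeze-7B-Instruct-v0.1"]

# Traditional Chinese characters that fit per token of context.
chars_per_token = {
    name: max_input_chars[name] / context_tokens[name] for name in max_input_chars
}

print(f"Speedup over Mistral-7B-v0.1: {speedup:.2f}x")
for name, ratio in chars_per_token.items():
    print(f"{name}: {ratio:.2f} characters per token")
```

Under these assumptions the speedup comes out at about 1.91x, in line with the "about twice" claim, and both Breeze variants work out to roughly 1.39 characters per token, which matches the 64k-token and 88k-character figures quoted for the long-context model.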

	## Long-context Performance

	TBD

	## Examples

	TBD

	## Use in Transformers

	First install direct dependencies:
	```bash
	pip install transformers torch accelerate
	```
	If you want faster inference using flash-attention2, you need to install these dependencies:
	```bash
	pip install packaging ninja
	pip install flash-attn
	```
	Then load the model in transformers:
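	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_id = "MediaTek-Research/Breeze-7B-Instruct-v0.1"

	# Load the tokenizer that ships with the model (includes the expanded vocabulary).
	tokenizer = AutoTokenizer.from_pretrained(model_id)

	# The repository ID is the first positional argument of from_pretrained.
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",  # optional; remove if flash-attn is not installed
	)
	```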

	The structure of the query template follows that of Mistral-7B-Instruct, as shown below.
	```txt
	<s> SYS_PROMPT [INST] QUERY1 [/INST] RESPONSE1 [INST] QUERY2 [/INST]
	```
	where `SYS_PROMPT`, `QUERY1`, `RESPONSE1`, and `QUERY2` can be provided by the user.

	The suggested default `SYS_PROMPT` is
	```txt
	You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.
	```
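The template above can be assembled programmatically. The following is a minimal sketch: the helper names `build_prompt` and `generate_reply` are hypothetical, and the generation settings are assumptions rather than recommended values. It also assumes the tokenizer maps the literal `<s>` in the prompt to its beginning-of-sequence token, which is why `add_special_tokens=False` is passed to avoid adding a second one.

```python
from typing import List, Tuple

# Suggested default system prompt, copied from this model card.
DEFAULT_SYS_PROMPT = (
    "You are a helpful AI assistant built by MediaTek Research. "
    "The user you are helping speaks Traditional Chinese and comes from Taiwan."
)


def build_prompt(
    sys_prompt: str,
    history: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble: <s> SYS_PROMPT [INST] Q1 [/INST] R1 ... [INST] Qn [/INST]."""
    parts = ["<s>", sys_prompt]
    # Earlier turns contribute both the user query and the model response.
    for past_query, past_response in history:
        parts += ["[INST]", past_query, "[/INST]", past_response]
    # The final query is left open for the model to answer.
    parts += ["[INST]", query, "[/INST]"]
    return " ".join(parts)


def generate_reply(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a reply with an already loaded Transformers model and tokenizer."""
    # The prompt already contains <s>, so no extra special tokens are added
    # (assumption: the tokenizer parses the literal <s> as its BOS token).
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For a single-turn question, `build_prompt(DEFAULT_SYS_PROMPT, [], "台灣最高的山是哪一座？")` yields a prompt that can be passed to `generate_reply` together with the model and tokenizer loaded earlier.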

	## Citation

	```
	@article{breeze7b2024,
	title={},
	author={},
	journal={arXiv},
	year={2024}
	}
	```