---
language:
- ja
- en
license: cc-by-4.0
datasets:
- cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental
base_model: cyberagent/calm2-7b-chat
model-index:
- name: calm2-7b-chat-dpo-experimental
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 41.04
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 68.99
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 39.82
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 43.13
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.67
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 5.53
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
---

# Model Card for "calm2-7b-chat-dpo-experimental"

This model was obtained by applying [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) to [cyberagent/calm2-7b-chat](https://huggingface.co/cyberagent/calm2-7b-chat) using the [cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental](https://huggingface.co/datasets/cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental) dataset.
[Low-Rank Adaptation (LoRA)](https://huggingface.co/docs/peft/conceptual_guides/lora) was used for the DPO training.

## Requirements, Usage, Chat Template

The requirements are the same as for [cyberagent/calm2-7b-chat](https://huggingface.co/cyberagent/calm2-7b-chat); the model runs with the same code and prompt format.

```python
import transformers
from packaging import version
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Compare versions numerically rather than as raw strings.
assert version.parse(transformers.__version__) >= version.parse("4.34.1")

model = AutoModelForCausalLM.from_pretrained(
    "cyberagent/calm2-7b-chat-dpo-experimental", device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b-chat-dpo-experimental")
# Stream generated tokens to stdout, omitting the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = """USER: AIによって私達の暮らしはどのように変わりますか?
ASSISTANT: """

token_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(
    input_ids=token_ids.to(model.device),
    max_new_tokens=300,
    do_sample=True,
    temperature=0.8,
    streamer=streamer,
)
```
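## Training Sketch (DPO with LoRA)

As a rough illustration of the DPO-with-LoRA setup described above, here is a minimal sketch using TRL's `DPOTrainer` (0.7.x-era API) together with a `peft` LoRA config. The hyperparameters, LoRA target modules, split name, and dataset column names (`prompt`/`chosen`/`rejected`) are assumptions for illustration, not the actual training recipe of this model.

```python
# Minimal DPO-with-LoRA sketch using TRL and peft.
# All hyperparameters, target modules, and column names below are
# illustrative assumptions, not this model's actual recipe.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("cyberagent/calm2-7b-chat", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b-chat")

# Preference pairs; assumes the dataset exposes prompt/chosen/rejected columns.
train_dataset = load_dataset(
    "cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental", split="train"
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # with peft_config set, TRL uses the frozen base weights as the reference
    beta=0.1,        # assumed DPO temperature
    args=TrainingArguments(
        output_dir="calm2-7b-chat-dpo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```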
## Experimental Results

### ELYZA-tasks-100 (GPT-4 eval)

Outputs were generated with greedy search to avoid randomness in the results (a minimal decoding sketch is included at the end of this card).

| calm2-7b-chat | calm2-7b-chat-dpo |
| ---- | ---- |
| 2.67 | 2.85 |

### Japanese MT-Bench

calm2-7b-chat-dpo and calm2-7b-chat were evaluated using the following sentence as the system prompt (system_message):

"以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
(Roughly: "Below is an instruction that describes a task, paired with an input that provides context. Write a response that appropriately satisfies the request.")

This is [the system prompt used when evaluating stabilityai/japanese-stablelm-instruct-alpha-7b](https://github.com/Stability-AI/FastChat/blob/dfb653d2cadd16017b66bbc3a25cf361031f2da3/fastchat/conversation.py#L364), taken verbatim.
All other decoding parameters were left at their defaults (so the outputs are subject to sampling randomness).

| | calm2-7b-chat | calm2-7b-chat-dpo |
| ---- | ---- | ---- |
| Average | 6.1 | 6.7 |
| extraction | 4.1 | 5.4 |
| humanities | 8.2 | 8.4 |
| reasoning | 3.9 | 4.3 |
| roleplay | 6.4 | 7.0 |
| stem | 6.3 | 6.2 |
| writing | 7.7 | 9.1 |

## Releases

1.0: v1 release (Jan 24, 2024)

## Author

Yuu Jinnai (jinnai_yu@cyberagent.co.jp), Standing on the shoulders of giants

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cyberagent__calm2-7b-chat-dpo-experimental)

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 44.03 |
| AI2 Reasoning Challenge (25-Shot) | 41.04 |
| HellaSwag (10-Shot)               | 68.99 |
| MMLU (5-Shot)                     | 39.82 |
| TruthfulQA (0-shot)               | 43.13 |
| Winogrande (5-shot)               | 65.67 |
| GSM8k (5-shot)                    |  5.53 |
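## Greedy Decoding Sketch

For reference, a minimal sketch of the greedy (deterministic) decoding used for the ELYZA-tasks-100 outputs above. The prompt and `max_new_tokens` are carried over from the usage example; they are illustrative, not the exact evaluation harness.

```python
# Greedy decoding sketch: do_sample=False removes sampling randomness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cyberagent/calm2-7b-chat-dpo-experimental", device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b-chat-dpo-experimental")

prompt = """USER: AIによって私達の暮らしはどのように変わりますか?
ASSISTANT: """

token_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(
    input_ids=token_ids.to(model.device),
    max_new_tokens=300,  # assumed; the card does not state the evaluation length limit
    do_sample=False,     # greedy search
)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(output_ids[0][token_ids.shape[1]:], skip_special_tokens=True))
```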