The model did not achieve the MT-Bench score as claimed, and the JGLUE score is low.

#3
by minhhien0811 - opened

I have tried rerunning MT-Bench in two ways:

  1. Using GPT-4 as the judge (the same method you used)
  2. Using GPT-4o as the judge

The results are shown in images 1 and 2. I noticed that your model scores significantly lower than the models from Elyza and Swallow.

Next, I evaluated your model on the JGLUE benchmark (JGLUE contains many tasks similar to those on the Nejumi leaderboard) and got quite poor results (image 3). Across 8 tasks it averaged only 63, which I have to say is a really bad result. I have evaluated many Japanese-capable models, such as Sakana, Rinna-llama3, Qwen2, GLM4, Elyza, and Swallow, all of which scored 70 or higher.

I want to raise a question regarding the actual quality of your model.

[Image 1: Japanese MT-Bench scores with GPT-4 as judge]

[Image 2: Japanese MT-Bench scores with GPT-4o as judge]

[Image 3: JGLUE results across 8 tasks]

You can follow the README.md at https://github.com/team-hatakeyama-phase2/llm-leaderboard

python scripts/run_jmtbench_eval.py

Put config.yaml at llm-leaderboard/configs as follows:

wandb:
  log: True
  entity: "weblab-geniac1" # the Tanuki team uses weblab-geniac1
  project: "leaderboard_neo" # use leaderboard_sft for SFT validation, leaderboard_test for testing
  run_name: '0809_dpo_07-gpt4' # set the wandb_name used during training, e.g. 04_hallucination-tanuki_8B_lora-with_hallucination

github_version: v2.0.0 #for recording

testmode: false

# if you don't use api, please set "api" as "false"
# if you use api, please select from "openai", "anthoropic", "google", "cohere"
api: false

model:
  use_wandb_artifacts: false
  artifacts_path: ""
  pretrained_model_name_or_path: '/storage5/personal/shioya/po_model/polab-experiments/8B/pass4_exp002-0809_dpo_07-zero2' # path where the trained model is saved
  trust_remote_code: true
  device_map: "auto"
  load_in_8bit: false
  load_in_4bit: false

generator:
  do_sample: false
  num_beams: 1 # https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/text_generation
  top_p: 1.0
  top_k: 0
  temperature: 0.1
  repetition_penalty: 1.0

tokenizer:
  use_wandb_artifacts: false
  artifacts_path: ""
  pretrained_model_name_or_path: "/storage5/personal/shioya/po_model/polab-experiments/8B/pass4_exp002-0809_dpo_07-zero2" # path where the trained model is saved
  use_fast: false

# for llm-jp-eval
max_seq_length: 2048
dataset_artifact: "wandb-japan/llm-leaderboard/jaster:v11" #if you use artifacts, please fill here (if not, fill null)
dataset_dir: "/jaster/1.2.6/evaluation/test"
target_dataset: "all" # {all, jamp, janli, jcommonsenseqa, jemhopqa, jnli, jsem, jsick, jsquad, jsts, niilc, chabsa}
log_dir: "./logs"
torch_dtype: "bf16" # {fp16, bf16, fp32}
custom_prompt_template: null
# if you use this, please include {instruction} and {input}. If you use few shots, please include {few_shots} additionally.
# example of prompt template with fewshots
# "以下はタスクを説明する指示と、追加の背景情報を提供する入力の組み合わせです。要求を適切に満たす回答を書いてください。\n### 指示:\n{instruction}\n{few_shots}\n### 入力:\n{input}\n### 回答:\n"
# example of prompt template without fewshots
# "以下はタスクを説明する指示と、追加の背景情報を提供する入力の組み合わせです。要求を適切に満たす回答を書いてください。\n### 指示:\n{instruction}\n### 入力:\n{input}\n### 回答:\n"
# example of Llama 2 format; please change tokenizer.bos_token
# "<tokenizer.bos_token>[INST] <<SYS>>\n あなたは誠実で優秀な日本人のアシスタントです。 \n<</SYS>>\n\n {instruction} \n\n {input} [/INST]"

custom_fewshots_template: null
# Please include {input} and {output} as variables
# example of fewshots template
# "\n### 入力:\n{input}\n### 回答:\n{output}"

metainfo:
  basemodel_name: "0809_dpo_07-gpt4" # set the wandb_name used during training, e.g. 04_hallucination-tanuki_8B_lora-with_hallucination
  model_type: "open llm" # {open llm, commercial api}
  instruction_tuning_method: "None" # {"None", "Full", "LoRA", ...}
  instruction_tuning_data: ["None"] # {"None", "jaster", "dolly_ja", "oasst_ja", ...}
  num_few_shots: 0
  llm-jp-eval-version: "1.1.0"

# for mtbench
mtbench:
  question_artifacts_path: 'wandb-japan/llm-leaderboard/mtbench_ja_question:v0' # if testmode is true, small dataset will be used
  referenceanswer_artifacts_path: 'wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer:v0' # if testmode is true, small dataset will be used
  judge_prompt_artifacts_path: 'wandb-japan/llm-leaderboard/mtbench_ja_prompt:v1'
  bench_name: 'japanese_mt_bench'
  model_id: null # cannot use '<', '>', ':', '"', '/', '\\', '|', '?', '*', '.'
  question_begin: null
  question_end: null
  max_new_token: 1024
  num_choices: 1
  num_gpus_per_model: 1
  num_gpus_total: 1
  max_gpu_memory: null
  dtype: bfloat16 # None or float32 or float16 or bfloat16
  # for gen_judgment
  judge_model: 'gpt-4'
  mode: 'single'
  baseline_model: null
  parallel: 1
  first_n: null
  # for conv template # added
  custom_conv_template: true
  # the following variables will be used when custom_conv_template is set as true
  conv_name: "custom"
  conv_system_template: "<s>{system_message}"
  conv_system_message: "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
  conv_roles: "('指示', '応答')"
  conv_sep: "\n\n### "
  conv_sep2: "</s>"
  conv_stop_token_ids: "[2,6]"
  conv_stop_str: "### 指示:"
  conv_sep_style: "custom"
  conv_role_message_separator: ":\n"
  conv_role_only_separator: ":\n"
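
As a quick sanity check before launching the run, you can confirm that the YAML parses and the paths resolve. This is just a minimal sketch, assuming PyYAML is installed and the file above is saved as configs/config.yaml:

# Quick sanity check of the config before running scripts/run_jmtbench_eval.py
import os
import yaml  # PyYAML

with open("configs/config.yaml") as f:
    cfg = yaml.safe_load(f)

model_path = cfg["model"]["pretrained_model_name_or_path"]
print("model path:", model_path, "exists:", os.path.exists(model_path))
print("judge model:", cfg["mtbench"]["judge_model"])
print("custom conv template:", cfg["mtbench"]["custom_conv_template"])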

I hope this helps.

weblab-GENIAC org

Regarding the official evaluation of the Japanese MT-Bench, we need to use Leaderboard 3.
https://github.com/wandb/llm-leaderboard

This is because the program versions before Leaderboard Neo are somewhat outdated and do not support the chat template of the model we've built. If you intend to use Neo, you will need to make slight modifications to the program code.
https://github.com/team-hatakeyama-phase2/llm-leaderboard

Specifically, our model is trained with a chat template that appends an end-of-sequence (EOS) token at the end of each turn in a multi-turn conversation. However, prior to Leaderboard Neo, the harness could not append the EOS token at that point, which we have observed can lower the score by about one point.
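
As an aside, the difference is easy to see if you render the prompt by hand. The sketch below is illustrative only (the authoritative template lives in the tokenizer config); the role names and separators mirror the conv_* settings in the Neo config above, and the helper and flag names are made up for the example:

# Illustrative sketch of the per-turn EOS difference; not the authoritative template.
SYSTEM = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"

def build_prompt(turns, append_eos_per_turn=True):
    """turns: list of (user_message, assistant_message_or_None) pairs."""
    prompt = "<s>" + SYSTEM                        # conv_system_template
    for user, assistant in turns:
        prompt += "\n\n### 指示:\n" + user          # conv_sep + role + separator
        prompt += "\n\n### 応答:\n"
        if assistant is not None:
            prompt += assistant
            if append_eos_per_turn:
                prompt += "</s>"                   # conv_sep2: EOS closing each assistant turn
    return prompt

turns = [("日本の首都はどこですか?", "東京です。"), ("その人口はどのくらいですか?", None)]
print(build_prompt(turns, append_eos_per_turn=False))  # pre-Neo style, no per-turn EOS
print(build_prompt(turns, append_eos_per_turn=True))   # with the per-turn EOS described above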

Additionally, when using GPT-4 instead of GPT-4o for the evaluation in Neo, there is a bug in the scoring where, in some cases (around 2-3 out of 160 questions), no score is output. In such cases the system defaults to a score of -1, unfairly lowering the overall evaluation. As stated in the README, we handled this by calculating the average score while excluding the questions for which a score of -1 was returned.
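
In other words, the workaround is just a mean over the valid judgments. A minimal sketch (the score list here is hypothetical; real scores come from the judge output files):

# Average MT-Bench scores while dropping failed judgments (recorded as -1).
def mtbench_average(scores):
    valid = [s for s in scores if s != -1]
    return sum(valid) / len(valid) if valid else float("nan")

scores = [7, 8, -1, 9, 6, -1, 8]    # hypothetical judge outputs for a handful of questions
print(mtbench_average(scores))       # averages over the five valid scores only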

The relatively lower score on JGLUE compared to other similar models is, in a sense, intentional. Benchmarks like JGLUE and jaster mostly consist of tasks that test the ability to give short, concise answers, such as multiple-choice questions, which we believe are not commonly required in real chatbot applications. Tanuki was therefore developed specifically to excel at the dialogue and composition tasks required by MT-Bench.

We have written an article on this background in Japanese, so you might find it useful.
https://zenn.dev/matsuolab/articles/95fa297ef12a14


Leaderboard 3 results
8B
https://wandb.ai/weblab-geniac1/llm-leaderboard3/reports/8b-nejumi-leaderboard3-all--Vmlldzo5Mjk2MTQz?accessToken=22frkj9myy7xugl8u6j4g39v4l1tsldydghnt7w1ieq2fdx5q6aymvqobrqjeu6v

8x8B
https://wandb.ai/weblab-geniac1/llm-leaderboard3/reports/8x8b-nejumi-leaderboard3-all--Vmlldzo5Mjk2MTM5?accessToken=d0a9kih7n7dy3ozg2k37ssabetqg8t90d655ptddekhxjjkdn9xnnzdp36eb0qho

kanhatakeyama changed discussion status to closed
kanhatakeyama changed discussion status to open
weblab-GENIAC org

Example config for the 8B model with Leaderboard 3:

wandb:
  run_name: "Tanuki-8B-dpo-v1.0" # use run_name defined above

# if you don't use api, please set "api" as "false"
# if you use api, please select from "openai", "anthoropic", "google", "cohere", "vllm"
api: vllm
batch_size: 256 # 256 recommended for vllm, 32 for api

#test mode
testmode: false
run:
  jaster: true
  jmmlu_robustness: true # if this is set to true, jaster must also be set to true
  mtbench: true
  jbbq: true
  lctg: true
  toxicity: true
  jtruthfulqa: true
  aggregate: true
num_gpus: 8
model:
  use_wandb_artifacts: false
  pretrained_model_name_or_path: "weblab-GENIAC/Tanuki-8B-dpo-v1.0"
  chat_template: "weblab-GENIAC/Tanuki-8B-dpo-v1.0"
#  size_category: "<10B"
  size_category: "50B≤"
  release_date: "8/12/2024"
  num_gpus_per_model: 8
  num_gpus_total: 8
  tensor_parallel_size: 8
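
If you want to see the exact chat template that the chat_template entry above resolves to, you can render it from the tokenizer. A minimal sketch, assuming transformers is installed and the Hub tokenizer ships a chat template:

# Inspect the chat template used for "weblab-GENIAC/Tanuki-8B-dpo-v1.0"
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("weblab-GENIAC/Tanuki-8B-dpo-v1.0")
messages = [
    {"role": "user", "content": "日本の首都はどこですか?"},
    {"role": "assistant", "content": "東京です。"},
    {"role": "user", "content": "その人口はどのくらいですか?"},
]
# Render the conversation as a prompt string, roughly how a chat-template-aware harness would.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# The raw Jinja template itself (None if the tokenizer does not define one):
print(tok.chat_template)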

Thank you for your answer. Based on the config you ran, I understand that you are using GPT-4 as the judge model and that the reference answers are from https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/data/japanese_mt_bench/reference_answer/base-gpt4o-with-human-annotation.jsonl.

I used the code from https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge to run the same config as yours, and the results are shown in the image.

Could you explain why there is such a significant discrepancy in the results?

[Images: Japanese MT-Bench results from the rerun with the same config]

The reason for the discrepancy is primarily the difference in chat templates.

The Stability-AI/FastChat repo is outdated and deprecated for this model because it does not support its chat template.
Please use the repositories linked in the previous responses.

In general, the performance of open models such as Mistral and Llama is sensitive to the chat template, while proprietary LLMs like GPT-4, Claude, and Gemini are robust to varied chat templates and input formats.

It's true that Stability AI's repo is outdated, but I have fixed it and updated the chat templates. The fact is that in my evaluation, your model, Swallow-8B, and Elyza-8B all use the same chat template.

If possible, please provide me with the chat template you use to evaluate the model. I want to reproduce the results.

Thank you very much.

I am afraid that your updates to Stability AI's repo might differ from the maintainers' version.

Could you use the repo at https://github.com/team-hatakeyama-phase2/llm-leaderboard for Neo and
https://github.com/wandb/llm-leaderboard for Leaderboard 3?
