JinaJudge: Proxy Judgement for Russian LLM Arena

Description

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the Russian LLM Arena, enabling faster and more cost-effective evaluation of language models. While its focus is Russian LLM evaluation, it can also be used for English-centric models.


Model Details

This is a small upgrade to the kaleinaNyan/jina-v3-rullmarena-judge model:

  • Number of decoder blocks increased from 4 to 5.
  • Hidden-activation dimensionality reduced from 1024 to 512 (via a projection layer after the JINA encoder).
  • The resulting model size dropped from 614M to 589M parameters.
  • I also tweaked some training hyperparameters, but the training data composition is unchanged.

Surprisingly, these changes yielded a tangible performance improvement, so I decided to upload the model. As evaluation on the training set showed, the previous model was simply not expressive enough.
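To make the list above concrete, here is a minimal PyTorch sketch of an architecture consistent with it. Only the stated sizes (the 1024-to-512 projection, 5 decoder blocks, 3 output classes) come from the description; the head count, the learned query token, and the exact wiring of the decoder blocks are my assumptions, not the actual implementation.

import torch
import torch.nn as nn

class JinaJudgeSketch(nn.Module):
    """Hypothetical reconstruction of the judge head on top of the JINA encoder."""

    def __init__(self, enc_dim=1024, hidden=512, n_blocks=5, n_classes=3):
        super().__init__()
        self.proj = nn.Linear(enc_dim, hidden)  # 1024 -> 512 projection after the encoder
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_blocks)  # 5 decoder blocks
        self.query = nn.Parameter(torch.zeros(1, 1, hidden))  # learned query token (assumption)
        self.head = nn.Linear(hidden, n_classes)  # logits over {A wins, tie, B wins}

    def forward(self, encoder_states):
        # encoder_states: (batch, seq_len, enc_dim) hidden states from the JINA encoder
        memory = self.proj(encoder_states)
        query = self.query.expand(memory.size(0), -1, -1)
        out = self.decoder(query, memory)  # cross-attend the query to the projected sequence
        return self.head(out[:, 0])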


Evaluation

Validation was based on judgements that were already available from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.
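For illustration, that simplification could look like the following, assuming arena-style verdict labels; the exact label set and filtering rules are my assumptions.

# Hypothetical mapping of arena verdicts onto the three training classes.
THREE_CLASS = {
    "A>>B": 0, "A>B": 0,  # A is better
    "A=B": 1,             # tie
    "B>A": 2, "B>>A": 2,  # B is better
}

def simplify(verdict):
    """Map a verdict to a class index; None means the judgement is filtered out."""
    return THREE_CLASS.get(verdict)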

NOTE: values in parentheses show the improvement, in percentage points, over the previous model. A sketch of how such metrics can be computed follows the validation numbers below.

Models evaluated:

  • gemma-2-9b-it-sppo-iter3
  • glm-4-9b-chat
  • gpt-3.5-turbo-1106
  • mistral-7b-instruct-v0.3
  • storm-7b

Validation Performance:

  • Accuracy: 80.76% (+2.67)
  • Precision: 78.56% (+2.74)
  • Recall: 79.48% (+2.71)
  • F1-score: 79.00% (+2.73)
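These figures can be reproduced with scikit-learn along the following lines; that the reported precision/recall/F1 are macro-averaged over the three classes is my assumption.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 0, 2]  # reference (GPT-4) judgements, illustrative values
y_pred = [0, 1, 1, 0, 2]  # JinaJudge predictions, illustrative values

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"  # averaging mode is an assumption
)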

For the test phase, new judgements for the kolibri-mistral-0427-upd model were generated using GPT-4.

Test Performance:

  • Accuracy: 82.72% (+2.64)
  • Precision: 80.11% (+3.43)
  • Recall: 82.42% (+4.69)
  • F1-score: 81.18% (+4.10)

Usage Example

from transformers import AutoModel

# Load the judge; trust_remote_code is required for its custom architecture.
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-300924", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model returns one logit triple per input; .item() converts the argmax tensor to an int.
judgement = jina([example])[0].argmax().item()

judgement_map = {
  0: "A is better than B",
  1: "A == B",
  2: "B is better than A"
}

print(judgement_map[judgement])
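Since the forward pass takes a list of strings, several pairs can be scored in one call. A small sketch, assuming the model returns one logit triple per input as in the single-example call above:

# Score several (prompt, answer_a, answer_b) triples in one forward pass.
pairs = [
    ("prompt 1", "answer a1", "answer b1"),
    ("prompt 2", "answer a2", "answer b2"),
]
examples = [
    prompt_template.format(user_prompt=p, assistant_a=a, assistant_b=b)
    for p, a, b in pairs
]
verdicts = [judgement_map[logits.argmax().item()] for logits in jina(examples)]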

Generated ranking

The ranking was obtained using a modified version of the Russian LLM Arena code. All judgements were regenerated using the JinaJudge model.
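For reference, here is a simplified sketch of how a score with a 95% CI could be derived from pairwise judgements: a bootstrapped win rate against the gpt-3.5-turbo-0125 baseline. The actual arena code uses its own procedure; treating that baseline as the 50.0 anchor is inferred from its zero-width CI in the table below.

import random

def score(outcomes):
    """outcomes: 1.0 per win, 0.5 per tie, 0.0 per loss vs. the baseline."""
    return 100 * sum(outcomes) / len(outcomes)

def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05):
    """Point score plus (lower, upper) offsets of a bootstrap 95% CI."""
    stats = sorted(
        score(random.choices(outcomes, k=len(outcomes))) for _ in range(n_boot)
    )
    point = score(outcomes)
    lower = stats[int(n_boot * alpha / 2)] - point
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1] - point
    return point, lower, upper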

| Model | Score | 95% CI | Average #Tokens |
|---|---|---|---|
| gpt-4-1106-preview | 81.6 | (-2.3, 3.0) | 541 |
| gpt-4o-mini | 76.0 | (-2.7, 2.4) | 448 |
| qwen-2.5-72b-it | 72.5 | (-3.6, 3.6) | 557 |
| gemma-2-9b-it-sppo-iter3 | 72.1 | (-3.7, 3.6) | 569 |
| gemma-2-27b-it | 71.1 | (-3.3, 3.2) | 482 |
| gemma-2-9b-it | 70.8 | (-3.4, 3.5) | 569 |
| t-lite-instruct-0.1 | 68.3 | (-3.8, 4.5) | 810 |
| suzume-llama-3-8b-multilingual-orpo | 62.9 | (-3.9, 4.0) | 682 |
| glm-4-9b-chat | 60.5 | (-3.9, 4.0) | 516 |
| sfr-iterative-dpo-llama-3-8b-r | 59.9 | (-4.0, 4.3) | 682 |
| c4ai-command-r-v01 | 56.9 | (-4.2, 3.8) | 516 |
| phi-3-medium-4k-instruct | 56.4 | (-2.8, 3.3) | 566 |
| mistral-nemo-instruct-2407 | 56.1 | (-2.9, 3.4) | 682 |
| yandex_gpt_pro | 51.7 | (-3.4, 3.4) | 345 |
| suzume-llama-3-8b-multilingual | 51.3 | (-3.4, 4.0) | 489 |
| hermes-2-theta-llama-3-8b | 50.9 | (-3.2, 3.4) | 485 |
| starling-lm-7b-beta | 50.2 | (-3.3, 3.4) | 495 |
| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
| llama-3-instruct-8b-sppo-iter3 | 49.8 | (-3.4, 4.0) | 763 |
| llama-3-8b-saiga-suzume-ties | 48.2 | (-4.1, 3.9) | 569 |
| llama-3-smaug-8b | 46.6 | (-3.9, 3.8) | 763 |
| vikhr-it-5.4-fp16-orpo-v2 | 46.6 | (-3.7, 4.0) | 379 |
| aya-23-8b | 46.3 | (-3.8, 3.9) | 571 |
| saiga-llama3-8b_v6 | 45.5 | (-3.8, 3.9) | 471 |
| vikhr-it-5.2-fp16-cp | 43.8 | (-3.9, 4.0) | 543 |
| qwen2-7b-instruct | 43.7 | (-2.5, 2.7) | 492 |
| openchat-3.5-0106 | 43.4 | (-3.3, 3.7) | 485 |
| gpt-3.5-turbo-1106 | 41.7 | (-2.9, 3.5) | 220 |
| kolibri-mistral-0427-upd | 41.5 | (-3.2, 3.5) | 551 |
| paralex-llama-3-8b-sft | 40.6 | (-3.8, 3.3) | 688 |
| mistral-7b-instruct-v0.3 | 40.3 | (-3.3, 3.4) | 469 |
| llama-3-instruct-8b-simpo | 40.2 | (-2.9, 3.7) | 551 |
| gigachat_pro | 40.2 | (-3.2, 3.5) | 294 |
| hermes-2-pro-llama-3-8b | 39.5 | (-2.9, 3.4) | 689 |
| vikhr-it-5.3-fp16-32k | 39.5 | (-2.8, 3.2) | 519 |
| openchat-3.6-8b-20240522 | 37.7 | (-3.3, 3.7) | 409 |
| meta-llama-3-8b-instruct | 37.5 | (-3.1, 3.5) | 450 |
| kolibri-vikhr-mistral-0427 | 37.1 | (-3.1, 3.8) | 488 |
| neural-chat-v3.3 | 36.5 | (-2.7, 3.6) | 523 |
| vikhr-it-5.1-fp16 | 36.4 | (-3.5, 3.5) | 448 |
| gigachat-lite | 36.0 | (-2.8, 3.0) | 523 |
| saiga-7b | 25.9 | (-3.1, 3.7) | 927 |
| storm-7b | 25.1 | (-3.6, 4.1) | 419 |
| snorkel-mistral-pairrm-dpo | 16.5 | (-3.8, 3.2) | 773 |