license: apache-2.0

Themis

Paper: https://arxiv.org/abs/2406.18365

GitHub: https://github.com/PKU-ONELab/Themis

Introduction

We propose Themis, an 8B-parameter large language model (LLM) designed and trained specifically for NLG evaluation, with more comprehensive capabilities than existing evaluators.

Themis can evaluate a variety of NLG tasks, including less common ones such as question-answering evaluation (Versatility), and it does so without requiring references (Independence). It also supports specific, customized evaluation aspects and criteria, ranging from overall quality to more fine-grained aspects (Flexibility), and each evaluation includes an analysis and explanation alongside the rating (Interpretability).

We believe that an ideal evaluator should be convenient to use and possess these characteristics. The comparison between related methods and Themis is shown in the table below.

Method	Versatility	Independence	Flexibility	Interpretability	Open-source
UniEval	❌	❌	✔️	❌	✔️
G-Eval	✔️	✔️	✔️	✔️	❌
X-Eval	✔️	❌	✔️	❌	❌
Prometheus	✔️	❌	✔️	✔️	✔️
Auto-J	✔️	✔️	❌	✔️	✔️
InstructScore	✔️	❌	❌	✔️	✔️
TIGERScore	✔️	✔️	❌	✔️	✔️
Themis (Ours)	✔️	✔️	✔️	✔️	✔️

Performance

We conduct experiments on several common NLG evaluation tasks and datasets to compare Themis with other methods: SummEval for summarization, Topical-Chat for dialogue response generation, SFRES & SFHOT for data-to-text, QAGS for factuality, MANS for story generation, and WMT23 zh-en for machine translation. The results show that Themis achieves the best overall evaluation performance among the compared methods, with an average Spearman correlation of 0.542 against 0.521 for GPT-4 Turbo.

Method	SummEval	Topical-Chat	SFRES & SFHOT	QAGS	MANS	WMT23	Average Spearman
BLEU	0.075	0.388	0.024	-	0.032	0.021	-
ROUGE	0.152	0.412	0.101	-	-0.002	0.151	-
BARTScore	0.329	0.086	0.208	0.425	0.350	0.118	0.253
BERTScore	0.231	0.394	0.139	-	0.285	0.219	-
BLEURT	0.152	0.388	0.244	-	0.138	0.263	-
CometKiwi	0.228	0.340	0.251	0.094	0.251	0.343	0.251
UniEval	0.474	0.577	0.282	-	-	-	-
G-Eval (GPT-3.5)	0.409	0.585	-	0.461	-	-	-
G-Eval (GPT-4)	0.523	0.588	-	0.611	-	-	-
GPT-3.5 Turbo	0.416	0.578	0.306	0.431	0.328	0.347	0.401
GPT-4 Turbo	0.511	0.746	0.320	0.637	0.473	0.437	0.521
X-Eval	0.480	0.605	0.303	0.578	-	-	-
Prometheus-13B	0.163	0.434	0.173	-	0.007	0.129	-
Auto-J-13B	0.198	0.425	0.141	0.226	0.380	0.104	0.246
TIGERScore-13B	0.384	0.346	0.200	0.504	0.231	0.248	0.319
InstructScore-7B	0.258	0.241	0.247	-	0.298	0.219	-
Themis-8B (ours)	0.553	0.725	0.333	0.684	0.551	0.405	0.542
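
For readers who want to reproduce this kind of number, the sketch below shows how a Spearman correlation between human ratings and an evaluator's ratings can be computed, and how the last column can be checked against the per-task scores. It assumes that each per-task figure is a Spearman correlation (as the "Average Spearman" header suggests) and that the average is a plain mean over the six tasks. The `human` and `model` ratings are made-up illustrative values, not data from the paper, and the helper names are hypothetical rather than part of the Themis codebase.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ratings for five generated outputs (illustrative only).
human = [1, 2, 3, 4, 5]
model = [2, 1, 4, 3, 5]
rho = spearman(human, model)

# Per-task scores for Themis-8B, copied from the table above.
themis = [0.553, 0.725, 0.333, 0.684, 0.551, 0.405]
themis_average = round(mean(themis), 3)

print(f"Spearman on the toy ratings: {rho:.3f}")
print(f"Mean of the Themis-8B row: {themis_average}")
```

Under these assumptions the mean of the Themis-8B row matches the reported 0.542. Note that `spearman` divides by zero when either input is constant, so a real evaluation script would need to handle that case.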

We also conduct more in-depth analyses, including generalization tests on unseen tasks such as instruction-following evaluation and aspect-targeted perturbation tests, in which Themis again exhibits superior evaluation performance. For more experimental results and details, please refer to our paper.

Requirements and Usage

Please refer to our GitHub repo for more details.

Citation

@article{hu2024themis,
  title={Themis: Towards Flexible and Interpretable NLG Evaluation},
  author={Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun},
  journal={arXiv preprint arXiv:2406.18365},
  year={2024}
}