license: apache-2.0
Themis
Paper: https://arxiv.org/abs/2406.18365
GitHub: https://github.com/PKU-ONELab/Themis
Introduction
We propose Themis, an 8B-parameter large language model (LLM) specifically designed and trained for natural language generation (NLG) evaluation, with more comprehensive capabilities than prior evaluators.
Themis can evaluate a variety of NLG tasks, including less common ones such as question-answering evaluation (Versatility), and it does so without requiring reference texts (Independence). It also supports specific, customized evaluation aspects and criteria, ranging from overall quality to fine-grained aspects (Flexibility), and each evaluation includes an analysis and explanation alongside the rating (Interpretability).
We believe an ideal evaluator should be convenient to use and possess all of these characteristics. The table below compares Themis with related methods.
Method | Versatility | Independence | Flexibility | Interpretability | Open-source |
---|---|---|---|---|---|
UniEval | ❌ | ❌ | ✔️ | ❌ | ✔️ |
G-Eval | ✔️ | ✔️ | ✔️ | ✔️ | ❌ |
X-Eval | ✔️ | ❌ | ✔️ | ❌ | ❌ |
Prometheus | ✔️ | ❌ | ✔️ | ✔️ | ✔️ |
Auto-J | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
InstructScore | ✔️ | ❌ | ❌ | ✔️ | ✔️ |
TIGERScore | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
Themis (Ours) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Performance
We conduct experiments on several common NLG evaluation tasks and datasets to compare Themis with other methods: SummEval for summarization, Topical-Chat for dialogue response generation, SFHOT & SFRES for data-to-text, QAGS for factuality, MANS for story generation, and WMT23 zh-en for machine translation. All scores are Spearman correlations with human judgments. Themis achieves the highest average correlation (0.542) among the compared evaluation models, ahead of GPT-4 Turbo (0.521), although GPT-4 Turbo scores higher on Topical-Chat and WMT23.
Method | SummEval | Topical-Chat | SFHOT & SFRES | QAGS | MANS | WMT23 | Average Spearman |
---|---|---|---|---|---|---|---|
BLEU | 0.075 | 0.388 | 0.024 | - | 0.032 | 0.021 | - |
ROUGE | 0.152 | 0.412 | 0.101 | - | -0.002 | 0.151 | - |
BARTScore | 0.329 | 0.086 | 0.208 | 0.425 | 0.350 | 0.118 | 0.253 |
BERTScore | 0.231 | 0.394 | 0.139 | - | 0.285 | 0.219 | - |
BLEURT | 0.152 | 0.388 | 0.244 | - | 0.138 | 0.263 | - |
CometKiwi | 0.228 | 0.340 | 0.251 | 0.094 | 0.251 | 0.343 | 0.251 |
UniEval | 0.474 | 0.577 | 0.282 | - | - | - | - |
G-Eval (GPT-3.5) | 0.409 | 0.585 | - | 0.461 | - | - | - |
G-Eval (GPT-4) | 0.523 | 0.588 | - | 0.611 | - | - | - |
GPT-3.5 Turbo | 0.416 | 0.578 | 0.306 | 0.431 | 0.328 | 0.347 | 0.401 |
GPT-4 Turbo | 0.511 | 0.746 | 0.320 | 0.637 | 0.473 | 0.437 | 0.521 |
X-Eval | 0.480 | 0.605 | 0.303 | 0.578 | - | - | - |
Prometheus-13B | 0.163 | 0.434 | 0.173 | - | 0.007 | 0.129 | - |
Auto-J-13B | 0.198 | 0.425 | 0.141 | 0.226 | 0.380 | 0.104 | 0.246 |
TIGERScore-13B | 0.384 | 0.346 | 0.200 | 0.504 | 0.231 | 0.248 | 0.319 |
InstructScore-7B | 0.258 | 0.241 | 0.247 | - | 0.298 | 0.219 | - |
Themis-8B (ours) | 0.553 | 0.725 | 0.333 | 0.684 | 0.551 | 0.405 | 0.542 |
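
For readers unfamiliar with the metric, the following is a minimal sketch of how a Spearman correlation between human and model ratings can be computed, and how the "Average Spearman" column relates to the per-task columns. The `human` and `model` ratings are hypothetical illustration data, not taken from any benchmark, and the helper names `average_ranks` and `spearman` are our own. The exact aggregation protocol used for each dataset is described in the paper; this sketch only shows the basic definition. The reported averages in the table are consistent with an unweighted mean over the six task columns.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ratings for five generated outputs (illustration only).
human = [1, 2, 3, 4, 5]
model = [2, 1, 4, 3, 5]
rho = spearman(human, model)

# Per-task scores copied from the Themis-8B and GPT-4 Turbo rows of the table.
themis = {"SummEval": 0.553, "Topical-Chat": 0.725, "SFHOT & SFRES": 0.333,
          "QAGS": 0.684, "MANS": 0.551, "WMT23": 0.405}
gpt4_turbo = {"SummEval": 0.511, "Topical-Chat": 0.746, "SFHOT & SFRES": 0.320,
              "QAGS": 0.637, "MANS": 0.473, "WMT23": 0.437}

themis_avg = round(mean(themis.values()), 3)
gpt4_turbo_avg = round(mean(gpt4_turbo.values()), 3)
print(rho, themis_avg, gpt4_turbo_avg)
```

Rows with a missing task score ("-") have no average in the table, which is why only methods evaluated on all six tasks can be compared on the last column.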
We also conduct more in-depth analyses, including generalization tests on unseen tasks such as instruction-following evaluation and aspect-targeted perturbation tests, in which Themis likewise exhibits superior evaluation performance. For more experimental results and details, please refer to our paper.
Requirements and Usage
Please refer to our GitHub repository for requirements and usage details.
Citation
@article{hu2024themis,
title={Themis: Towards Flexible and Interpretable NLG Evaluation},
author={Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun},
journal={arXiv preprint arXiv:2406.18365},
year={2024}
}