---
license: llama3
---

[Original Repo](https://github.com/raunak-agarwal/factual-consistency-eval)

[Paper](https://arxiv.org/abs/2408.04114)

Inference can be run with [vLLM](https://github.com/vllm-project/vllm); a minimal sketch is shown below.
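
The following is an illustrative, untested sketch of batch inference with vLLM. The model path, prompt template, and decoding settings are assumptions, not the exact configuration from the paper; consult the original repo for the prompt format used during fine-tuning.

```python
# Minimal vLLM inference sketch (assumes `pip install vllm`).
# "path/to/this-model" is a placeholder -- point it at this model's
# Hugging Face repo id or a local checkpoint directory.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/this-model")
sampling_params = SamplingParams(temperature=0.0, max_tokens=8)

# Illustrative prompt only; the fine-tuning prompt template may differ.
prompt = (
    "Determine whether the claim is factually consistent with the document.\n\n"
    "Document: The Eiffel Tower is located in Paris.\n"
    "Claim: The Eiffel Tower is in Berlin.\n"
    "Answer (Yes or No):"
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text.strip())
```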

Data:
- [Training Data](https://huggingface.co/datasets/ragarwal/factual-consistency-training-mix)
- [Evaluation Benchmark](https://huggingface.co/datasets/ragarwal/factual-consistency-evaluation-benchmark)
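
Both datasets can be loaded with the `datasets` library, as in the sketch below. The call uses only the repo ids linked above; any split or column names are assumptions, so check each dataset card for the actual structure.

```python
# Sketch: load the training mix and the evaluation benchmark
# (assumes `pip install datasets`).
from datasets import load_dataset

train_mix = load_dataset("ragarwal/factual-consistency-training-mix")
eval_bench = load_dataset("ragarwal/factual-consistency-evaluation-benchmark")

# Inspect the available splits and features.
print(train_mix)
print(eval_bench)
```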


### Results
| Method                  | Rank | Mean Win Rate (%) | Average AUC |
|-------------------------|------|-------------------|-------------|
| **Llama-3-8B (FT) (Ours)** | **1** | **78.11** | **78.037** |
| **Flan-T5-L (FT) (Ours)**  | **2** | **76.43** | **78.663** |
| MiniCheck-T5-L          | 3    | 72.39              | 76.674      |
| gpt-3.5-turbo           | 4    | 69.36              | 77.007      |
| Flan-T5-B (FT) (Ours)   | 5    | 66.00              | 76.126      |
| AlignScore-L            | 6    | 53.19              | 73.074      |
| Llama-3-8B              | 7    | 53.20              | 75.085      |
| AlignScore-B            | 8    | 39.39              | 71.319      |
| QuestEval               | 9    | 37.37              | 66.089      |
| BARTScore               | 10   | 26.94              | 62.637      |
| BERTScore               | 11   | 20.88              | 61.263      |
| ROUGE-L                 | 12   | 6.73               | 54.678      |

*Comparison of different factuality evaluation methods across all test datasets. The methods are ranked based on the Mean Win Rate, which measures overall performance on factuality tasks. The Average AUC column represents the average of all individual AUC-ROC scores.*



Cite this work as follows:
```
@misc{agarwal2024zeroshotfactualconsistencyevaluation,
      title={Zero-shot Factual Consistency Evaluation Across Domains}, 
      author={Raunak Agarwal},
      year={2024},
      eprint={2408.04114},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.04114}, 
}
```