---
language:
- en
pipeline_tag: text-classification
---
# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on RoBERTa-Large and predicts a binary label: 1 for supported and 0 for unsupported.
The model makes predictions at the *sentence level*. It takes a document and a sentence as input and determines
whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**

MiniCheck-RoBERTa-Large is fine-tuned from the trained RoBERTa-Large model from AlignScore ([Zha et al., 2023](https://aclanthology.org/2023.acl-long.634.pdf))
on 14K synthetic data generated from scratch in a structured way (more details in the paper).

### Model Variants
We also have two other MiniCheck model variants:
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large)

### Model Performance
The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
built from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-RoBERTa-Large outperforms all
existing specialized fact-checkers of a similar scale by a large margin, but is 2% worse than our best model, MiniCheck-Flan-T5-Large. See the full results in our work.

Note: We only evaluated the performance of our models on real claims -- without any human intervention in
any form, such as injecting certain error types into model-generated claims. Such edited claims do not reflect
LLMs' actual behaviors.

# Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and install the necessary packages from `requirements.txt`.
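
For example:

```
git clone https://github.com/Liyan06/MiniCheck
cd MiniCheck
pip install -r requirements.txt
```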

### A simple use case

```python
from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
scorer = MiniCheck(model_name='roberta-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label) # [1, 0]
print(raw_prob)   # [0.9581979513168335, 0.031335990875959396]
```
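
Since the model checks claims one sentence at a time, a longer multi-sentence claim can be split into sentences and each sentence scored against the same document. Below is a minimal sketch of that pattern; the `nltk` sentence splitter and the "every sentence must be supported" aggregation are illustrative assumptions, not part of the MiniCheck package:

```python
import nltk
from minicheck.minicheck import MiniCheck

nltk.download('punkt')  # sentence tokenizer models (one-time download)

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim = "The students are preparing for an examination. They are meeting in the library."

# Split the claim into sentences and score each one against the same document.
sentences = nltk.sent_tokenize(claim)
scorer = MiniCheck(model_name='roberta-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc] * len(sentences), claims=sentences)

# One possible aggregation: treat the claim as supported only if every sentence is supported.
claim_supported = all(label == 1 for label in pred_label)
print(list(zip(sentences, pred_label)), claim_supported)
```

Requiring every sentence to be supported is a conservative aggregation; averaging or thresholding the per-sentence `raw_prob` values is another option.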

### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

# load 13K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='roberta-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~ 15 mins, depending on hardware
```

To evaluate the results on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```
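
As a follow-up, the pooled balanced accuracy over the full test set can also be computed, and individual disagreements inspected (a convenience on top of the per-dataset table above, using the same `df` and `balanced_accuracy_score`):

```python
# Balanced accuracy pooled over all test examples (the table above macro-averages per-dataset scores)
print(round(balanced_accuracy_score(df.label, df.preds) * 100, 1))

# Examples where the model's prediction disagrees with the human label
errors = df[df.label != df.preds][['dataset', 'doc', 'claim', 'label']]
print(errors.head())
```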

# Citation

```
@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```