---
language:
- en
pipeline_tag: text-classification
---
# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on RoBERTa-Large and predicts a binary label: 1 for supported and 0 for unsupported.
The model makes predictions at the *sentence level*: it takes a document and a sentence as input and determines
whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**
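
Because the interface is per-sentence, a longer model response can be checked by splitting it into sentences and scoring each one against the document. Below is a minimal sketch of this (not part of the released code), using the scoring API from the usage demo further down; the NLTK sentence splitter and the "all sentences must be supported" aggregation rule are illustrative assumptions.

```python
# Sketch: sentence-level decomposition of a multi-sentence response.
# The splitter (nltk) and the aggregation rule below are assumptions
# for illustration, not part of MiniCheck itself.
import nltk
from nltk.tokenize import sent_tokenize
from minicheck.minicheck import MiniCheck

nltk.download('punkt')  # tokenizer models for sent_tokenize

doc = "A group of students gather in the school library to study for their upcoming final exams."
response = "The students are preparing for an examination. They are meeting in the library."

sentences = sent_tokenize(response)  # one claim per sentence
scorer = MiniCheck(model_name='roberta-large', device='cuda:0', cache_dir='./ckpts')
pred_labels, _, _, _ = scorer.score(docs=[doc] * len(sentences), claims=sentences)

# Illustrative aggregation: the response counts as supported only if
# every sentence is individually supported by the document.
response_supported = all(label == 1 for label in pred_labels)
```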

MiniCheck-RoBERTa-Large is fine-tuned from the trained RoBERTa-Large model from AlignScore ([Zha et al., 2023](https://aclanthology.org/2023.acl-long.634.pdf))
on 14K synthetic examples generated from scratch in a structured way (more details in the paper).


### Model Variants
We also have two other MiniCheck model variants:
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large)
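
All three variants expose the same scoring interface; judging from the usage demo below, switching between them should only require changing the `model_name` argument passed to `MiniCheck`.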


### Model Performance
The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
built from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-RoBERTa-Large outperforms all
existing specialized fact-checkers of a similar scale by a large margin, but is 2% worse than our best model, MiniCheck-Flan-T5-Large. See the full results in our paper.

Note: We only evaluated the performance of our models on real claims -- without any human intervention of
any kind, such as injecting certain error types into model-generated claims. Such edited claims do not reflect
LLMs' actual behaviors.


# Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and install the necessary packages from `requirements.txt`.

### A simple use case

```python
from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
scorer = MiniCheck(model_name='roberta-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label)  # [1, 0]
print(raw_prob)    # [0.9581979513168335, 0.031335990875959396]
```
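
In this example, `raw_prob` holds the model's probability that each claim is supported, and `pred_label` is the corresponding binary decision (presumably thresholded at 0.5, though we have not verified the exact cutoff in the implementation).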

### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

# load the 13K test examples
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='roberta-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~15 mins, depending on hardware
```

To evaluate the results on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

# per-dataset balanced accuracy, plus the average across datasets
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
print(result_df.round(1))
```
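
For reference, balanced accuracy (BAcc) is the unweighted mean of recall on the supported and unsupported classes, i.e. (TPR + TNR) / 2, so it is robust to the label imbalance that varies across the benchmark's datasets.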

# Citation

```bibtex
@misc{tang2024minicheck,
  title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
  author={Liyan Tang and Philippe Laban and Greg Durrett},
  year={2024},
  eprint={2404.10774},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```