Update README.md
README.md
CHANGED
@@ -20,21 +20,21 @@ on 14K synthetic data generated from scratch in a structed way (more details in
 
 
 ### Model Variants
-We also have other
-- [
-- [lytang/MiniCheck-
+We also have other three MiniCheck model variants:
+- [bespokelabs/Bespoke-Minicheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
+- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
+- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)
 
 
 ### Model Performance
 
 <p align="center">
-<img src="./
+<img src="./performance_focused.png" width="550">
 </p>
 
 The performance of these models is evaluated on our new collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
 from 10 recent human annotated datasets on fact-checking and grounding LLM generations. MiniCheck-RoBERTa-Large outperform all
-exisiting specialized fact-checkers with a similar scale by a large margin
-is on par with GPT-4 but 400x cheaper. See full results in our work.
+exisiting specialized fact-checkers with a similar scale by a large margin. See full results in our work.
 
 Note: We only evaluated the performance of our models on real claims -- without any human intervention in
 any format, such as injecting certain error types into model-generated claims. Those edited claims do not reflect
@@ -50,12 +50,15 @@ Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and i
 
 ```python
 from minicheck.minicheck import MiniCheck
+import os
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
 doc = "A group of students gather in the school library to study for their upcoming final exams."
 claim_1 = "The students are preparing for an examination."
 claim_2 = "The students are on vacation."
 
-# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
-scorer = MiniCheck(model_name='roberta-large',
+# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
+scorer = MiniCheck(model_name='roberta-large', cache_dir='./ckpts')
 pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])
 print(pred_label) # [1, 0]
 print(raw_prob) # [0.9581979513168335, 0.031335990875959396]
@@ -67,14 +70,16 @@ print(raw_prob) # [0.9581979513168335, 0.031335990875959396]
 import pandas as pd
 from datasets import load_dataset
 from minicheck.minicheck import MiniCheck
+import os
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"
 
 # load 13K test data
 df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
 docs = df.doc.values
 claims = df.claim.values
 
-scorer = MiniCheck(model_name='roberta-large',
-pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~
+scorer = MiniCheck(model_name='roberta-large', cache_dir='./ckpts')
+pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 800 docs/min, depending on hardware
 ```
 
 To evalaute the result on the benchmark
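The diff stops at the context line that introduces benchmark evaluation, so the README's own evaluation code is not visible here. As a rough illustration only, the sketch below scores predictions per dataset with balanced accuracy (the mean of per-class recall). The choice of metric, the helper names `balanced_accuracy` and `balanced_accuracy_by_dataset`, and the idea that the benchmark exposes a dataset name and a gold label per row are all assumptions, not something this diff confirms. The toy lists stand in for those columns and for `pred_label`.

```python
from collections import defaultdict


def balanced_accuracy(labels, preds):
    """Mean of per-class recall over the classes present in `labels`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def balanced_accuracy_by_dataset(datasets, labels, preds):
    """Group examples by dataset name and score each group separately."""
    grouped = defaultdict(lambda: ([], []))
    for name, y, p in zip(datasets, labels, preds):
        grouped[name][0].append(y)
        grouped[name][1].append(p)
    return {name: balanced_accuracy(ys, ps) for name, (ys, ps) in grouped.items()}


# Hypothetical toy values standing in for the benchmark's dataset names,
# gold labels, and the pred_label list returned by scorer.score(...).
datasets = ["A", "A", "A", "A", "B", "B"]
labels = [1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 0, 1, 1]

scores = balanced_accuracy_by_dataset(datasets, labels, preds)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Balanced accuracy is used in this sketch because it stays informative when supported and unsupported claims are unevenly distributed within a dataset; plain accuracy would be a simpler alternative if the classes are roughly balanced.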