lytang committed on
Commit: e9c9374
Parent: 2f224c8

Update README.md

Files changed (1)
  1. README.md +15 -10
README.md CHANGED
@@ -20,21 +20,21 @@ on 14K synthetic data generated from scratch in a structured way (more details in


### Model Variants
- We also have two other MiniCheck model variants:
- - [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large)
- - [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large)
+ We also have three other MiniCheck model variants:
+ - [bespokelabs/Bespoke-MiniCheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
+ - [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
+ - [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)


### Model Performance

<p align="center">
- <img src="./cost-vs-bacc.png" width="360">
+ <img src="./performance_focused.png" width="550">
</p>

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
built from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-RoBERTa-Large outperforms all
- existing specialized fact-checkers of a similar scale by a large margin but is 2% worse than our best model MiniCheck-Flan-T5-Large, which
- is on par with GPT-4 but 400x cheaper. See full results in our work.
+ existing specialized fact-checkers of a similar scale by a large margin. See full results in our work.

Note: We only evaluated the performance of our models on real claims -- without any human intervention in
any format, such as injecting certain error types into model-generated claims. Those edited claims do not reflect
@@ -50,12 +50,15 @@ Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and i

```python
from minicheck.minicheck import MiniCheck
+ import os
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

- # model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
- scorer = MiniCheck(model_name='roberta-large', device=f'cuda:0', cache_dir='./ckpts')
+ # model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
+ scorer = MiniCheck(model_name='roberta-large', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])
print(pred_label) # [1, 0]
print(raw_prob) # [0.9581979513168335, 0.031335990875959396]
@@ -67,14 +70,16 @@ print(raw_prob) # [0.9581979513168335, 0.031335990875959396]
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck
+ import os
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load 13K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

- scorer = MiniCheck(model_name='roberta-large', device=f'cuda:0', cache_dir='./ckpts')
- pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 15 mins, depending on hardware
+ scorer = MiniCheck(model_name='roberta-large', cache_dir='./ckpts')
+ pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 800 docs/min, depending on hardware
```

To evaluate the result on the benchmark
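
The evaluation code itself falls outside this hunk. Purely as an illustrative sketch (not the repo's snippet), the benchmark predictions above could be scored with balanced accuracy, the metric the replaced cost-vs-bacc figure refers to, assuming the LLM-AggreFact test split also carries gold `label` and source `dataset` columns:

```python
# Illustrative sketch, not from the MiniCheck repo: score the benchmark
# predictions with balanced accuracy (BAcc), overall and per source dataset.
# Assumes the LLM-AggreFact test split exposes 'label' and 'dataset' columns.
from sklearn.metrics import balanced_accuracy_score

df['pred_label'] = pred_label  # predictions from scorer.score(...) above

print(f"Overall BAcc: {balanced_accuracy_score(df.label, df.pred_label):.3f}")
for name, group in df.groupby('dataset'):
    print(f"{name}: {balanced_accuracy_score(group.label, group.pred_label):.3f}")
```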
 
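One more illustrative aside on the two-claim example earlier (again, not part of this commit): assuming `raw_prob` is the per-claim probability of support behind the binary `pred_label`, as the printed example suggests, a stricter confidence cutoff can be applied directly:

```python
# Illustrative sketch, not from the MiniCheck repo: keep only claims whose
# support probability clears a stricter cutoff than the default binary label.
# Assumes raw_prob is the output of the earlier two-claim example.
CUTOFF = 0.9  # hypothetical application-specific threshold

example_claims = [claim_1, claim_2]
high_confidence = [c for c, p in zip(example_claims, raw_prob) if p >= CUTOFF]
print(high_confidence)  # ['The students are preparing for an examination.']
```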