zli12321 committed on
Commit
329bec2
·
verified ·
1 Parent(s): d15f109

Update README.md

Files changed (1)
  1. README.md +167 -143
README.md CHANGED
@@ -9,54 +9,57 @@ metrics:
9
  - bertscore
10
  pipeline_tag: text-classification
11
  ---
12
- # QA-Evaluation-Metrics
13
 
14
  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
- [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
 
17
- QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models. It provides various basic and efficient metrics to assess the performance of QA models.
18
 
19
- ### Updates
20
- - Updated to version 0.2.17
21
- - Now supports prompting OpenAI GPT-series models and Claude-series models (assuming openai version > 1.0).
22
- - Supports prompting various open source models such as LLaMA-2-70B-chat, LLaVA-1.5 etc by calling API from [deepinfra](https://deepinfra.com/models).
23
- - Added trained tiny-bert for QA evaluation. Model size is 18 MB.
24
- - Pass huggingface repository name to download model directly for TransformerMatcher
25
 
 
 
 
 
 
 
 
26
 
27
- ## Installation
28
- * Python version >= 3.6
29
- * openai version >= 1.0
30
 
 
 
 
31
 
32
- To install the package, run the following command:
33
-
34
  ```bash
35
  pip install qa-metrics
36
  ```
37
 
38
- ## Usage/Logistics
39
 
40
- The python package currently provides six QA evaluation methods.
41
- - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), the evaluation returns True if the candidate answer matches any one of the gold answers, and False otherwise.
42
- - Different evaluation methods have distinct strictness of evaluating the correctness of a candidate answer. Some have higher correlation with human judgments than others.
43
- - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, Hotpot QA, TriviaQA, SQuAD, etc.
44
- - Question/Answer Type Evaluation and Transformer Neural evaluations are cost-free and suitable for short-form and longer-form QA datasets. They correlate better with human judgments than exact match and F1 score when the gold and candidate answers are long.
45
- - Black-box LLM evaluations are closest to human evaluations, and they are not cost-free.
46
 
47
- ## Normalized Exact Match
48
- #### `em_match`
 
 
 
 
 
49
 
50
- Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.
51
 
52
- **Parameters**
53
 
54
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
55
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
 
 
56
 
57
  **Returns**
58
-
59
- - `boolean`: A boolean True/False signifying matches between reference or candidate answers.
60
 
61
  ```python
62
  from qa_metrics.em import em_match
@@ -64,202 +67,223 @@ from qa_metrics.em import em_match
64
  reference_answer = ["The Frog Prince", "The Princess and the Frog"]
65
  candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
66
  match_result = em_match(reference_answer, candidate_answer)
67
- print("Exact Match: ", match_result)
68
- '''
69
- Exact Match: False
70
- '''
71
  ```
72
 
73
- ## F1 Score
74
- #### `f1_score_with_precision_recall`
75
-
76
- Calculates F1 score, precision, and recall between a reference and a candidate answer.
77
 
 
78
  **Parameters**
79
-
80
- - `reference_answer` (str): A gold (correct) answer to the question.
81
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
82
 
83
  **Returns**
 
84
 
85
- - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.
 
 
 
 
 
 
 
86
 
87
  ```python
88
- from qa_metrics.f1 import f1_match,f1_score_with_precision_recall
89
 
90
  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
91
- print("F1 stats: ", f1_stats)
92
- '''
93
- F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
94
- '''
95
-
96
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
97
- print("F1 Match: ", match_result)
98
- '''
99
- F1 Match: False
100
- '''
101
  ```
102
 
103
- ## Efficient and Robust Question/Answer Type Evaluation
104
- #### 1. `get_highest_score`
105
-
106
- Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for evaluating the closest match to a given candidate response based on a list of reference answers.
107
 
 
108
  **Parameters**
109
-
110
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
111
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
112
- - `question` (str): The question for which the answers are being evaluated.
113
 
114
  **Returns**
 
115
 
116
- - `dictionary`: A dictionary containing the gold answer and candidate answer that have the highest matching score.
117
-
118
- #### 2. `get_scores`
 
 
119
 
120
- Returns all the gold answer and candidate answer pairs' matching scores.
 
121
 
 
122
  **Parameters**
123
-
124
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
125
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
126
- - `question` (str): The question for which the answers are being evaluated.
127
 
128
  **Returns**
 
129
 
130
- - `dictionary`: A dictionary containing gold answers and the candidate answer's matching score.
131
-
132
- #### 3. `evaluate`
 
 
133
 
134
- Returns True if the candidate answer is a match of any of the gold answers.
 
135
 
 
136
  **Parameters**
137
-
138
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
139
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
140
- - `question` (str): The question for which the answers are being evaluated.
141
 
142
  **Returns**
 
143
 
144
- - `boolean`: A boolean True/False signifying matches between reference or candidate answers.
 
 
 
 
145
 
 
 
146
 
147
  ```python
148
  from qa_metrics.pedant import PEDANT
149
 
150
- question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
151
  pedant = PEDANT()
152
  scores = pedant.get_scores(reference_answer, candidate_answer, question)
153
- max_pair, highest_scores = pedant.get_highest_score(reference_answer, candidate_answer, question)
154
  match_result = pedant.evaluate(reference_answer, candidate_answer, question)
155
- print("Max Pair: %s; Highest Score: %s" % (max_pair, highest_scores))
156
- print("Score: %s; PANDA Match: %s" % (scores, match_result))
157
- '''
158
- Max Pair: ('the princess and the frog', 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'); Highest Score: 0.854451712151719
159
- Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PANDA Match: True
160
- '''
161
- ```
162
-
163
- ```python
164
- print(pedant.get_score(reference_answer[1], candidate_answer, question))
165
- '''
166
- 0.7122460127464126
167
- '''
168
  ```
169
 
170
- ## Transformer Neural Evaluation
171
- Our fine-tuned BERT model is on 🤗 [Huggingface](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our package also supports downloading and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥
172
 
173
- #### `transformer_match`
 
 
 
 
174
 
175
- Returns True if the candidate answer is a match of any of the gold answers.
 
176
 
 
177
  **Parameters**
 
 
 
178
 
179
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
180
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
181
- - `question` (str): The question for which the answers are being evaluated.
 
 
 
 
 
182
 
183
  **Returns**
 
184
 
185
- - `boolean`: A boolean True/False signifying matches between reference or candidate answers.
 
 
 
 
 
 
 
186
 
187
  ```python
188
  from qa_metrics.transformerMatcher import TransformerMatcher
189
 
190
- question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
191
- # Supported models: roberta-large, roberta, bert, distilbert, distilroberta
192
- tm = TransformerMatcher("zli12321/answer_equivalence_bert")
193
- scores = tm.get_scores(reference_answer, candidate_answer, question)
194
  match_result = tm.transformer_match(reference_answer, candidate_answer, question)
195
- print("Score: %s; bert Match: %s" % (scores, match_result))
196
- '''
197
- Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
198
- '''
199
  ```
200
 
201
- ## Prompting LLM For Evaluation
202
 
203
- Note: The prompting function can be used for any prompting purposes.
 
 
 
 
 
204
 
205
- ###### OpenAI
206
  ```python
207
  from qa_metrics.prompt_llm import CloseLLM
 
208
  model = CloseLLM()
209
  model.set_openai_api_key(YOUR_OPENAI_KEY)
210
- prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
211
- model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
212
-
213
- '''
214
- 'correct'
215
- '''
216
  ```
217
 
218
- ###### Anthropic
 
 
 
 
 
 
 
219
  ```python
220
  model = CloseLLM()
221
- model.set_anthropic_api_key(YOUR_Anthropic_KEY)
222
- model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
223
-
224
- '''
225
- 'correct'
226
- '''
227
  ```
228
 
229
- ###### deepinfra (See below for descriptions of more models)
 
 
 
 
 
 
230
  ```python
231
  from qa_metrics.prompt_open_llm import OpenLLM
 
232
  model = OpenLLM()
233
  model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
234
- model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
235
-
236
- '''
237
- 'correct'
238
- '''
239
  ```
240
 
241
- If you find this repo useful, please cite our paper:
242
  ```bibtex
243
- @misc{li2024panda,
244
- title={PANDA (Pedantic ANswer-correctness Determination and Adjudication):Improving Automatic Evaluation for Question Answering and Text Generation},
245
  author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
246
  year={2024},
247
  eprint={2402.11161},
248
  archivePrefix={arXiv},
249
- primaryClass={cs.CL}
 
250
  }
251
  ```
252
 
 
253
 
254
- ## Updates
255
- - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset is expanded and leaderboard is updated.
256
- - Our Training Dataset is adapted and augmented from [Bulian et al](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and QA evaluation testing sets discussed in our paper.
257
- - Now our model supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), a smaller and more robust matching model than Bert!
258
-
259
- ## License
260
-
261
- This project is licensed under the [MIT License](LICENSE.md) - see the LICENSE file for details.
262
 
263
- ## Contact
264
 
265
- For any additional questions or comments, please contact [zli12321@umd.edu].
 
9
  - bertscore
10
  pipeline_tag: text-classification
11
  ---
12
+ # QA-Evaluation-Metrics 📊
13
 
14
  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)
16
 
17
+ > A fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models.
18
 
19
+ ## 🎉 Latest Updates
 
 
 
 
 
20
 
21
+ - **Version 0.2.19 Released!**
22
+ - Paper accepted to EMNLP 2024 Findings! 🎓
23
+ - Enhanced PEDANTS with multi-pipeline support and improved edge case handling
24
+ - Added support for OpenAI GPT-series and Claude Series models (OpenAI version > 1.0)
25
+ - Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models)
26
+ - Introduced trained tiny-bert for QA evaluation (18MB model size)
27
+ - Added direct Huggingface model download support for TransformerMatcher
28
 
29
+ ## 🚀 Quick Start
 
 
30
 
31
+ ### Prerequisites
32
+ - Python >= 3.6
33
+ - openai >= 1.0
34
 
35
+ ### Installation
 
36
  ```bash
37
  pip install qa-metrics
38
  ```
39
 
40
+ ## 💡 Features
41
 
42
+ Our package offers six QA evaluation methods with varying strengths:
 
 
 
 
 
43
 
44
+ | Method | Best For | Cost | Correlation with Human Judgment |
45
+ |--------|----------|------|--------------------------------|
46
+ | Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
47
+ | PEDANTS | Both short & medium-form QA | Free | Very High |
48
+ | [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short & long-form QA | Free | High |
49
+ | [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |
50
+ | Black-box LLM Evaluation | All QA types | Paid | Highest |
51
 
52
+ ## 📖 Documentation
53
 
54
+ ### 1. Normalized Exact Match
55
 
56
+ #### Method: `em_match`
57
+ **Parameters**
58
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question
59
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated
60
 
61
  **Returns**
62
+ - `boolean`: True if there are any exact normalized matches between gold and candidate answers
 
63
 
64
  ```python
65
  from qa_metrics.em import em_match
 
67
  reference_answer = ["The Frog Prince", "The Princess and the Frog"]
68
  candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
69
  match_result = em_match(reference_answer, candidate_answer)
 
 
 
 
70
  ```
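To make the idea of a *normalized* exact match concrete, here is a minimal standalone sketch. It assumes a SQuAD-style normalization (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace). The names `normalize_answer` and `exact_match_sketch` are hypothetical, and this is not the package's own implementation, whose normalization rules may differ.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_sketch(reference_answers: List[str], candidate_answer: str) -> bool:
    """Return True if the normalized candidate equals any normalized reference."""
    candidate = normalize_answer(candidate_answer)
    return any(normalize_answer(ref) == candidate for ref in reference_answers)


references = ["The Frog Prince", "The Princess and the Frog"]
print(exact_match_sketch(references, "the princess and the frog!"))
print(exact_match_sketch(references, "A movie about a frog"))
```

Under this sketch, a long candidate sentence that merely contains a gold answer does not count as a match, which is why exact match is best suited to short-form answers.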
71
 
72
+ ### 2. F1 Score
 
 
 
73
 
74
+ #### Method: `f1_score_with_precision_recall`
75
  **Parameters**
76
+ - `reference_answer` (str): A gold (correct) answer to the question
77
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated
 
78
 
79
  **Returns**
80
+ - `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer
81
 
82
+ #### Method: `f1_match`
83
+ **Parameters**
84
+ - `reference_answer` (list of str): List of gold answers
85
+ - `candidate_answer` (str): Candidate answer to evaluate
86
+ - `threshold` (float): F1 score threshold for considering a match (default: 0.5)
87
+
88
+ **Returns**
89
+ - `boolean`: True if F1 score exceeds threshold for any gold answer
90
 
91
  ```python
92
+ from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
93
 
94
  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
 
 
 
 
 
95
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
 
 
 
 
96
  ```
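The relationship between the two F1 methods can be illustrated with a small standalone sketch. It assumes the common token-overlap definition of F1 on lowercased, whitespace-split tokens, and a strict "greater than" comparison against the threshold. The names `token_f1` and `f1_match_sketch` are hypothetical; the package's own tokenization and its precision/recall conventions may differ from this sketch.

```python
from collections import Counter
from typing import Dict, List


def token_f1(reference: str, candidate: str) -> Dict[str, float]:
    """Token-overlap precision, recall, and F1 on lowercased whitespace tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def f1_match_sketch(references: List[str], candidate: str, threshold: float = 0.5) -> bool:
    """Return True if the F1 against any reference exceeds the threshold."""
    return any(token_f1(ref, candidate)["f1"] > threshold for ref in references)


print(token_f1("the frog prince", "the frog"))
print(f1_match_sketch(["the frog prince"], "the frog"))
```

Taking the best score over all gold answers means a candidate only needs to be close to one acceptable answer, not to all of them.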
97
 
98
+ ### 3. PEDANTS
 
 
 
99
 
100
+ #### Method: `get_score`
101
  **Parameters**
102
+ - `reference_answer` (str): A gold answer
103
+ - `candidate_answer` (str): Candidate answer to evaluate
104
+ - `question` (str): The question being evaluated
 
105
 
106
  **Returns**
107
+ - `float`: The similarity score between two strings (0 to 1)
108
 
109
+ #### Method: `get_highest_score`
110
+ **Parameters**
111
+ - `reference_answer` (list of str): List of gold answers
112
+ - `candidate_answer` (str): Candidate answer to evaluate
113
+ - `question` (str): The question being evaluated
114
 
115
+ **Returns**
116
+ - `dictionary`: Contains the gold answer and candidate answer pair with highest matching score
117
 
118
+ #### Method: `get_scores`
119
  **Parameters**
120
+ - `reference_answer` (list of str): List of gold answers
121
+ - `candidate_answer` (str): Candidate answer to evaluate
122
+ - `question` (str): The question being evaluated
 
123
 
124
  **Returns**
125
+ - `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
126
 
127
+ #### Method: `evaluate`
128
+ **Parameters**
129
+ - `reference_answer` (list of str): List of gold answers
130
+ - `candidate_answer` (str): Candidate answer to evaluate
131
+ - `question` (str): The question being evaluated
132
 
133
+ **Returns**
134
+ - `boolean`: True if candidate answer matches any gold answer
135
 
136
+ #### Method: `get_question_type`
137
  **Parameters**
138
+ - `reference_answer` (list of str): List of gold answers
139
+ - `question` (str): The question being evaluated
 
 
140
 
141
  **Returns**
142
+ - `list`: The type of the question (what, who, when, how, why, which, where)
143
 
144
+ #### Method: `get_judgement_type`
145
+ **Parameters**
146
+ - `reference_answer` (list of str): List of gold answers
147
+ - `candidate_answer` (str): Candidate answer to evaluate
148
+ - `question` (str): The question being evaluated
149
 
150
+ **Returns**
151
+ - `list`: A list of revised rules applicable to judging answer correctness
152
 
153
  ```python
154
  from qa_metrics.pedant import PEDANT
155
 
 
156
  pedant = PEDANT()
157
  scores = pedant.get_scores(reference_answer, candidate_answer, question)
 
158
  match_result = pedant.evaluate(reference_answer, candidate_answer, question)
159
  ```
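As a rough illustration of how per-pair scores relate to a single best pair and a final decision, the sketch below post-processes a score dictionary in plain Python. It assumes the nested layout `{gold_answer: {candidate_answer: score}}`; that layout, the helper names `best_pair` and `is_match`, and the 0.5 cutoff are assumptions made for illustration, and the numbers are placeholders, not real model outputs.

```python
from typing import Dict, Tuple

Scores = Dict[str, Dict[str, float]]


def best_pair(scores: Scores) -> Tuple[Tuple[str, str], float]:
    """Return ((reference, candidate), score) for the highest-scoring pair."""
    flat = [
        ((ref, cand), score)
        for ref, by_candidate in scores.items()
        for cand, score in by_candidate.items()
    ]
    if not flat:
        raise ValueError("scores is empty")
    return max(flat, key=lambda item: item[1])


def is_match(scores: Scores, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: accept if the best pair reaches the cutoff."""
    return best_pair(scores)[1] >= threshold


# Placeholder scores for one candidate against two gold answers.
example_scores = {
    "the frog prince": {"candidate text": 0.71},
    "the princess and the frog": {"candidate text": 0.85},
}
pair, score = best_pair(example_scores)
print(pair, score, is_match(example_scores))
```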
160
 
161
+ ### 4. Transformer Neural Evaluation
 
162
 
163
+ #### Method: `get_score`
164
+ **Parameters**
165
+ - `reference_answer` (str): A gold answer
166
+ - `candidate_answer` (str): Candidate answer to evaluate
167
+ - `question` (str): The question being evaluated
168
 
169
+ **Returns**
170
+ - `float`: The similarity score between two strings (0 to 1)
171
 
172
+ #### Method: `get_highest_score`
173
  **Parameters**
174
+ - `reference_answer` (list of str): List of gold answers
175
+ - `candidate_answer` (str): Candidate answer to evaluate
176
+ - `question` (str): The question being evaluated
177
 
178
+ **Returns**
179
+ - `dictionary`: Contains the gold answer and candidate answer pair with highest matching score
180
+
181
+ #### Method: `get_scores`
182
+ **Parameters**
183
+ - `reference_answer` (list of str): List of gold answers
184
+ - `candidate_answer` (str): Candidate answer to evaluate
185
+ - `question` (str): The question being evaluated
186
 
187
  **Returns**
188
+ - `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
189
 
190
+ #### Method: `transformer_match`
191
+ **Parameters**
192
+ - `reference_answer` (list of str): List of gold answers
193
+ - `candidate_answer` (str): Candidate answer to evaluate
194
+ - `question` (str): The question being evaluated
195
+
196
+ **Returns**
197
+ - `boolean`: True if transformer model considers candidate answer equivalent to any gold answer
198
 
199
  ```python
200
  from qa_metrics.transformerMatcher import TransformerMatcher
201
 
202
+ # Also supports `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`
203
+ tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
 
 
204
  match_result = tm.transformer_match(reference_answer, candidate_answer, question)
 
 
 
 
205
  ```
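Because each matcher documented above returns a boolean per example, dataset-level accuracy is just the fraction of accepted candidates. The sketch below shows that aggregation for any callable with the signature `(references, candidate, question) -> bool`. The helper `accuracy`, the example records, and the `substring_matcher` stand-in are all hypothetical; in practice you would pass a real matcher such as a bound `transformer_match` or `evaluate` method instead.

```python
from typing import Callable, Dict, List

Matcher = Callable[[List[str], str, str], bool]


def accuracy(examples: List[Dict], matcher: Matcher) -> float:
    """Fraction of examples whose candidate answer the matcher accepts."""
    if not examples:
        return 0.0
    hits = sum(
        bool(matcher(ex["references"], ex["candidate"], ex["question"]))
        for ex in examples
    )
    return hits / len(examples)


def substring_matcher(references: List[str], candidate: str, question: str) -> bool:
    """Stand-in matcher: accept if any reference appears inside the candidate."""
    return any(ref.lower() in candidate.lower() for ref in references)


examples = [
    {"question": "What is the capital of France?",
     "references": ["Paris"], "candidate": "The capital is Paris"},
    {"question": "What is the capital of France?",
     "references": ["Paris"], "candidate": "Lyon"},
]
print(accuracy(examples, substring_matcher))
```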
206
 
207
+ ### 5. LLM Integration
208
 
209
+ #### Method: `prompt_gpt`
210
+ **Parameters**
211
+ - `prompt` (str): The input prompt text
212
+ - `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
213
+ - `temperature` (float): Controls randomness (0-1)
214
+ - `max_tokens` (int): Maximum tokens in response
215
 
 
216
  ```python
217
  from qa_metrics.prompt_llm import CloseLLM
218
+
219
  model = CloseLLM()
220
  model.set_openai_api_key(YOUR_OPENAI_KEY)
221
+ result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
 
 
 
 
 
222
  ```
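The snippet above expects a `prompt` string to already exist. One way to build such a prompt for answer-correctness judging, and to turn the model's free-text reply into a boolean, is sketched below. The template follows a "question / reference / candidate" layout ending in a request to output only "correct" or "incorrect"; the helper names `build_judge_prompt` and `parse_verdict` are hypothetical and not part of the package.

```python
from typing import Optional


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Format a correctness-judging prompt for an LLM."""
    return (
        f"question: {question}\n"
        f"reference: {reference}\n"
        f"candidate: {candidate}\n"
        "Is the candidate answer correct based on the question and reference answer? "
        "Please only output correct or incorrect."
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a reply to True/False, or None if no verdict is recognizable."""
    text = reply.strip().lower()
    # Check "incorrect" first because it contains "correct" as a substring.
    if "incorrect" in text:
        return False
    if "correct" in text:
        return True
    return None


prompt = build_judge_prompt(
    "What is the capital of France?", "Paris", "The capital is Paris"
)
print(prompt)
print(parse_verdict("Correct"))
```

Returning `None` for unrecognized replies keeps malformed model output from being silently counted as either verdict.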
223
 
224
+ #### Method: `prompt_claude`
225
+ **Parameters**
226
+ - `prompt` (str): The input prompt text
227
+ - `model_engine` (str): Claude model to use
228
+ - `anthropic_version` (str): API version
229
+ - `max_tokens_to_sample` (int): Maximum tokens in response
230
+ - `temperature` (float): Controls randomness (0-1)
231
+
232
  ```python
233
  model = CloseLLM()
234
+ model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
235
+ result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
 
 
 
 
236
  ```
237
 
238
+ #### Method: `prompt`
239
+ **Parameters**
240
+ - `message` (str): The input message text
241
+ - `model_engine` (str): Model to use
242
+ - `temperature` (float): Controls randomness (0-1)
243
+ - `max_tokens` (int): Maximum tokens in response
244
+
245
  ```python
246
  from qa_metrics.prompt_open_llm import OpenLLM
247
+
248
  model = OpenLLM()
249
  model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
250
+ result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
 
 
 
 
251
  ```
252
 
253
+ ## 🤗 Model Hub
254
+
255
+ Our fine-tuned models are available on Huggingface:
256
+ - [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)
257
+ - [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)
258
+ - [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)
259
+ - [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)
260
+ - [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)
261
+ - [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)
262
+
263
+ ## 📚 Resources
264
+
265
+ - [Full Paper](https://arxiv.org/abs/2402.11161)
266
+ - [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)
267
+ - [Supported Models on Deepinfra](https://deepinfra.com/models)
268
+
269
+ ## 📄 Citation
270
+
271
  ```bibtex
272
+ @misc{li2024pedantspreciseevaluationsdiverse,
273
+ title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
274
  author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
275
  year={2024},
276
  eprint={2402.11161},
277
  archivePrefix={arXiv},
278
+ primaryClass={cs.CL},
279
+ url={https://arxiv.org/abs/2402.11161},
280
  }
281
  ```
282
 
283
+ ## 📝 License
284
 
285
+ This project is licensed under the [MIT License](LICENSE.md).
 
 
 
 
 
 
 
286
 
287
+ ## 📬 Contact
288
 
289
+ For questions or comments, please contact: zli12321@umd.edu