---
inference: false
license: mit
language:
- en
metrics:
- exact_match
- f1
- bertscore
pipeline_tag: text-classification
tags:
- question-answering
- evaluation
- text
datasets:
- zli12321/pedants_qa_evaluation_bench
---
# QA-Evaluation-Metrics
[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)
> A fast, lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.
> `pip install qa-metrics` is all you need!
## Latest Updates
- **Version 0.2.19 Released!**
- Paper accepted to EMNLP 2024 Findings!
- Enhanced PEDANTS with multi-pipeline support and improved edge case handling
- Added support for OpenAI GPT-series and Claude-series models (requires `openai` >= 1.0)
- Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models)
- Introduced trained tiny-bert for QA evaluation (18MB model size)
- Added direct Huggingface model download support for TransformerMatcher
## Quick Start
## Table of Contents
* 1. [Normalized Exact Match](#em)
* 2. [Token F1 Score](#f1)
* 3. [PEDANTS](#pedants)
* 4. [Finetuned Neural Matching](#neural)
* 5. [Prompting LLM](#llm)
### Prerequisites
- Python >= 3.6
- openai >= 1.0
### Installation
```bash
pip install qa-metrics
```
## Features
Our package offers six QA evaluation methods with varying strengths:
| Method | Best For | Cost | Correlation with Human Judgment |
|--------|----------|------|--------------------------------|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
| PEDANTS | Both short & medium-form QA | Free | Very High |
| [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short & long-form QA | Free | High |
| [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |
| Black-box LLM Evaluation | All QA types | Paid | Highest |
## Documentation
### 1. <a name='em'></a>Normalized Exact Match
#### Method: `em_match`
**Parameters**
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated
**Returns**
- `boolean`: True if there are any exact normalized matches between gold and candidate answers
```python
from qa_metrics.em import em_match
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
```
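To make "normalized" concrete, here is a minimal sketch of a normalized exact-match check. It assumes SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace); the package's own normalization rules may differ. `normalize_answer` and `sketch_em_match` are hypothetical names used only for illustration and are not part of `qa_metrics`.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily what qa_metrics does.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sketch_em_match(reference_answers: List[str], candidate_answer: str) -> bool:
    """Return True if the normalized candidate equals any normalized reference."""
    candidate = normalize_answer(candidate_answer)
    return any(normalize_answer(ref) == candidate for ref in reference_answers)


references = ["The Frog Prince", "The Princess and the Frog"]
print(sketch_em_match(references, "the princess and the frog!"))  # True
print(sketch_em_match(references, "A movie about the Princess and the Frog"))  # False
```

Under this sketch, a long candidate sentence that merely contains a gold answer does not count as a match, which is why exact match suits short-form QA best.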
### 2. <a name='f1'></a>F1 Score
#### Method: `f1_score_with_precision_recall`
**Parameters**
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated
**Returns**
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer
#### Method: `f1_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)
**Returns**
- `boolean`: True if F1 score exceeds threshold for any gold answer
```python
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
```
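The token-level computation behind these two methods can be sketched as follows. This is an illustrative re-implementation under simple assumptions (lowercasing and whitespace tokenization), not the package's code; `sketch_f1_stats`, `sketch_f1_match`, and the dictionary keys are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def sketch_f1_stats(reference_answer: str, candidate_answer: str) -> Dict[str, float]:
    """Token-overlap precision, recall, and F1 for one gold/candidate pair."""
    ref_tokens = reference_answer.lower().split()
    cand_tokens = candidate_answer.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def sketch_f1_match(reference_answers: List[str], candidate_answer: str,
                    threshold: float = 0.5) -> bool:
    """Return True if F1 exceeds the threshold for any gold answer."""
    return any(
        sketch_f1_stats(ref, candidate_answer)["f1"] > threshold
        for ref in reference_answers
    )


stats = sketch_f1_stats("the frog prince", "the frog prince movie")
print(stats)  # precision 0.75, recall 1.0, F1 = 6/7
print(sketch_f1_match(["The Frog Prince"], "the frog prince movie"))  # True
```

Raising `threshold` makes the match stricter; note that in this sketch a score exactly equal to the threshold does not count as a match.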
### 3. <a name='pedants'></a>PEDANTS
#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `float`: The similarity score between two strings (0 to 1)
#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with highest matching score
#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
#### Method: `evaluate`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `boolean`: True if candidate answer matches any gold answer
#### Method: `get_question_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated
**Returns**
- `list`: The type of the question (what, who, when, how, why, which, where)
#### Method: `get_judgement_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `list`: A list of revised rules applicable to judging answer correctness
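```python
from qa_metrics.pedant import PEDANT

# Reuses reference_answer and candidate_answer from the exact-match example above.
# The question string below is an illustrative placeholder.
question = "Which movie is loosely based on the Brothers Grimm's \"Iron Henry\"?"
pedant = PEDANT()
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
```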
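The relationship between `get_scores`, `get_highest_score`, and `evaluate` can be sketched roughly as below. This is only a conceptual illustration: `stand_in_score` uses `difflib` string similarity as a placeholder and is not the PEDANTS classifier, it ignores the question, and the 0.5 decision threshold and the return shapes are assumptions rather than the package's actual behavior.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def stand_in_score(reference: str, candidate: str) -> float:
    """Placeholder similarity in [0, 1]; NOT the PEDANTS model."""
    return SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()


def sketch_get_scores(references: List[str], candidate: str) -> Dict[str, float]:
    """Score the candidate against every gold answer."""
    return {ref: stand_in_score(ref, candidate) for ref in references}


def sketch_get_highest_score(references: List[str], candidate: str) -> Tuple[str, float]:
    """Pick the gold answer with the highest score for this candidate."""
    scores = sketch_get_scores(references, candidate)
    best = max(scores, key=scores.get)
    return best, scores[best]


def sketch_evaluate(references: List[str], candidate: str,
                    threshold: float = 0.5) -> bool:
    """Accept the candidate if its best score reaches the (assumed) threshold."""
    return sketch_get_highest_score(references, candidate)[1] >= threshold


references = ["The Frog Prince", "The Princess and the Frog"]
print(sketch_get_highest_score(references, "the princess and the frog"))
print(sketch_evaluate(references, "zzz"))  # False
```

The same pattern (score every pair, take the best, then threshold) is one way to read the parallel methods of the transformer matcher in the next section.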
### 4. <a name='neural'></a>Transformer Neural Evaluation
#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `float`: The similarity score between two strings (0 to 1)
#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with highest matching score
#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
#### Method: `transformer_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated
**Returns**
- `boolean`: True if transformer model considers candidate answer equivalent to any gold answer
```python
from qa_metrics.transformerMatcher import TransformerMatcher

# Also supports `zli12321/roberta-large-qa-evaluator`, `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
```
### 5. <a name='llm'></a>LLM Integration
#### Method: `prompt_gpt`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response
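```python
from qa_metrics.prompt_llm import CloseLLM

# YOUR_OPENAI_KEY is a placeholder for your own API key string.
prompt = "Your prompt text here"  # placeholder prompt
model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
```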
#### Method: `prompt_claude`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in response
- `temperature` (float): Controls randomness (0-1)
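```python
from qa_metrics.prompt_llm import CloseLLM

# YOUR_ANTHROPIC_KEY is a placeholder; `prompt` is the prompt string to send.
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
```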
#### Method: `prompt`
**Parameters**
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response
```python
from qa_metrics.prompt_open_llm import OpenLLM

# YOUR_DEEPINFRA_KEY is a placeholder; `prompt` is the prompt string to send.
model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
```
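The prompting methods take a plain prompt string and leave its wording to you. As one possible way to use them for QA evaluation, the sketch below builds a grading prompt and turns the model's reply into a boolean. The template wording, `build_judge_prompt`, and `parse_verdict` are hypothetical and are not provided by `qa_metrics`.

```python
from typing import List


def build_judge_prompt(question: str, reference_answers: List[str],
                       candidate_answer: str) -> str:
    """Assemble a grading prompt (illustrative template, not from the package)."""
    references = "\n".join(f"- {ref}" for ref in reference_answers)
    return (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answers:\n{references}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(response_text: str) -> bool:
    """Map the first word of the model's reply to True (correct) or False."""
    words = response_text.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    return first == "correct"


prompt = build_judge_prompt(
    "Which movie is based on \"Iron Henry\"?",
    ["The Frog Prince", "The Princess and the Frog"],
    "The Princess and the Frog",
)
print(prompt)
print(parse_verdict("Correct."))  # True
```

The resulting `prompt` string can then be passed to `prompt_gpt`, `prompt_claude`, or `prompt` as shown above, and the returned text handed to `parse_verdict`, assuming the reply is returned as a string.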
## Model Hub
Our fine-tuned models are available on Hugging Face:
- [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)
- [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)
- [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)
- [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)
- [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)
- [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)
## Resources
- [Full Paper](https://arxiv.org/abs/2402.11161)
- [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)
- [Supported Models on Deepinfra](https://deepinfra.com/models)
## Citation
```bibtex
@misc{li2024pedantspreciseevaluationsdiverse,
title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
year={2024},
eprint={2402.11161},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.11161},
}
```
## License
This project is licensed under the [MIT License](LICENSE.md).
## Contact
For questions or comments, please contact: zli12321@umd.edu