Project page: https://tifa-benchmark.github.io/

This is the text parsing and question generation model for the ICCV 2023 paper TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image.

Specifically, this fine-tuned LLaMA 2 model is the substitute for the GPT-3 model in the paper. It can parse an arbitrary prompt into visual entities, attributes, relations, etc. and generate question-answer tuples for each of them. See examples below.

QuickStart

All codes are from https://github.com/Yushi-Hu/tifa. Clone this repo to easily use this model together with other modules (e.g. VQA) provided in TIFA.

Please follow the prompt format, which will give the best performance.

import torch
import transformers

# prepare the LLaMA 2 model
model_name = "tifa-benchmark/llama2_tifa_question_generation"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)


# formating prompt following LLaMA 2 style
def create_qg_prompt(caption):
    INTRO_BLURB = "Given an image description, generate one or two multiple-choice questions that verifies if the image description is correct.\nClassify each concept into a type (object, human, animal, food, activity, attribute, counting, color, material, spatial, location, shape, other), and then generate a question for each type.\n"
    formated_prompt = f"<s>[INST] <<SYS>>\n{INTRO_BLURB}\n<</SYS>>\n\n"
    formated_prompt += f"Description: {caption} [/INST] Entities:"
    return formated_prompt


test_caption = "a blue rabbit and a red plane"

# create prompt
prompt = create_qg_prompt(text_caption)

# text completion
sequences = pipeline(
        prompt, do_sample=False, num_beams=5, num_return_sequences=1, max_length=512)
output = sequences[0]['generated_text'][len(prompt):]
output = output.split('\n\n')[0]

# output
print(output)

#### Expected output ###
#  rabbit, plane
# Activites:
# Colors: blue, red
# Counting:
# Other attributes:
# About rabbit (animal):
# Q: is this a rabbit?
# Choices: yes, no
# A: yes
# About rabbit (animal):
# Q: what animal is in the picture?
# Choices: rabbit, dog, cat, fish
# A: rabbit
# About plane (object):
# Q: is this a plane?
# Choices: yes, no
# A: yes
# About plane (object):
# Q: what type of vehicle is this?
# Choices: plane, car, motorcycle, bus
# A: plane
# About blue (color):
# Q: is the rabbit blue?
# Choices: yes, no
# A: yes
# About blue (color):
# Q: what color is the rabbit?
# Choices: blue, red, yellow, green
# A: blue
# About red (color):
# Q: is the plane red?
# Choices: yes, no
# A: yes
# About red (color):
# Q: what color is the plane?
# Choices: red, blue, yellow, green
# A: red

Use this LM under tifascore package

tifascore provides extra functions to parse this output etc. First install tifascore according to https://github.com/Yushi-Hu/tifa. Then the usage is below

from tifascore import get_llama2_pipeline, get_llama2_question_and_answers

pipeline = get_llama2_pipeline("tifa-benchmark/llama2_tifa_question_generation")

print(get_llama2_question_and_answers(pipeline, "a blue rabbit and a red plane"))

#### Expected output ###
# [{'caption': 'a blue rabbit and a red plane', 'element': 'rabbit', 'question': 'what animal is in the picture?', 'choices': ['rabbit', 'dog', 'cat', 'fish'], 'answer': 'rabbit', 'element_type': 'animal/human'}, {'caption': 'a blue rabbit and a red plane', 'element': 'plane', 'question': 'is this a plane?', 'choices': ['yes', 'no'], 'answer': 'yes', 'element_type': 'object'}, {'caption': 'a blue rabbit and a red plane', 'element': 'plane', 'question': 'what type of vehicle is this?', 'choices': ['plane', 'car', 'motorcycle', 'bus'], 'answer': 'plane', 'element_type': 'object'}, {'caption': 'a blue rabbit and a red plane', 'element': 'blue', 'question': 'is the rabbit blue?', 'choices': ['yes', 'no'], 'answer': 'yes', 'element_type': 'color'}, {'caption': 'a blue rabbit and a red plane', 'element': 'blue', 'question': 'what color is the rabbit?', 'choices': ['blue', 'red', 'yellow', 'green'], 'answer': 'blue', 'element_type': 'color'}, {'caption': 'a blue rabbit and a red plane', 'element': 'red', 'question': 'is the plane red?', 'choices': ['yes', 'no'], 'answer': 'yes', 'element_type': 'color'}, {'caption': 'a blue rabbit and a red plane', 'element': 'red', 'question': 'what color is the plane?', 'choices': ['red', 'blue', 'yellow', 'green'], 'answer': 'red', 'element_type': 'color'}]

Bibtex

@article{hu2023tifa,
  title={Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering},
  author={Hu, Yushi and Liu, Benlin and Kasai, Jungo and Wang, Yizhong and Ostendorf, Mari and Krishna, Ranjay and Smith, Noah A},
  journal={arXiv preprint arXiv:2303.11897},
  year={2023}
}