LLaVA-Bench [Download]

Introduction

Large commercial multimodal chatbots have been released this week, including multimodal Bing-Chat and Bard.

These chatbots are presumably powered by proprietary large multimodal models (LMMs). Compared with open-source LMMs such as LLaVA, proprietary LMMs represent the upper bound of what scaling current SoTA techniques can achieve. Both share the goal of building multimodal chatbots that follow human intent to complete a variety of everyday visual tasks in the wild. Although how to evaluate multimodal chat ability remains underexplored, comparing open-source LMMs against the commercial chatbots provides useful feedback. In addition to the LLaVA-Bench (COCO) dataset that we used to develop the early versions of LLaVA, we are releasing LLaVA-Bench (In-the-Wild) to the community for public use.

LLaVA-Bench (In-the-Wild [Ongoing work])

To evaluate the model's capability on more challenging tasks and its generalizability to novel domains, we collect a diverse set of 24 images with 60 questions in total, covering indoor and outdoor scenes, memes, paintings, sketches, etc. Each image is paired with a highly detailed, manually curated description and a suitable selection of questions. This design also assesses the model's robustness to different prompts. In this release, we also group the questions into three categories: conversation (simple QA), detailed description, and complex reasoning. We continue to expand and improve the diversity of LLaVA-Bench (In-the-Wild). We manually query Bing-Chat and Bard to get their responses.
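One way to picture the benchmark layout described above is as a list of records, each tying a question to its image, its curated description, and one of the three categories. The sketch below is illustrative only: the class names, field names, and toy records are assumptions made for this example and do not reflect the released file format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The three question categories named in the text."""
    CONVERSATION = "conversation"  # simple QA
    DETAIL = "detailed description"
    REASONING = "complex reasoning"


@dataclass(frozen=True)
class BenchItem:
    """One benchmark question (hypothetical schema, not the released format)."""
    image: str        # hypothetical image file name
    description: str  # manually curated, highly detailed image description
    question: str
    category: Category


def count_by_category(items):
    """Return how many questions fall into each category."""
    return Counter(item.category for item in items)


# Toy records for illustration only; they are not taken from the benchmark.
items = [
    BenchItem("sketch_01.jpg", "placeholder description", "What is drawn here?", Category.CONVERSATION),
    BenchItem("sketch_01.jpg", "placeholder description", "Describe the sketch in detail.", Category.DETAIL),
    BenchItem("meme_01.jpg", "placeholder description", "Why might this be funny?", Category.REASONING),
    BenchItem("meme_01.jpg", "placeholder description", "What text appears in the image?", Category.CONVERSATION),
]

counts = count_by_category(items)
for category in Category:
    print(f"{category.value}: {counts[category]}")
```

Several questions can share one image, which is how 24 images can yield 60 questions.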

Results

Each candidate answer is scored against a reference answer generated by text-only GPT-4. The reference is produced by feeding GPT-4 the question together with the ground-truth image annotations as context. A text-only GPT-4 evaluator then rates both answers; in the evaluation query, the reference answer comes first, followed by the answer generated by the candidate model. To obtain the Bard and Bing-Chat results, we upload the images at their original resolution.
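The text above says that GPT-4 rates both the reference and the candidate answer, but it does not spell out how the two ratings become the numbers in the tables. The sketch below shows one plausible aggregation, under the assumption that each reported score is the candidate's total rating expressed as a percentage of the reference's total rating. The function name `relative_scores`, the input layout, and the toy ratings are all hypothetical.

```python
from collections import defaultdict


def relative_scores(reviews):
    """Aggregate evaluator ratings into per-category and overall relative scores.

    `reviews` is an iterable of (category, reference_score, candidate_score).
    Assumption (not stated in the text): the relative score is
    100 * sum(candidate ratings) / sum(reference ratings).
    """
    ref_totals = defaultdict(float)
    cand_totals = defaultdict(float)
    for category, ref_score, cand_score in reviews:
        # Every review counts toward its own category and toward the overall pool.
        for key in (category, "overall"):
            ref_totals[key] += ref_score
            cand_totals[key] += cand_score
    return {
        key: round(100.0 * cand_totals[key] / ref_totals[key], 1)
        for key in ref_totals
    }


# Toy ratings for illustration only; these are not benchmark results.
reviews = [
    ("conversation", 8.0, 6.0),
    ("conversation", 10.0, 9.0),
    ("detail", 9.0, 4.5),
    ("reasoning", 8.0, 8.0),
]

scores = relative_scores(reviews)
print(scores)
```

Under this assumption the overall score is pooled over all questions, so it is weighted by the number of questions in each category and need not equal the plain average of the three category columns.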

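| Approach | Conversation | Detail | Reasoning | Overall |
|----------|--------------|--------|-----------|---------|
| Bard-0718 | 83.7 | 69.7 | 78.7 | 77.8 |
| Bing-Chat-0629 | 59.6 | 52.2 | 90.1 | 71.5 |
| LLaVA-13B-v1-336px-0719 (beam=1) | 64.3 | 55.9 | 81.7 | 70.1 |
| LLaVA-13B-v1-336px-0719 (beam=5) | 68.4 | 59.9 | 84.3 | 73.5 |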

Note that Bard sometimes refuses to answer questions about images containing humans, and Bing-Chat blurs human faces in the images. We therefore also report the benchmark scores on the subset of images without humans.

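| Approach | Conversation | Detail | Reasoning | Overall |
|----------|--------------|--------|-----------|---------|
| Bard-0718 | 94.9 | 74.3 | 84.3 | 84.6 |
| Bing-Chat-0629 | 55.8 | 53.6 | 93.5 | 72.6 |
| LLaVA-13B-v1-336px-0719 (beam=1) | 62.2 | 56.4 | 82.2 | 70.0 |
| LLaVA-13B-v1-336px-0719 (beam=5) | 65.6 | 61.7 | 85.0 | 73.6 |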