PicAnswer / README.md
metadata
license: apache-2.0
datasets:
  - HuggingFaceM4/VQAv2
  - Phando/vqa_v2
  - lmms-lab/VQAv2-FewShot
  - pminervini/VQAv2
  - lmms-lab/VQAv2
  - nlphuji/flickr30k
  - damerajee/VQA-COCO-HI
language:
  - en
metrics:
  - accuracy
  - f1
  - precision
  - recall
pipeline_tag: visual-question-answering
library_name: transformers
tags:
  - VQA
  - Visual Question Answering
  - Image Captioning
  - AI for Accessibility
  - Transformer Models
  - Vision Transformer (ViT)
base_model:
  - openai/clip-vit-large-patch14
  - google-bert/bert-base-uncased

PicAnswer

PicAnswer is a Visual Question Answering (VQA) model that answers natural-language questions about the content of images. It combines computer vision and natural language processing to generate answers grounded in the provided image.

The model is intended for applications such as automated image captioning, accessibility tools, and interactive AI systems.

Features

  • Transformer-based VQA: Uses a Vision Transformer (ViT) to process images and BERT to process questions.
  • Multimodal Learning: Combines visual and textual features so that answers depend on both the image and the question.
  • Trained on Multiple Datasets: Trained on several VQAv2 variants, Flickr30k, and VQA-COCO-HI (see Datasets).
  • Optimized for Accessibility: Can assist visually impaired users by describing images and answering questions about them.
  • Evaluation: Evaluated with accuracy, F1 score, precision, and recall.
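The card names ViT and BERT as the backbones, and its metadata lists `openai/clip-vit-large-patch14` and `google-bert/bert-base-uncased` as base models. The following is a minimal sketch of how an image-question pair could be encoded with those two checkpoints through `transformers`. It does not load PicAnswer itself, since the card gives no repository ID or loading code. The helper names `load_encoders` and `encode_pair`, and the choice to fuse the two pooled vectors by simple concatenation, are assumptions made for illustration.

```python
# Base-model checkpoints taken from the card's metadata.
BASE_MODELS = {
    "vision": "openai/clip-vit-large-patch14",
    "text": "google-bert/bert-base-uncased",
}


def load_encoders():
    """Load the two base encoders (hypothetical helper, not PicAnswer's API)."""
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies or a network connection.
    from transformers import (
        AutoTokenizer,
        BertModel,
        CLIPImageProcessor,
        CLIPVisionModel,
    )

    processor = CLIPImageProcessor.from_pretrained(BASE_MODELS["vision"])
    vision = CLIPVisionModel.from_pretrained(BASE_MODELS["vision"]).eval()
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODELS["text"])
    text = BertModel.from_pretrained(BASE_MODELS["text"]).eval()
    return processor, vision, tokenizer, text


def encode_pair(image, question, encoders):
    """Return one fused feature vector for a PIL image and a question string.

    Concatenation is an assumed fusion strategy; the card does not say
    how PicAnswer combines the two modalities.
    """
    import torch

    processor, vision, tokenizer, text = encoders
    pixels = processor(images=image, return_tensors="pt")
    tokens = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        image_vec = vision(**pixels).pooler_output  # pooled image features
        text_vec = text(**tokens).pooler_output     # pooled question features
    return torch.cat([image_vec, text_vec], dim=-1)
```

A call might look like `encode_pair(Image.open("photo.jpg").convert("RGB"), "What color is the car?", load_encoders())`, where `photo.jpg` is a placeholder path and `Image` comes from Pillow. An answer classifier or decoder would then sit on top of the fused vector; the card does not describe which one PicAnswer uses.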

Model Overview

  • Model Name: PicAnswer
  • Model Type: Visual Question Answering (VQA)
  • Base Models:
    • Vision Transformer (ViT): openai/clip-vit-large-patch14
    • BERT: google-bert/bert-base-uncased
  • Pretrained: Yes
  • License: Apache-2.0
  • Supported Languages: English
  • Metrics:
    • Accuracy
    • F1 Score
    • Precision
    • Recall
  • Pipeline Tag: Visual Question Answering
  • Library: transformers
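The four listed metrics can be computed directly from predicted and reference answers when VQA is treated as classification over answer strings. The sketch below uses only the standard library and macro-averages precision, recall, and F1 across answer labels. The card does not state which averaging scheme was used, so macro averaging and the function name `answer_metrics` are assumptions.

```python
from collections import Counter


def answer_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over answers."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    def ratio(num, den):
        # Define 0/0 as 0.0, e.g. for labels that were never predicted.
        return num / den if den else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    precisions = [ratio(tp[l], tp[l] + fp[l]) for l in labels]
    recalls = [ratio(tp[l], tp[l] + fn[l]) for l in labels]
    f1s = [ratio(2 * p * r, p + r) for p, r in zip(precisions, recalls)]

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


scores = answer_metrics(
    ["yes", "no", "yes", "2"],   # reference answers
    ["yes", "yes", "yes", "2"],  # model predictions
)
print(scores)
```

Macro averaging weights rare answers as heavily as frequent ones such as "yes" and "no"; a micro or weighted average would give different numbers on the same predictions.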

Datasets

PicAnswer is trained on the datasets below to cover different types of image-question pairs. Most are copies or variants of VQAv2; nlphuji/flickr30k is an image-captioning dataset rather than a VQA dataset.

  • HuggingFaceM4/VQAv2
  • Phando/vqa_v2
  • lmms-lab/VQAv2-FewShot
  • pminervini/VQAv2
  • lmms-lab/VQAv2
  • nlphuji/flickr30k
  • damerajee/VQA-COCO-HI
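VQAv2 provides ten human answers per question, and accuracy on it is usually reported with a soft score rather than exact match: a prediction earns full credit when at least three annotators gave the same answer. The sketch below implements the commonly quoted simplified form, min(matches / 3, 1). The card does not say which accuracy variant PicAnswer reports, so treating this as the relevant one is an assumption; the official evaluation script additionally normalizes punctuation, articles, and number words and averages over annotator subsets, which this sketch omits. The function name `vqa_accuracy` is illustrative.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA soft accuracy: min(#matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Ten annotator answers for one question (made-up illustrative data).
answers = ["yes"] * 2 + ["no"] * 8

partial = vqa_accuracy("Yes", answers)   # two annotators agree
full = vqa_accuracy("no", answers)       # eight annotators agree
miss = vqa_accuracy("maybe", answers)    # nobody agrees
```

Here `partial` is 2/3, `full` is 1.0, and `miss` is 0.0; a dataset-level score is the mean of the per-question values.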