---
license: apache-2.0
datasets:
- HuggingFaceM4/VQAv2
- Phando/vqa_v2
- lmms-lab/VQAv2-FewShot
- pminervini/VQAv2
- lmms-lab/VQAv2
- nlphuji/flickr30k
- damerajee/VQA-COCO-HI
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: visual-question-answering
library_name: transformers
tags:
- VQA
- Visual Question Answering
- Image Captioning
- AI for Accessibility
- Transformer Models
- Vision Transformer (ViT)
base_model:
- openai/clip-vit-large-patch14
- google-bert/bert-base-uncased
---

# PicAnswer

**PicAnswer** is a Visual Question Answering (VQA) model that answers natural language questions about the content of images. It combines a CLIP Vision Transformer (ViT-L/14) image encoder with a BERT text encoder, grounding each question in visual features to generate an answer from the provided image. The model is intended for applications such as automated image captioning, accessibility tooling, and interactive AI systems.

## Features

- **Transformer-based VQA**: Pairs a Vision Transformer (ViT) image encoder with a BERT question encoder.
- **Multimodal Learning**: Fuses visual and textual representations so answers are conditioned on both the image and the question.
- **Pretrained on Multiple Datasets**: Trained on VQAv2, Flickr30k, and related datasets (see the Datasets section below).
- **Accessibility Use Cases**: Can describe images and answer questions about them, for example for visually impaired users.
- **Evaluation**: Reported with accuracy, F1 score, precision, and recall.

## Model Overview

- **Model Name**: PicAnswer
- **Model Type**: Visual Question Answering (VQA)
- **Base Models**:
  - `openai/clip-vit-large-patch14` (Vision Transformer)
  - `google-bert/bert-base-uncased` (BERT)
- **Pretrained**: Yes
- **License**: Apache-2.0
- **Supported Languages**: English
- **Metrics**:
  - Accuracy
  - F1 Score
  - Precision
  - Recall
- **Pipeline Tag**: `visual-question-answering`
- **Library**: `transformers`

## Datasets

PicAnswer is trained on a variety of VQA datasets to ensure robust performance across different types of image-question pairs:

- HuggingFaceM4/VQAv2
- Phando/vqa_v2
- lmms-lab/VQAv2-FewShot
- pminervini/VQAv2
- lmms-lab/VQAv2
- nlphuji/flickr30k
- damerajee/VQA-COCO-HI
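
## Usage

A minimal sketch of how the model might be queried, assuming the checkpoint is published on the Hub and is compatible with the `transformers` `visual-question-answering` pipeline (as the `pipeline_tag` above suggests). The repository id `your-username/PicAnswer`, the example image URL, and the question are illustrative placeholders, not part of this card.

```python
from transformers import pipeline
from PIL import Image
import requests

# Hypothetical repository id -- replace with the actual PicAnswer checkpoint.
MODEL_ID = "your-username/PicAnswer"

# Build a visual-question-answering pipeline; this assumes the checkpoint
# exposes a VQA head that the transformers pipeline API can load.
vqa = pipeline("visual-question-answering", model=MODEL_ID)

# Load an example image (a local path or an already-open PIL image also works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Ask a natural language question about the image.
result = vqa(image=image, question="How many cats are in the picture?")
print(result)  # e.g. [{"answer": "2", "score": ...}]
```

The pipeline returns candidate answers ranked by score; for accessibility scenarios, the top answer can be passed directly to a screen reader or text-to-speech system.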