import streamlit as st
import streamlit.components.v1 as components


def run_model_arch() -> None:
    """
    Displays the model architecture and the accompanying abstract and design details for the
    Knowledge-Based Visual Question Answering (KB-VQA) model.

    This function reads an HTML file containing the model architecture and renders it in a Streamlit
    application. It also presents the research abstract and the design description of the KB-VQA model.

    Returns:
        None
    """

    # Read the model architecture HTML file
    with open("Files/Model Arch.html", 'r', encoding='utf-8') as f:
        model_arch_html = f.read()

    col1, col2 = st.columns(2)

    with col1:
        st.markdown("#### Model Architecture")
        # Render the architecture diagram in the first column
        components.html(model_arch_html, height=1400)

    with col2:
        st.markdown("#### Abstract")
        st.markdown("""
        <div style="text-align: justify;">
        Navigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge
        the gap between visual perception and linguistic interpretation, a foundational challenge in artificial
        intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the
        pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks.
        This research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), examining the influence
        of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have
        transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle
        complex tasks, thereby enhancing KB-VQA systems.
        An examination of existing KB-VQA methodologies led to a refined approach that converts visual content into
        the linguistic domain, creating detailed captions and object enumerations. This process leverages the implicit
        knowledge and inferential capabilities of PT-LLMs. The research refines the fine-tuning of PT-LLMs by
        integrating specialized tokens, enhancing the models’ ability to interpret visual contexts. The research also
        reviews current image representation techniques and knowledge sources, advocating for the utilization of
        implicit knowledge in PT-LLMs, especially for tasks that do not require specialized expertise.
        Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model
        performance, with a particular focus on the importance of image descriptions generated during the captioning
        phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus,
        and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment
        with practical application needs.
        The evaluation results underscore the developed model’s competent and competitive performance. It achieves a
        VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further,
        semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%,
        respectively. These results demonstrate that the model effectively applies reasoning over the visual context
        and successfully retrieves the necessary knowledge to answer visual questions.
        </div>
        """, unsafe_allow_html=True)

        st.markdown("<br>" * 2, unsafe_allow_html=True)
st.markdown("#### Design") | |
st.markdown(""" | |
<div style="text-align: justify;"> | |
As illustrated in architecture, the model operates through a sequential pipeline, beginning with the Image to | |
Language Transformation Module. In this module, the image undergoes simultaneous processing via image captioning | |
and object detection frozen models, aiming to comprehensively capture the visual context and cues. These models, | |
selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more | |
advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological | |
advancement. | |
Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects, | |
along with their bounding boxes and confidence levels, merging these elements with the question at hand utilizing | |
a meticulously crafted prompting template. The pipeline ends with a Fine-tuned Pre-Trained Large Language Model | |
(PT-LLMs), which is responsible for performing reasoning and deriving the required knowledge to formulate an | |
informed response to the question. | |
</div> | |
""", unsafe_allow_html=True) | |