Update my_model/tabs/model_arch.py
my_model/tabs/model_arch.py CHANGED (+44 -8)
@@ -24,13 +24,49 @@ def run_model_arch() -> None:
         components.html(model_arch_html, height=1400)
     with col2:
         st.markdown("#### Abstract")
-        st.
-
-
-
-
-
+        st.markdown("""
+        <div style="text-align: justify;">
+        Navigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge
+        the gap between visual perception and linguistic interpretation, a foundational challenge in artificial
+        intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the
+        pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks.
+        This research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), examining the influence
+        of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have
+        transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle
+        complex tasks, thereby enhancing KB-VQA systems.
+        An examination of existing Knowledge-Based Visual Question Answering (KB-VQA) methodologies led to a refined
+        approach that converts visual content into the linguistic domain, creating detailed captions and object
+        enumerations. This process leverages the implicit knowledge and inferential capabilities of PT-LLMs. The
+        research refines the fine-tuning of PT-LLMs by integrating specialized tokens, enhancing the models’ ability
+        to interpret visual contexts. The research also reviews current image representation techniques and knowledge
+        sources, advocating for the utilization of implicit knowledge in PT-LLMs, especially for tasks that do not
+        require specialized expertise.
+        Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model
+        performance, with a particular focus on the importance of image descriptions generated during the captioning
+        phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus,
+        and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment
+        with practical application needs.
+        The evaluation results underscore the developed model’s competent and competitive performance. It achieves a
+        VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further,
+        semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%,
+        respectively. These results demonstrate that the model effectively applies reasoning over the visual context
+        and successfully retrieves the necessary knowledge to answer visual questions.
+        </div>
+        """, unsafe_allow_html=True)
 
         st.markdown("#### Design")
-        st.
-
+        st.markdown("""
+        <div style="text-align: justify;">
+        As illustrated in the architecture, the model operates through a sequential pipeline, beginning with the Image to
+        Language Transformation Module. In this module, the image undergoes simultaneous processing via frozen image
+        captioning and object detection models, aiming to comprehensively capture the visual context and cues. These models,
+        selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more
+        advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological
+        advancement.
+        Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects,
+        along with their bounding boxes and confidence levels, merging these elements with the question at hand utilizing
+        a meticulously crafted prompting template. The pipeline ends with a Fine-tuned Pre-Trained Large Language Model
+        (PT-LLM), which is responsible for performing reasoning and deriving the required knowledge to formulate an
+        informed response to the question.
+        </div>
+        """, unsafe_allow_html=True)
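
The Prompt Engineering step described in the Design text (caption plus detected objects with bounding boxes and confidence levels, merged with the question) could look roughly like the sketch below. This is only an illustration: `DetectedObject`, `build_prompt`, and the template wording are hypothetical names and choices, not taken from the repository, and the specialized tokens mentioned in the abstract are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """One object-detector output (hypothetical structure for illustration)."""
    label: str
    box: Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates
    confidence: float


def build_prompt(caption: str, objects: List[DetectedObject], question: str) -> str:
    """Merge caption, detected objects, and question into a single text prompt."""
    object_lines = [
        f"- {obj.label} at {obj.box} (confidence {obj.confidence:.2f})"
        for obj in objects
    ]
    # Fall back to an explicit marker when the detector returns nothing.
    objects_text = "\n".join(object_lines) if object_lines else "- none detected"
    return (
        f"Caption: {caption}\n"
        f"Objects:\n{objects_text}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example inputs are made up; in the described pipeline they would come from
# the frozen captioning and object detection models.
example_prompt = build_prompt(
    caption="a man riding a wave on a surfboard",
    objects=[
        DetectedObject("person", (120, 40, 260, 310), 0.97),
        DetectedObject("surfboard", (100, 280, 300, 340), 0.91),
    ],
    question="What sport is this?",
)
print(example_prompt)
```

The resulting string would then be passed to the fine-tuned language model; keeping the builder independent of any particular captioner or detector matches the "pluggable" design goal stated in the text.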
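
For the two syntactic metrics quoted in the Abstract text, a minimal sketch follows. It assumes the commonly used simplified VQA accuracy (an answer given by at least three annotators earns full credit) and assumes Exact Match means agreement with at least one annotated answer; the commit does not define either metric, so both definitions and the `normalize` helper are assumptions.

```python
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed minimal normalization)."""
    return " ".join(answer.lower().split())


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    pred = normalize(prediction)
    matches = sum(1 for answer in human_answers if normalize(answer) == pred)
    return min(matches / 3.0, 1.0)


def exact_match(prediction: str, human_answers: List[str]) -> float:
    """Assumed EM: 1.0 if the prediction equals any annotated answer, else 0.0."""
    pred = normalize(prediction)
    return 1.0 if any(normalize(answer) == pred for answer in human_answers) else 0.0


# Made-up annotations for a single question.
answers = ["surfing", "surfing", "surf", "water sport", "surf"]
print(vqa_accuracy("Surfing", answers))
print(exact_match("Surfing", answers))
```

Under these assumed definitions, a prediction matched by only one or two annotators gets partial VQA credit but full EM credit, which would be consistent with the EM figure being higher than the VQA score.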