BERTopic_Multimodal

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This model was trained on the 8,091 images of the Flickr 8k dataset, without their captions. It demonstrates how BERTopic can perform topic modeling with images as the only input.

Usage

To use this model, please install BERTopic:

pip install -U bertopic[vision]
pip install -U safetensors

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_Multimodal")

topic_model.get_topic_info()

You can view all information about a topic as follows:

topic_model.get_topic(topic_id, full=True)
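As a hedged illustration, the helper below wraps `get_topic` to return only the leading keywords of a topic. The names `summarize_topic` and `n_words` are hypothetical and introduced here for the example; the sketch assumes that `get_topic` returns a list of `(word, score)` pairs, or `False` when the topic ID does not exist.

```python
def summarize_topic(topic_model, topic_id, n_words=5):
    """Return the top keywords of one topic, or None if the topic is unknown."""
    representation = topic_model.get_topic(topic_id)
    if not representation:
        # get_topic is assumed to return False for an unknown topic ID
        return None
    # Keep the words and drop their scores
    return [word for word, _score in representation[:n_words]]
```

With the model loaded as above, `summarize_topic(topic_model, 4)` would return the leading keywords of topic 4.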

Topic overview

Number of topics: 29
Number of training documents: 8091

Topic ID	Topic Keywords	Topic Frequency	Label
-1	while - air - the - in - jumping	34	-1_while_air_the_in
0	bench - sitting - people - woman - street	1132	0_bench_sitting_people_woman
1	grass - running - dog - grassy - field	1693	1_grass_running_dog_grassy
2	boy - girl - little - young - holding	1290	2_boy_girl_little_young
3	dog - frisbee - running - water - mouth	1224	3_dog_frisbee_running_water
4	skateboard - ramp - doing - trick - cement	415	4_skateboard_ramp_doing_trick
5	snow - dog - covered - running - through	309	5_snow_dog_covered_running
6	mountain - range - slope - standing - person	205	6_mountain_range_slope_standing
7	pool - blue - boy - toy - water	189	7_pool_blue_boy_toy
8	trail - bike - down - riding - person	166	8_trail_bike_down_riding
9	snowboarder - mid - jump - air - after	126	9_snowboarder_mid_jump_air
10	rock - climbing - up - wall - tree	124	10_rock_climbing_up_wall
11	wave - surfboard - top - riding - of	112	11_wave_surfboard_top_riding
12	beach - surfboard - people - with - walking	102	12_beach_surfboard_people_with
13	jumping - track - horse - racquet - dog	98	13_jumping_track_horse_racquet
14	snowboard - snow - girl - hill - slope	95	14_snowboard_snow_girl_hill
15	game - being - football - played - professional	91	15_game_being_football_played
16	soccer - kicking - team - ball - player	80	16_soccer_kicking_team_ball
17	dirt - bike - person - rider - going	75	17_dirt_bike_person_rider
18	soccer - boys - field - ball - kicking	69	18_soccer_boys_field_ball
19	baseball - player - bat - swinging - into	63	19_baseball_player_bat_swinging
20	basketball - up - and - playing - jumping	59	20_basketball_up_and_playing
21	bird - body - flying - over - long	55	21_bird_body_flying_over
22	motorcycle - track - race - racer - racing	55	22_motorcycle_track_race_racer
23	boat - sitting - water - lake - hose	53	23_boat_sitting_water_lake
24	street - riding - down - bike - woman	52	24_street_riding_down_bike
25	paddle - suit - paddling - water - in	49	25_paddle_suit_paddling_water
26	pair - scissors - stage - white - shirt	42	26_pair_scissors_stage_white
27	tennis - court - racket - racquet - swinging	34	27_tennis_court_racket_racquet
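The table can be sanity-checked with a few lines of plain Python. The sketch below copies the frequency column by hand (the variable names are illustrative) and confirms that the counts add up to the number of training documents. Topic -1 is the outlier topic, which holds the images that were not assigned to any cluster.

```python
# Topic frequencies copied from the overview table, ordered from topic -1 to topic 27
frequencies = [
    34, 1132, 1693, 1290, 1224, 415, 309, 205, 189, 166,
    126, 124, 112, 102, 98, 95, 91, 80, 75, 69,
    63, 59, 55, 55, 53, 52, 49, 42, 34,
]
topic_ids = list(range(-1, len(frequencies) - 1))
counts = dict(zip(topic_ids, frequencies))

total_documents = sum(counts.values())
outlier_share = counts[-1] / total_documents  # share of images left unassigned
largest_topic = max(counts, key=counts.get)

print(f"topics: {len(counts)}, documents: {total_documents}")
print(f"outlier share: {outlier_share:.2%}")
print(f"largest topic: {largest_topic} ({counts[largest_topic]} images)")
```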

Training Procedure

The data was retrieved as follows:

import os
import glob
import zipfile
import numpy as np
import pandas as pd
from tqdm import tqdm
from sentence_transformers import util

# Flickr 8k images
img_folder = 'photos/'
caps_folder = 'captions/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    # Download the image and caption archives if they are not already present
    if not os.path.exists('Flickr8k_Dataset.zip'):
        util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip', 'Flickr8k_Dataset.zip')
        util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip', 'Flickr8k_text.zip')

    # Extract the images into img_folder and the captions into caps_folder
    for folder, file in [(img_folder, 'Flickr8k_Dataset.zip'), (caps_folder, 'Flickr8k_text.zip')]:
        with zipfile.ZipFile(file, 'r') as zf:
            for member in tqdm(zf.infolist(), desc='Extracting'):
                zf.extract(member, folder)

# Collect the image paths (the archive's top-level folder is spelled 'Flicker8k_Dataset')
images = list(glob.glob('photos/Flicker8k_Dataset/*.jpg'))
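Because `glob.glob` silently returns an empty list when the folder name or extension does not match, it can be worth checking the result before training. The helper below is a minimal sketch of such a check; `collect_images` is a hypothetical name that is not part of the original pipeline, and sorting the paths is an added choice that makes the input order reproducible.

```python
import glob
import os


def collect_images(folder, pattern="*.jpg"):
    """Return the sorted image paths in `folder`, failing loudly if none are found."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} found in {folder!r}")
    return paths
```

For the dataset above, `images = collect_images('photos/Flicker8k_Dataset')` would be the equivalent call.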

Then, to perform topic modeling on multimodal data with BERTopic:

from bertopic import BERTopic
from bertopic.backend import MultiModalBackend
from bertopic.representation import VisualRepresentation, KeyBERTInspired

# Image embedding model
embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)

# Image to text representation model
representation_model = {
    "Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning", image_squares=True),
    "KeyBERT": KeyBERTInspired()
}

# Train our model with images only
topic_model = BERTopic(representation_model=representation_model, verbose=True, embedding_model=embedding_model, min_topic_size=30)
topics, probs = topic_model.fit_transform(documents=None, images=images)

The code above takes only images as input. The images are clustered, and a small subset of representative images is extracted from each cluster. These representative images are captioned with "nlpconnect/vit-gpt2-image-captioning" to produce a small textual dataset, over which c-TF-IDF and the additional KeyBERTInspired representation model are run.
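To make the c-TF-IDF step more concrete, here is a simplified pure-Python sketch of class-based TF-IDF applied to captions grouped by topic. It uses the weighting W(t, c) = tf(t, c) * log(1 + A / f(t)), where A is the average number of words per topic and f(t) is the frequency of term t across all topics. The captions are invented for illustration, the function names are hypothetical, and the whitespace tokenization is far cruder than what BERTopic does internally.

```python
import math
from collections import Counter


def c_tf_idf(captions_per_topic):
    """Return {topic: {term: weight}} with weight = tf * log(1 + A / f)."""
    # All captions of a topic are merged into one "class document"
    tf = {
        topic: Counter(" ".join(captions).lower().split())
        for topic, captions in captions_per_topic.items()
    }
    # f: frequency of each term across all topics
    term_totals = Counter()
    for term_counts in tf.values():
        term_totals.update(term_counts)
    # A: average number of words per topic
    avg_words = sum(term_totals.values()) / len(tf)
    return {
        topic: {
            term: count * math.log(1 + avg_words / term_totals[term])
            for term, count in term_counts.items()
        }
        for topic, term_counts in tf.items()
    }


def top_terms(weights, n=3):
    """Return the n highest-weighted terms of one topic."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


# Hypothetical captions standing in for the generated image captions
captions = {
    0: ["a dog running through the grass", "a dog catching a frisbee"],
    1: ["a man riding a skateboard", "a skateboard on a ramp"],
    2: ["a surfer riding a wave", "a wave at the beach"],
}
weights = c_tf_idf(captions)
for topic in captions:
    print(topic, top_terms(weights[topic], n=1))
```

Terms that occur in every topic, such as "a", receive a small logarithmic factor, so the terms that are frequent within one topic but rare elsewhere rise to the top.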

Training hyperparameters

calculate_probabilities: False
language: None
low_memory: False
min_topic_size: 30
n_gram_range: (1, 1)
nr_topics: None
seed_topic_list: None
top_n_words: 10
verbose: True
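For reference, the values above can be collected into keyword arguments, as sketched below. This assumes each listed name maps directly to a `BERTopic` constructor argument of the same name; `build_model` is a hypothetical helper, and the embedding and representation models from the training code would still need to be passed in separately.

```python
# Hyperparameters copied from the list above
hyperparameters = {
    "calculate_probabilities": False,
    "language": None,
    "low_memory": False,
    "min_topic_size": 30,
    "n_gram_range": (1, 1),
    "nr_topics": None,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": True,
}


def build_model(**extra_kwargs):
    """Create a BERTopic instance from the hyperparameters plus any extra arguments."""
    # Imported lazily so the dictionary can be inspected without BERTopic installed
    from bertopic import BERTopic

    return BERTopic(**hyperparameters, **extra_kwargs)
```

A call such as `build_model(embedding_model=embedding_model, representation_model=representation_model)` would then mirror the training configuration.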

Framework versions

Numpy: 1.23.5
HDBSCAN: 0.8.29
UMAP: 0.5.3
Pandas: 1.5.3
Scikit-Learn: 1.2.2
Sentence-transformers: 2.2.2
Transformers: 4.29.2
Numba: 0.56.4
Plotly: 5.14.1
Python: 3.10.10
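To compare a local environment against these versions, a small standard-library check can help. The sketch below uses `importlib.metadata`; the PyPI distribution names in `expected` are an assumption of this example, and `version_report` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by the assumed PyPI distribution name
expected = {
    "numpy": "1.23.5",
    "hdbscan": "0.8.29",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.29.2",
    "numba": "0.56.4",
    "plotly": "5.14.1",
}


def version_report(expected_versions):
    """Map each package to (expected, installed, matches); installed is None if missing."""
    report = {}
    for package, wanted in expected_versions.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, matches) in version_report(expected).items():
    status = "ok" if matches else "differs"
    print(f"{package}: expected {wanted}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the report only shows where the environment differs from the one used for training.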