Seeking resources to perform multimodal semantic search

#1
by LukaBloomRox - opened

Hello,

I’m Myles, an applied scientist looking to build on LAVIS BLIP-2 to create multimodal embeddings for objects that combine images with structured text derived from training videos in the wheelchair and mobility-device custom-seating sector. BLIP-2 stands out for its larger input context window compared to OpenCLIP or ViLT.

I’m considering BLIP-2 because of its cross-attention between modalities, since I need to perform multimodal semantic search across a shared latent embedding space.
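For anyone following along, here is a minimal sketch of what I have in mind for the embedding step, assuming the LAVIS `blip2_feature_extractor` checkpoint; the frame filename and caption are placeholders from my own use case, and mean-pooling the query tokens is just one reasonable way to get a single vector per (image, text) pair:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("frame_0001.jpg").convert("RGB")  # hypothetical video frame
caption = "Adjusting lateral trunk supports on a custom seat"  # hypothetical structured text

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"](caption)

sample = {"image": image, "text_input": [text]}
features = model.extract_features(sample, mode="multimodal")

# multimodal_embeds has one vector per learned query token; mean-pool them to
# get a single embedding for the (image, text) pair before indexing it.
embedding = features.multimodal_embeds.mean(dim=1).squeeze(0)
print(embedding.shape)  # e.g. torch.Size([768])
```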

I’m considering Pinecone to store and query these embeddings.
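Roughly, the indexing and search side would look like the sketch below, assuming the current `pinecone` Python client; the index name, dimension, and metadata fields are hypothetical placeholders for my data:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

index_name = "seating-training-clips"  # hypothetical index name
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,                 # match the BLIP-2 embedding size above
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)

# Upsert one (image, text) embedding with metadata describing the source clip.
index.upsert(vectors=[{
    "id": "clip-0001-frame-0001",
    "values": embedding.tolist(),      # embedding from the BLIP-2 sketch above
    "metadata": {"video": "lateral_supports.mp4", "timestamp_s": 12.5},
}])

# To search, embed the query (text, image, or both) with the same BLIP-2
# pipeline and look up nearest neighbours; reusing `embedding` here as a stand-in.
query_embedding = embedding
results = index.query(vector=query_embedding.tolist(), top_k=5, include_metadata=True)
for match in results.matches:
    print(match.id, match.score, match.metadata)
```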

I lack the experience and resources to successfully perform fine-tuning at this stage, so I’m looking to the community for building blocks that will let me move forward and create value.

Discussion welcome / encouraged.

This journey is chronicled in:

https://ai.stackexchange.com/questions/40753/how-to-generate-original-training-videos-based-on-existing-videoset
