Researchers from Auburn University and the University of Alberta have explored the limitations of Vision Language Models (VLMs) in their recently published paper, "Vision Language Models Are Blind" (arXiv:2407.06581).
Key Findings: VLMs, including GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet, struggle with basic visual tasks such as identifying where lines intersect or counting simple shapes. The authors note, "The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like of a person with myopia seeing fine details as blurry, and at worst, like an intelligent person that is blind making educated guesses" (Vision Language Models Are Blind, 2024).
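To make the task format concrete, here is a minimal sketch of generating a BlindTest-style line-intersection image with a known ground-truth answer. The canvas size, colors, and coordinate choices are illustrative assumptions, not the paper's actual generation code.

```python
# Minimal sketch of a BlindTest-style task: render two 2-segment polylines
# and compute the ground-truth number of intersection points.
# All parameters (canvas size, colors, coordinates) are illustrative.
import random
from PIL import Image, ImageDraw

def segments_cross(p1, p2, p3, p4):
    """Return True if segments p1-p2 and p3-p4 properly intersect."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1, d2 = cross(p3, p4, p1), cross(p3, p4, p2)
    d3, d4 = cross(p1, p2, p3), cross(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def make_task(size=512, seed=0):
    rng = random.Random(seed)
    # Each line is a 2-segment polyline through three evenly spaced x positions.
    xs = [size * 0.1, size * 0.5, size * 0.9]
    line_a = [(x, rng.uniform(size * 0.1, size * 0.9)) for x in xs]
    line_b = [(x, rng.uniform(size * 0.1, size * 0.9)) for x in xs]

    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    draw.line(line_a, fill="red", width=4)
    draw.line(line_b, fill="blue", width=4)

    # Ground truth: count crossings between every segment pair of the two polylines.
    truth = sum(
        segments_cross(line_a[i], line_a[i + 1], line_b[j], line_b[j + 1])
        for i in range(2) for j in range(2)
    )
    return img, truth

img, truth = make_task(seed=42)
img.save("intersection_task.png")
print("ground-truth intersections:", truth)
```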
Human-like Myopia? VLMs may have a blind spot similar to human myopia, making it difficult for them to perceive fine details. This suggests a potential parallel between human and machine vision limitations.
Technical Details: The researchers created a new benchmark, BlindTest, consisting of simple visual tasks that evaluate VLMs' low-level vision capabilities. Assessing the four VLMs on BlindTest revealed many shortcomings in their ability to process basic visual information; a sketch of how such a probe might be scripted follows below.
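As an illustration of how one might probe a VLM on such a task (this is not the authors' evaluation harness), the sketch below assumes the openai Python client and the intersection_task.png image generated earlier; the model name and prompt wording are illustrative.

```python
# Minimal sketch of querying a VLM on a BlindTest-style image.
# Assumes the `openai` Python client and OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative, not the paper's exact protocol.
import base64
from openai import OpenAI

client = OpenAI()

with open("intersection_task.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many times do the red and blue lines intersect? "
                     "Answer with a single number."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)

# Compare the model's answer to the ground truth computed when the image was generated.
print(response.choices[0].message.content)
```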