# Multi-Model Object Detection Demo

This repository provides a demo application that uses multiple state-of-the-art vision-language models for tasks such as object detection, image captioning, visual question answering, and image-text matching. The demo is built with Gradio for the user interface and relies on Hugging Face's `transformers` library to load and run the pre-trained models.

## Available Models

The following models are available in the demo:

- **Qwen2-VL (7B, 2B, 5B, 1B):** Vision-language models optimized for object detection, visual question answering, and image description.
- **BLIP:** Specialized in image captioning and visual question answering.
- **CLIP:** Uses contrastive learning for image-text matching.

## Usage

To use the demo:

1. **Input an Image:** Upload the image you want to analyze.
2. **Select a Model:** Choose a model from the dropdown list for the desired task.
3. **Provide a System Prompt:** Optionally, enter a system prompt to guide the model's behavior.
4. **Enter a User Prompt:** Describe the object to detect or the task you want the model to perform.
5. **Submit:** Click the "Submit" button to run the model and display the results.

## Getting Started

### Example Inputs

The demo ships with several pre-configured examples to try:

- **Image 1:** Detect goats in an image.
- **Image 2:** Find a blue button in the image.
- **Image 3:** Describe a person on a bike.
- **Image 4:** Solve questions from a screenshot.
- **Image 5:** Describe various images such as landscapes, animals, or objects.

## Available Functions

- `run_example`: Core function that processes the input image and prompts, runs the selected model, and returns the results.
- `image_to_base64`: Converts an image to a base64-encoded string for model processing.
- `draw_bounding_boxes`: Draws bounding boxes around detected objects in the image.
- `rescale_bounding_boxes`: Rescales bounding boxes to the original image dimensions.

Sketches of these helpers and of the Gradio wiring follow below.
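Since the helper implementations are not shown in this README, here is a minimal sketch of how `image_to_base64`, `rescale_bounding_boxes`, and `draw_bounding_boxes` could look. The 0 to 1000 coordinate grid is an assumption based on Qwen2-VL's usual grounding convention, not something confirmed by this repository, and the exact signatures may differ from the actual code.

```python
import base64
import io

from PIL import Image, ImageDraw


def image_to_base64(image: Image.Image) -> str:
    """Encode a PIL image as a base64 string for model input."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def rescale_bounding_boxes(boxes, image_width, image_height, model_grid=1000):
    """Map boxes from the model's normalized grid back to pixel coordinates.

    Assumes the model reports (x_min, y_min, x_max, y_max) on a 0..model_grid
    scale, as Qwen2-VL commonly does; adjust if the demo uses another convention.
    """
    scale_x = image_width / model_grid
    scale_y = image_height / model_grid
    return [
        (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)
        for x0, y0, x1, y1 in boxes
    ]


def draw_bounding_boxes(image: Image.Image, boxes, color="red", width=3) -> Image.Image:
    """Draw rectangles on a copy of the image and return the annotated copy."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box in boxes:
        draw.rectangle(box, outline=color, width=width)
    return annotated
```

Rescaling is needed because vision-language models typically see a resized version of the input, so the coordinates they emit must be mapped back onto the original image before drawing.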
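Similarly, here is a hedged sketch of how `run_example` might be wired into the Gradio interface described under Usage; the module name `app`, the argument order, the model list, and the output component are illustrative assumptions rather than the demo's actual layout.

```python
import gradio as gr

from app import run_example  # hypothetical module name; run_example is listed above

# Assumed signature: run_example(image, model_name, system_prompt, user_prompt)
demo = gr.Interface(
    fn=run_example,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Dropdown(
            ["Qwen2-VL-7B", "Qwen2-VL-2B", "BLIP", "CLIP"],  # illustrative subset
            label="Model",
        ),
        gr.Textbox(label="System Prompt (optional)"),
        gr.Textbox(label="User Prompt"),
    ],
    outputs=gr.Image(label="Annotated Image"),  # assumed: the image with boxes drawn
)

if __name__ == "__main__":
    demo.launch()
```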