# Multi-Model Object Detection Demo

This repository provides a demo application that uses multiple state-of-the-art vision-language models for tasks such as object detection, image captioning, visual question answering, and image-text matching. The demo is built with Gradio for the user interface and relies on Hugging Face's `transformers` library to load and run the pre-trained models.

## Available Models

The following models are available in the demo:

- **Qwen2-VL (7B, 2B, 5B, 1B):** Vision-language models optimized for object detection, visual question answering, and image description.
- **BLIP:** Specialized in image captioning and visual question answering.
- **CLIP:** Uses contrastive learning for image-text matching.

## Usage

To use the demo:

1. **Input an Image:** Upload the image you want to analyze.
2. **Select a Model:** Choose a model from the dropdown list for the desired task.
3. **Provide a System Prompt:** Optionally, enter a system prompt to guide the model's behavior.
4. **Enter a User Prompt:** Describe the object to detect or the task you want the model to perform.
5. **Submit:** Click the "Submit" button to run the model and display the results.

## Getting Started

### Example Inputs

The demo ships with several pre-configured examples to try:

- **Image 1:** Detect goats in an image.
- **Image 2:** Find a blue button in the image.
- **Image 3:** Describe a person on a bike.
- **Image 4:** Solve questions from a screenshot.
- **Image 5:** Describe various images such as landscapes, animals, or objects.

## Available Functions

- `run_example`: Core function that processes the input image and prompts, runs the selected model, and returns the results.
- `image_to_base64`: Converts an image to a base64-encoded string for model processing.
- `draw_bounding_boxes`: Draws bounding boxes around detected objects in the image.
- `rescale_bounding_boxes`: Rescales bounding boxes to the original image dimensions.

Sketches of these helpers and of the Gradio wiring follow below.
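Since the helper implementations are not shown in this README, here is a minimal sketch of how `image_to_base64`, `rescale_bounding_boxes`, and `draw_bounding_boxes` could look. The 0 to 1000 coordinate grid is an assumption based on Qwen2-VL's usual grounding convention, not something confirmed by this repository, and the exact signatures may differ from the actual code.

```python
import base64
import io

from PIL import Image, ImageDraw


def image_to_base64(image: Image.Image) -> str:
    """Encode a PIL image as a base64 string for model input."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def rescale_bounding_boxes(boxes, image_width, image_height, model_grid=1000):
    """Map boxes from the model's normalized grid back to pixel coordinates.

    Assumes the model reports (x_min, y_min, x_max, y_max) on a 0..model_grid
    scale, as Qwen2-VL commonly does; adjust if the demo uses another convention.
    """
    scale_x = image_width / model_grid
    scale_y = image_height / model_grid
    return [
        (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)
        for x0, y0, x1, y1 in boxes
    ]


def draw_bounding_boxes(image: Image.Image, boxes, color="red", width=3) -> Image.Image:
    """Draw rectangles on a copy of the image and return the annotated copy."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box in boxes:
        draw.rectangle(box, outline=color, width=width)
    return annotated
```

Rescaling is needed because vision-language models typically see a resized version of the input, so the coordinates they emit must be mapped back onto the original image before drawing.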
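Similarly, here is a hedged sketch of how `run_example` might be wired into the Gradio interface described under Usage; the module name `app`, the argument order, the model list, and the output component are illustrative assumptions rather than the demo's actual layout.

```python
import gradio as gr

from app import run_example  # hypothetical module name; run_example is listed above

# Assumed signature: run_example(image, model_name, system_prompt, user_prompt)
demo = gr.Interface(
    fn=run_example,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Dropdown(
            ["Qwen2-VL-7B", "Qwen2-VL-2B", "BLIP", "CLIP"],  # illustrative subset
            label="Model",
        ),
        gr.Textbox(label="System Prompt (optional)"),
        gr.Textbox(label="User Prompt"),
    ],
    outputs=gr.Image(label="Annotated Image"),  # assumed: the image with boxes drawn
)

if __name__ == "__main__":
    demo.launch()
```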