---
license: bsd-3-clause
tags:
- image-captioning
datasets:
- unography/laion-81k-GPT4V-LIVIS-Captions
pipeline_tag: image-to-text
language:
- en
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
inference:
  parameters:
    max_length: 250
    num_beams: 3
    repetition_penalty: 2.5
---

# LongCap: Finetuned [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base) for generating long captions of images, suitable for prompts for text-to-image generation and captioning text-to-image datasets

## Usage

You can use this model for conditional and unconditional image captioning.

### Using the PyTorch model

#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on a sandy beach, interacting with a dog wearing a blue and white checkered shirt. the background is an ocean or sea with waves crashing in the distance. there are no other animals or people visible in the image.
```
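The example above is unconditional captioning. For conditional captioning, you can additionally pass a text prefix that the generated caption continues. Below is a minimal sketch using the standard BLIP conditional-captioning API from `transformers`; the prefix string is just an illustration, not a required prompt format:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional captioning: the generated caption continues this text prefix
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
```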
#### Running the model on GPU

##### In full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on a sandy beach, interacting with a dog wearing a blue and white checkered shirt. the background is an ocean or sea with waves crashing in the distance. there are no other animals or people visible in the image.
```
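When captioning many images (for example, a whole text-to-image dataset), the processor also accepts a list of images, so you can batch several per forward pass on GPU. A minimal sketch under the same generation settings; the repeated demo image stands in for a real batch of distinct images:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# stand-in for a real batch of distinct images
images = [raw_image, raw_image]
inputs = processor(images=images, return_tensors="pt").to("cuda")

out = model.generate(pixel_values=inputs.pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
for caption in processor.batch_decode(out, skip_special_tokens=True):
    print(caption)
```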
##### In half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on a sandy beach, interacting with a dog wearing a blue and white checkered shirt. the background is an ocean or sea with waves crashing in the distance. there are no other animals or people visible in the image.
```
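For quick experiments, you can also go through the `image-to-text` pipeline instead of instantiating the processor and model yourself. A minimal sketch, passing the same generation settings via `generate_kwargs`:

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="unography/blip-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
# generate_kwargs mirrors the generation settings used in the examples above
print(captioner(img_url, generate_kwargs={"max_length": 250, "num_beams": 3, "repetition_penalty": 2.5}))
```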