---
pipeline_tag: image-to-text
tags:
- image-captioning
languages:
- en
license: bsd-3-clause
widget:
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
datasets:
- unography/laion-14k-GPT4V-LIVIS-Captions
inference:
  parameters:
    max_length: 300
---
# LongCap: Finetuned BLIP for generating long captions of images, suitable for prompts for text-to-image generation and captioning text-to-image datasets
## Usage
You can use this model for both conditional and unconditional image captioning. The snippets below show unconditional captioning; a conditional sketch follows the CPU example.
### Using the PyTorch model
#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and the fine-tuned captioning model
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

# Download a demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image into pixel values
inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values

# Generate a long caption (up to 250 tokens) and decode it to text
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a beach setting with a woman kneeling down and interacting with a dog. the woman is wearing a collar and is standing near the dog. the dog is positioned on the sand, and the atmosphere is calm and relaxing. there are no other people or animals in the image.
```
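For conditional captioning, the base BLIP interface accepts a text prefix alongside the image. The sketch below assumes this fine-tuned checkpoint keeps that interface; the prompt `"a photography of"` is only an illustrative choice, not something prescribed by this model card.

```python
# Conditional captioning: the generated caption continues the given text prefix.
# Reuses the processor, model, and raw_image from the CPU example above.
text = "a photography of"  # illustrative prefix (assumption, not from this card)
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
```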
#### Running the model on GPU
##### In full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
# Move the model to the GPU
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image and move the tensors to the GPU as well
inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a beach setting with a woman kneeling down and interacting with a dog. the woman is wearing a collar and is standing near the dog. the dog is positioned on the sand, and the atmosphere is calm and relaxing. there are no other people or animals in the image.
```
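Because the model is aimed at captioning whole text-to-image datasets, it is often convenient to caption images in batches. A minimal sketch, reusing the GPU-loaded `processor` and `model` from the snippet above; the file paths are placeholders:

```python
# Batch captioning on GPU: the processor resizes each image and stacks them into one tensor.
# "img1.jpg" and "img2.jpg" are placeholder paths for illustration.
images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]

inputs = processor(images=images, return_tensors="pt").to("cuda")
out = model.generate(pixel_values=inputs.pixel_values, max_length=250)

# batch_decode returns one caption string per input image
for caption in processor.batch_decode(out, skip_special_tokens=True):
    print(caption)
```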
##### In half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
# Load the weights in float16 and move the model to the GPU
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Cast the image tensors to float16 so they match the model's dtype
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a beach setting with a woman kneeling down and interacting with a dog. the woman is wearing a collar and is standing near the dog. the dog is positioned on the sand, and the atmosphere is calm and relaxing. there are no other people or animals in the image.
```
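Since the metadata tags this model as `image-to-text`, it can also be run through the high-level `pipeline` API instead of loading the processor and model by hand. A minimal sketch, assuming the default pipeline settings and passing the generation length through `generate_kwargs`:

```python
from transformers import pipeline

# The image-to-text pipeline wraps the processor + model loading shown above
captioner = pipeline("image-to-text", model="unography/blip-long-cap")

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
result = captioner(img_url, generate_kwargs={"max_length": 250})
print(result[0]["generated_text"])
```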