unography committed on
Commit
b9aecc5
1 Parent(s): fc18708

Update README.md

Files changed (1): README.md (+13 −11)
README.md CHANGED

@@ -3,7 +3,7 @@ license: bsd-3-clause
 tags:
 - image-captioning
 datasets:
-- unography/laion-14k-GPT4V-LIVIS-Captions
+- unography/laion-81k-GPT4V-LIVIS-Captions
 pipeline_tag: image-to-text
 languages:
 - en
@@ -16,10 +16,12 @@ widget:
   example_title: Airport
 inference:
   parameters:
-    max_length: 300
+    max_length: 250
+    num_beams: 3
+    repetition_penalty: 2.5
 ---
 
-# LongCap: Finetuned [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-large) for generating long captions of images, suitable for prompts for text-to-image generation and captioning text-to-image datasets
+# LongCap: Finetuned [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base) for generating long captions of images, suitable for prompts for text-to-image generation and captioning text-to-image datasets
 
 
 ## Usage
@@ -38,17 +40,17 @@ import requests
 from PIL import Image
 from transformers import BlipProcessor, BlipForConditionalGeneration
 
-processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
-model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap")
+processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
+model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")
 
 img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
 raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
 
 inputs = processor(raw_image, return_tensors="pt")
 pixel_values = inputs.pixel_values
-out = model.generate(pixel_values=pixel_values, max_length=250)
+out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
 print(processor.decode(out[0], skip_special_tokens=True))
->>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
+>>> a woman sitting on a sandy beach, interacting with a dog wearing a blue and white checkered shirt. the background is an ocean or sea with waves crashing in the distance. there are no other animals or people visible in the image.
 
 ```
 </details>
@@ -73,9 +75,9 @@ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
 
 inputs = processor(raw_image, return_tensors="pt").to("cuda")
 pixel_values = inputs.pixel_values
-out = model.generate(pixel_values=pixel_values, max_length=250)
+out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
 print(processor.decode(out[0], skip_special_tokens=True))
->>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
+>>> a woman sitting on a sandy beach, interacting with a dog wearing a blue and white checkered shirt. the background is an ocean or sea with waves crashing in the distance. there are no other animals or people visible in the image.
 ```
 </details>
 
@@ -98,8 +100,8 @@ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
 
 inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
 pixel_values = inputs.pixel_values
-out = model.generate(pixel_values=pixel_values, max_length=250)
+out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
 print(processor.decode(out[0], skip_special_tokens=True))
->>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
+>>> a woman sitting on a sandy beach, interacting with a dog wearing a blue and white checkered shirt. the background is an ocean or sea with waves crashing in the distance. there are no other animals or people visible in the image.
 ```
 </details>
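
The thread running through every hunk above is that the `inference.parameters` block in the front matter and the `model.generate(...)` calls in the snippets now carry the same values (`max_length=250`, `num_beams=3`, `repetition_penalty=2.5`). A minimal sketch of that correspondence, without downloading the model (the hand-rolled parser below is hypothetical, just to avoid a YAML dependency):

```python
# The `inference.parameters` block from the updated front matter.
front_matter = """\
max_length: 250
num_beams: 3
repetition_penalty: 2.5
"""

# Parse the simple "key: value" lines; floats keep a dot, the rest are ints.
params = {}
for line in front_matter.strip().splitlines():
    key, _, value = line.partition(":")
    value = value.strip()
    params[key.strip()] = float(value) if "." in value else int(value)

print(params)
# → {'max_length': 250, 'num_beams': 3, 'repetition_penalty': 2.5}

# These are exactly the kwargs the diff adds to every generate call:
# out = model.generate(pixel_values=pixel_values, **params)
```

Keeping the widget defaults and the README snippets in sync this way means the hosted inference widget and a local run should produce comparable captions.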