---
language:
  - en
  - de
  - fr
  - it
  - pt
  - hi
  - es
  - th
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - facebook
  - meta
  - pytorch
  - llama
  - llama-3
---

This repository is a pre-release checkpoint for Llama 3.2 11B Vision Instruct.

It contains two versions of the model: one for use with transformers, and one for the original llama3 codebase (stored under the original directory).
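
If you only need one of the two variants, a download filter avoids fetching both. The sketch below uses huggingface_hub's snapshot_download with an ignore_patterns filter; it assumes the original weights live under original/ as described above.

from huggingface_hub import snapshot_download

# Download only the transformers-format weights, skipping the original/ directory
local_dir = snapshot_download(
    "nltpt/Llama-3.2-11B-Vision-Instruct",
    ignore_patterns=["original/*"],
)
print(local_dir)

# Conversely, allow_patterns=["original/*"] fetches only the original llama3 checkpoint.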

Inference with transformers

Please install the in-progress development wheel from https://huggingface.co/nltpt/transformers/tree/main.
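
As a quick sanity check (not an official step), you can confirm that the development build is the one being picked up by verifying that the Mllama classes are importable:

import transformers

# This import only succeeds with the development build linked above
from transformers import MllamaForConditionalGeneration

print(transformers.__version__)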

This is an example inference snippet (API subject to change):

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "nltpt/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and place it automatically on the available device(s)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# A single-turn chat: one image placeholder followed by the text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe image in two sentences"}
        ]
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Fetch the example image
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the prompt and image, then generate greedily
inputs = processor(text=text, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0]))

Output:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>Describe image in two sentences<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The image depicts a serene lake scene, featuring a long wooden dock extending into the calm water, with a dense forest of trees
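
The decoded string above includes the prompt tokens. If you only want the assistant's reply, a small convenience sketch (reusing inputs and output from the snippet above) is to slice off the prompt before decoding:

# Keep only the tokens generated after the prompt
prompt_len = inputs["input_ids"].shape[-1]
reply = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(reply)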

Running the original checkpoints

The installed llama-models package provides three binaries:

  1. example_chat_completion
  2. example_text_completion
  3. multimodal_example_chat_completion

You can invoke them via torchrun as follows:

CHECKPOINT_DIR=~/.llama/checkpoints/Llama-3.2-11B-Vision-Instruct/

torchrun `which multimodal_example_chat_completion` "$CHECKPOINT_DIR"

You can locate and study the source code for these scripts with something like:

PACKAGE_DIR=$(pip show -f llama-models | grep Location | awk '{ print $2 }')

echo "Scripts are in the directory: $PACKAGE_DIR/llama_models/scripts/"
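
Putting the two commands together, something like the following lists the script sources directly (a sketch; the exact subdirectory name may differ between versions of the package):

PACKAGE_DIR=$(pip show -f llama-models | grep Location | awk '{ print $2 }')
ls "$PACKAGE_DIR/llama_models/scripts/"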