ViLaH
ViLaH (Vision Language Hindi) is a model with 3 billion parameters, fine-tuned from the base-model google/paligemma-3b-pt-224 to handle input images and bilingual (Hindi and English) text sequences for both input and output.
Training Details
- Model Configuration: Fine-tuned on a single epoch using a V100 gpu.
- Training Duration: Approximately one day.
- Evaluation Loss: Achieved an eval loss of 1.6384 at the end of the epoch.
- The model is still being train as of right now with better quality dataset
- The model's performance may be compromised due to insufficient data and the fact that it was trained for only one epoch.
Dataset
The model was finetuned on only one dataset
- damerajee/clean_hin_vqa : This dataset was derived from Lin-Chen/ShareGPT4V and filtered to include only images from the COCO dataset. The original dataset was translated and cleaned to ensure high-quality Hindi visual question answering content.
How to Use
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
To Run the model on a single T4 GPU on Float16
from peft import get_peft_model, LoraConfig,prepare_model_for_kbit_training
from transformers import TrainingArguments, Trainer , PaliGemmaForConditionalGeneration , AutoProcessor,BitsAndBytesConfig,AutoTokenizer
from peft import PeftModel, PeftConfig
from datasets import load_dataset
import torch
from datasets import load_dataset
dataset = load_dataset("damerajee/clean_hin_vqa",split='train')
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example['question']
device_index = torch.cuda.current_device()
print("device_index:",device_index)
base_model = PaliGemmaForConditionalGeneration.from_pretrained("BhashaAI/ViLaH",device_map={"": device_index},torch_dtype=torch.float16,low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")
inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k,v in inputs.items():
print(k,v.shape)
MAX_LENGTH = 200
# Autoregressively generate
# We use greedy decoding here, for more fancy methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH,temperature=0.7,repetition_penalty=2.0,do_sample=True)
# Next we turn each predicted token ID back into a string using the decode method
# We chop of the prompt, which consists of image tokens and our text prompt
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids==image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(generated_ids[:, num_prompt_tokens:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
generated_text
To Run the model on a single T4 GPU in 4Bits
from peft import get_peft_model, LoraConfig,prepare_model_for_kbit_training
from transformers import TrainingArguments, Trainer , PaliGemmaForConditionalGeneration , AutoProcessor,BitsAndBytesConfig,AutoTokenizer
from peft import PeftModel, PeftConfig
from datasets import load_dataset
import torch
from datasets import load_dataset
dataset = load_dataset("damerajee/clean_hin_vqa",split='train')
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example['question']
device_index = torch.cuda.current_device()
print("device_index:",device_index)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained("BhashaAI/ViLaH",device_map={"": device_index},quantization_config=quantization_config,torch_dtype=torch.float16,low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")
inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k,v in inputs.items():
print(k,v.shape)
MAX_LENGTH = 200
# Autoregressively generate
# We use greedy decoding here, for more fancy methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH,temperature=0.7,repetition_penalty=2.0,do_sample=True)
# Next we turn each predicted token ID back into a string using the decode method
# We chop of the prompt, which consists of image tokens and our text prompt
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids==image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(generated_ids[:, num_prompt_tokens:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
generated_text
- Downloads last month
- 3
Inference API (serverless) has been turned off for this model.