@merve on Hugging Face: "New open Vision Language Model by @Google: PaliGemma 💙🤍 📝 Comes in 3B…"

merve

posted an update May 14

New open Vision Language Model by @Google : PaliGemma 💙🤍

📝 Comes in a 3B size, as pretrained, mix, and fine-tuned checkpoints at 224, 448, and 896 resolution
🧩 Combination of Gemma 2B LLM and SigLIP image encoder
🤗 Supported in transformers

PaliGemma can do:
🧩 Image segmentation and detection! 🤯
📑 Detailed document understanding and reasoning
🙋 Visual question answering, captioning and any other VLM task!
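
The task is picked through a short text prefix in the prompt. The sketch below shows one way to build those prompts; the prefixes (`caption`, `answer`, `detect`, `segment`) are my recollection of the conventions documented for the mix checkpoints, so treat them as assumptions and check the model card. `build_prompt` is a hypothetical helper, not part of any library.

```python
# Assumed prompt conventions for the mix checkpoints; verify against the model card.
TASK_TEMPLATES = {
    "caption": "caption {lang}",
    "vqa": "answer {lang} {target}",
    "detect": "detect {target}",
    "segment": "segment {target}",
}


def build_prompt(task: str, target: str = "", lang: str = "en") -> str:
    """Return the text prompt for a task.

    `target` is the question for "vqa" and the object name for
    "detect" / "segment"; "caption" needs no target.
    """
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    if task != "caption" and not target:
        raise ValueError(f"task {task!r} needs a target")
    return TASK_TEMPLATES[task].format(lang=lang, target=target).strip()


if __name__ == "__main__":
    for args in [("caption",), ("detect", "cat"), ("segment", "cat"),
                 ("vqa", "where is the cow standing?")]:
        print(build_prompt(*args))
```

Detection and segmentation answers come back as special location and mask tokens in the generated text, so they need an extra decoding step before you can draw boxes or masks.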

Read our blog 🔖 hf.co/blog/paligemma
Try the demo 🪀 hf.co/spaces/google/paligemma
Check out the Spaces and the models all in the collection 📚 google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models google/paligemma-ft-models-6643b03efb769dad650d2dda

MoonRide

May 14

Nice scores in benchmarks, but it failed at my first test image: https://huggingface.co/google/paligemma-3b-mix-448/discussions/2

There might be something wrong with the demo Space configuration, or... we need better benchmarks.

merve

May 14 (edited May 14)

@MoonRide it's not about benchmarks; the training dataset of the mix checkpoint is simply different from your use case. I responded on your issue with more details.

Cuiunbo

May 15 (edited May 15)

Hi! Nice work!
I tried this model and it is more than capable of doing what I hoped it could do. It's awesome! I have a few questions about the details:
Is the training data mentioned in the blog all of the training data, or did PaliGemma use other training data that is not mentioned?
Is there any plan to open-source a chatty model?

merve

May 15

@Cuiunbo I think @giffmana et al. will release a technical report in the coming days. For the mix and fine-tuned models, the details should be in the model cards. As for a chatty model, I don't think that is the intention of this release.
