Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL!
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook (see the inference sketch below)!
- SmolVLM can be fine-tuned in a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models on video benchmarks, despite not even being trained on videos!
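As a rough illustration of what on-device usage looks like, here is a minimal inference sketch with transformers. It assumes the checkpoint is published under a repo id like "HuggingFaceTB/SmolVLM-Instruct" and is loadable through AutoModelForVision2Seq; the image path and prompt are placeholders.

```python
# Minimal SmolVLM inference sketch (assumed repo id and loading class).
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumption: actual checkpoint name may differ
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)  # runs on CPU by default

image = Image.open("example.jpg")  # placeholder image
messages = [{
    "role": "user",
    "content": [{"type": "image"},
                {"type": "text", "text": "Describe this image."}],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```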
@victor@not-lain There has been a sudden and unusual outbreak of spam postings on the HF Forum that seem to be aimed at relaying online videos and commenting on them. It is also spanning multiple languages for some reason. I've flagged it too, but I'm not sure if the staff will be able to keep up with the manual measures in the future.
We are launching distilabel DataCraft: get started with synthetic data using clicks and natural language!
Workflow:
- Write down your custom GenAI use case
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub (see the sketch after this list)
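For context on that last step, this is roughly what pushing a generated dataset to the Hub looks like if done by hand with the datasets library; the repo name and example rows are hypothetical placeholders, not DataCraft's actual output format.

```python
# Sketch of the final workflow step: pushing a generated dataset to the Hub.
# Repo name and rows are hypothetical; requires `huggingface-cli login` first.
from datasets import Dataset

rows = {
    "instruction": ["Summarize the following support ticket ..."],
    "response": ["The customer reports that ..."],
}
ds = Dataset.from_dict(rows)
ds.push_to_hub("your-username/datacraft-demo")
```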
Powered by Argilla's distilabel and open-source LLMs
Uses free serverless HF Inference Endpoints
Use cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects
I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora. https://arxiv.org/abs/2408.14774 In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".
I am experimenting with Flux and trying to push it to its limits without training (as I am GPU-poor). I found some flaws in the pipelines, which I resolved, and now I can generate images of roughly the same quality as 4-step Flux Schnell in just 1 step. Demo Link: KingNish/Realtime-FLUX
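The modified pipeline itself is in the demo Space, not in this post; purely as a baseline illustration, a single-step call with the stock diffusers FluxPipeline looks like the sketch below. Model id and parameters are the usual public ones, not the author's changes.

```python
# Baseline illustration only: stock FLUX.1-schnell with a single denoising step.
# This is NOT the author's modified pipeline, just the setup it is compared against.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps on GPUs with limited VRAM

image = pipe(
    "a lighthouse on a cliff at sunset",
    num_inference_steps=1,   # the post targets 1-step generation
    guidance_scale=0.0,      # schnell is distilled for guidance-free sampling
).images[0]
image.save("flux_one_step.png")
```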
Hugging Face presents FineVideo! Unlocking the next generation of video understanding
3,400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood, and content descriptions per scene, as well as QA pairs. @mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.
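A quick way to browse the annotations without downloading the full corpus is to stream the dataset; the repo id below is an assumption on my part, and field names may differ.

```python
# Sketch for inspecting FineVideo annotations via streaming.
# "HuggingFaceFV/finevideo" is an assumed repo id; check the Hub for the real one.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # expect video data plus per-scene metadata and QA pairs
```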
This model essentially explores using different experts (MoE) for the image-encoder part of a vision language model. How? The authors concatenate the vision encoders' output tokens together and apply "pre-alignment": essentially fine-tuning the experts with a frozen text encoder.
Then they freeze both the experts and the decoder and train just the projection layer, and finally they unfreeze everything for supervised fine-tuning.
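A hedged PyTorch sketch of that staged schedule, with hypothetical module names standing in for the vision experts, language decoder, and projection layer described above:

```python
# Staged freezing sketch; module names are placeholders, only the pattern matters.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle requires_grad on every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def projection_only_stage(vision_experts, decoder, projection):
    # Freeze experts and decoder; train only the projection layer.
    set_trainable(vision_experts, False)
    set_trainable(decoder, False)
    set_trainable(projection, True)

def full_sft_stage(vision_experts, decoder, projection):
    # Unfreeze everything for supervised fine-tuning.
    for module in (vision_experts, decoder, projection):
        set_trainable(module, True)
```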
In the paper, they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating visual tokens works well. The rest of the architecture is quite similar to LLaVA (see the architecture below).
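To make the fusion idea concrete, here is an illustrative sketch of running several vision experts, concatenating their output tokens (along the sequence dimension here; channel-wise concatenation is another option), and projecting into the LLM embedding space. Names, dimensions, and shapes are placeholders rather than the paper's exact modules.

```python
# Illustrative multi-expert fusion: concatenate visual tokens, then project.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, experts: nn.ModuleList, expert_dim: int, llm_dim: int):
        super().__init__()
        self.experts = experts                      # each maps pixels -> (B, N_i, expert_dim)
        self.projection = nn.Linear(expert_dim, llm_dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        tokens = [expert(pixel_values) for expert in self.experts]
        fused = torch.cat(tokens, dim=1)            # (B, sum(N_i), expert_dim)
        return self.projection(fused)               # (B, sum(N_i), llm_dim)
```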
Our AI guides walk your favorite LLM through complex reasoning problems.
Goals:
1. Reliability. AIs consistently follow reasoning methods.
2. Self-explainability. AIs see reasoning protocols and can explain internal deliberation.
3. Contestability. Users may amend AI reasoning and revise plausibility assessments.
Try out Guided Reasoning with our light demo chatbot, powered by Hugging Face's free Inference API and small LLMs. (Sorry for the poor latency and limited availability -- we are currently searching for compute sponsors to run more powerful models, faster, and to optimize Guided Reasoning performance.)
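For anyone curious what "powered by the free Inference API and small LLMs" means in practice, a minimal call via huggingface_hub looks like the sketch below; the model id is a placeholder for whichever small LLM the demo actually uses.

```python
# Minimal serverless Inference API call; model id is a placeholder.
from huggingface_hub import InferenceClient

client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
reply = client.chat_completion(
    messages=[{"role": "user", "content": "Lay out the pros and cons of remote work."}],
    max_tokens=256,
)
print(reply.choices[0].message.content)
```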
Built on top of Logikon's open-source AI reasoning analytics.
Introducing Hugging Face's Multilingual Speech-to-Speech! Our modular, cross-platform pipeline for running GPT-4o-like experiences on device can now seamlessly switch languages mid-conversation with an imperceptible 100 ms delay.
Building on an amazing early reception with 2,600 stars on GitHub, we are expanding the library to support multiple languages.
Try it out with a flag: --language fr
Or don't set the flag and let the system detect the language.
The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.
The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between two images, and converting a screenshot to code.
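A minimal loading sketch, assuming the collection is hosted as "HuggingFaceM4/the_cauldron" with per-task subsets (the "ai2d" subset name and the field names are assumptions):

```python
# Sketch for loading one of The Cauldron's task subsets with datasets.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
example = ds[0]
print(example["texts"])   # user/assistant turns (field name may differ)
print(example["images"])  # associated image(s)
```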