Andres Marafioti

andito

AI & ML interests

Multimodal models, VLMs, and TTS

andito's activity

Reacted to merve's post with 🔥 about 7 hours ago
The authors of ColPali trained a retrieval model based on SmolVLM 🤠 vidore/colsmolvlm-alpha
TL;DR:

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 💗
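For context on what a ColPali-style retriever does under the hood: every query token and every page patch gets its own embedding, and a query is scored against a page with late interaction (MaxSim). Below is a minimal sketch of that scoring step only, using random tensors in place of real model outputs; names and shapes are illustrative, not the colpali-engine API.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document page.

    query_emb: [num_query_tokens, dim]  multi-vector query embedding
    doc_emb:   [num_patches, dim]       multi-vector page embedding
    """
    # Similarity between every query token and every page patch
    sim = query_emb @ doc_emb.T                  # [num_query_tokens, num_patches]
    # Each query token keeps its best-matching patch; scores are summed
    return sim.max(dim=-1).values.sum()

# Toy example with random embeddings standing in for model outputs
torch.manual_seed(0)
query = torch.randn(16, 128)                     # e.g. 16 query tokens, 128-dim projections
pages = [torch.randn(1024, 128) for _ in range(3)]  # 3 candidate pages
scores = torch.stack([maxsim_score(query, p) for p in pages])
print(scores.argmax().item())                    # index of the best-matching page
```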
posted an update about 11 hours ago
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🤯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook! 🚀
- SmolVLM can be fine-tuned on a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models on video benchmarks, despite not even being trained on videos!

Check out more!
Demo: HuggingFaceTB/SmolVLM
Blog: https://huggingface.co/blog/smolvlm
Model: HuggingFaceTB/SmolVLM-Instruct
Fine-tuning script: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
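A rough sketch of running SmolVLM-Instruct with transformers, along the lines of the snippet in the model card (the model card and blog are authoritative for the exact chat template; the image path and prompt below are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```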
Reacted to THUdyh's post with 🔥 about 1 month ago
Reacted to clem's post with 🚀 about 1 month ago
Who's going to get to the most liked model on Hugging Face first: StabilityAI, Meta, Black Forest or someone else? The race is on!
  • 2 replies
Β·
Reacted to John6666's post with 👀 2 months ago
@victor @not-lain There has been a sudden and unusual outbreak of spam postings on the HF Forum that seem to be aimed at relaying online videos and commenting on them. It is also spanning multiple languages for some reason. I've flagged it too, but I'm not sure if the staff will be able to keep up with the manual measures in the future.
Reacted to davidberenstein1957's post with 🚀 2 months ago
🧶 We are launching distilabel DataCraft: get started with synthetic data using clicks and natural language!

🌊 Workflow
- Write down your custom GenAI use case
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub

⚡️ Powered by Argilla's distilabel and open source LLMs
🆓 Uses Free Serverless HF Inference Endpoints

💡 Use Cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects

🚀 Start crafting your custom datasets today, and do it faster, more easily, and more privately with distilabel DataCraft!
https://huggingface.co/spaces/argilla/distilabel-datacraft
  • 1 reply
Β·
Reacted to julien-c's post with ❤️ 2 months ago
Hey it was good meeting you yesterday @MaziyarPanahi 🔥

thanks @mishig for setting this up

Let's make the Hub as useful as possible for the community ❤️
  • 1 reply
Β·
Reacted to grimjim's post with 👀 2 months ago
I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora.
https://arxiv.org/abs/2408.14774
In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".
Reacted to KingNish's post with 🔥 3 months ago
I am experimenting with Flux and trying to push it to its limits without training (as I am GPU-poor 😅).
I found some flaws in the pipelines, which I resolved, and now I am able to generate an image of roughly the same quality as Flux Schnell at 4 steps in just 1 step.
Demo Link:
KingNish/Realtime-FLUX

  • 1 reply
Β·
posted an update 3 months ago
Hugging Face presents FineVideo 🎥! Unlocking the next generation of video understanding 🚀

🤯 3,400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood, and content descriptions per scene, as well as QA pairs.
🔥
@mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.

Very psyched to fine-tune Idefics on this dataset. ⚡️
Explore the videos: HuggingFaceFV/FineVideo-Explorer
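If you want to poke at the data itself, a minimal sketch using datasets in streaming mode; it assumes the dataset lives at HuggingFaceFV/finevideo, and the exact column names are best checked on the dataset card:

```python
from datasets import load_dataset

# Stream so you don't download thousands of hours of video up front.
# Access may require accepting the dataset's terms on the Hub first.
ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)

sample = next(iter(ds))
# Inspect what each sample carries (video bytes plus the annotations
# described above: scenes, moods, character descriptions, QA pairs, ...)
print(sample.keys())
```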
Reacted to kz919's post with 🤗 3 months ago
Reacted to AlexBodner's post with 👀 3 months ago
Reacted to merve's post with 🚀👏 3 months ago
NVIDIA just dropped NVEagle 🦅

Super impressive vision language model that comes in 7B, 13B, and 13B fine-tuned on chat 💬
Model repositories: merve/nveagle-66d0705108582d73bb235c26
Try it: NVEagle/Eagle-X5-13B-Chat 💬 (works very well! 🤯)

This model essentially explores having different experts (MoE) for the image encoder part of a vision language model.
How? 🧐
The authors concatenate the vision encoder output tokens together, and they apply "pre-alignment": essentially, fine-tuning the experts while the text model is frozen.

Then they freeze both the experts and the decoder and train just the projection layer, and finally, they unfreeze everything for supervised fine-tuning ✨

In the paper, they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating visual tokens works well.
The rest of the architecture is quite similar to LLaVA.
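To make the fusion idea concrete, here is a toy sketch of the "concatenate the visual tokens and learn a projection" recipe described above; the encoder dimensions and projection are made up for illustration and are not taken from the Eagle code:

```python
import torch
import torch.nn as nn

class ConcatFusionProjector(nn.Module):
    """Toy version of channel-concatenation fusion for multiple vision experts."""

    def __init__(self, expert_dims: list[int], llm_dim: int):
        super().__init__()
        # Simple MLP projection from the concatenated expert features to the LLM width
        self.proj = nn.Sequential(
            nn.Linear(sum(expert_dims), llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, expert_tokens: list[torch.Tensor]) -> torch.Tensor:
        # Each expert outputs [batch, num_tokens, dim_i]; concat along the feature dim
        fused = torch.cat(expert_tokens, dim=-1)   # [batch, num_tokens, sum(dims)]
        return self.proj(fused)                    # [batch, num_tokens, llm_dim]

# Two hypothetical vision experts producing 576 tokens each
tokens_a = torch.randn(1, 576, 1024)   # e.g. a CLIP-like encoder
tokens_b = torch.randn(1, 576, 768)    # e.g. a detection/segmentation expert
projector = ConcatFusionProjector([1024, 768], llm_dim=4096)
visual_embeds = projector([tokens_a, tokens_b])
print(visual_embeds.shape)             # torch.Size([1, 576, 4096])
```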
Reacted to ggbetz's post with 🔥 3 months ago
🧭 Guided Reasoning

👋 Hi everyone,

We're releasing Guided Reasoning:

Our AI guides walk your favorite LLM through complex reasoning problems.

🎯 Goals:

1️⃣ Reliability. AIs consistently follow reasoning methods.
2️⃣ Self-explainability. AIs see reasoning protocols and can explain their internal deliberation.
3️⃣ Contestability. Users may amend AI reasoning and revise plausibility assessments.

Try out Guided Reasoning with our light demo chatbot, powered by 🤗 Hugging Face's free Inference API and small LLMs. (Sorry for the poor latency and limited availability -- we are currently searching for 💸 compute sponsors to run more powerful models, faster, and optimize guided reasoning performance.)

Built on top of Logikon's open-source AI reasoning analytics.

Demo chat app: logikon/benjamin-chat
GitHub: https://github.com/logikon-ai/logikon
Technical report: https://arxiv.org/abs/2408.16331

➡️ Check it out and get involved! Looking forward to hearing from you.
replied to their post 3 months ago
posted an update 3 months ago
🚀 Introducing Hugging Face's Multilingual Speech-to-Speech! 🎤
💬 Our modular, cross-platform pipeline to run GPT4o-like experiences on device can now seamlessly switch languages mid-conversation with an imperceptible 100ms delay.

🌟 Building on an amazing early reception with 2600 stars on GitHub 🌟
🚀 We are expanding the library to support multiple languages
🔥 Try it out with a flag: --language fr
🤯 Or don't set the flag and let the system detect the language

💡 What feature should we add next?
  • 1 reply
Β·
Reacted to vikhyatk's post with 🔥 3 months ago
Pushed a new update to vikhyatk/moondream2 today. TextVQA up from 60.2 to 65.2, DocVQA up from 61.9 to 70.5.

Space has been updated to the new model if you want to try it out! vikhyatk/moondream2
Reacted to merve's post with 🤗 4 months ago
Fine-tune Florence-2 on any task 🔥

Today we release a notebook and a walkthrough blog on fine-tuning Florence-2 on the DocVQA dataset @andito @SkalskiP

Blog: https://huggingface.co/blog 📕
Notebook: https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing 📖
Florence-2 is a great vision-language model thanks to its massive dataset and small size!

This model requires conditioning through task prefixes, and it's not as generalist; it requires fine-tuning for a new task, such as DocVQA 📝

We fine-tuned the model on an A100 (one can also use a smaller GPU with a smaller batch size) and saw that the model picks up new tasks 🥹

See below how it looks before and after fine-tuning 🤩
Play with the demo here: andito/Florence-2-DocVQA 🏄‍♀️
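For a sense of what "conditioning through task prefixes" looks like in practice, here is a sketch of Florence-2 inference with transformers; the checkpoint, task prefix, and image path are illustrative, and the blog and notebook show the exact prompt format used for the DocVQA fine-tune:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Base checkpoint as an example; a DocVQA fine-tune follows the same pattern
model_id = "microsoft/Florence-2-base-ft"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("document.png")   # placeholder image
prompt = "<CAPTION>"                 # task prefix; fine-tuning teaches new tasks such as DocVQA

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```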
Reacted to HugoLaurencon's post with ❤️ 5 months ago
The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.

The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between two images, or converting a screenshot to code.

HuggingFaceM4/the_cauldron
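A quick sketch of pulling one subset with datasets. The Cauldron is organized into per-task configs; "ai2d" is used here purely as an example, and the dataset card lists all 50:

```python
from datasets import load_dataset

# Each config is one of the 50 source datasets, already converted to
# the user/assistant conversation format described above.
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")

sample = ds[0]
print(sample.keys())   # expect image(s) plus user/assistant-formatted text turns
```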