Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL!
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook (see the inference sketch below)!
- SmolVLM can be fine-tuned in a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models on video benchmarks, despite not even being trained on videos!
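As a rough illustration of what on-device usage looks like, here is a minimal inference sketch with transformers. It assumes the checkpoint is published under a repo id like "HuggingFaceTB/SmolVLM-Instruct" and is loadable through AutoModelForVision2Seq; the image path and prompt are placeholders.

```python
# Minimal SmolVLM inference sketch (assumed repo id and loading class).
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumption: actual checkpoint name may differ
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)  # runs on CPU by default

image = Image.open("example.jpg")  # placeholder image
messages = [{
    "role": "user",
    "content": [{"type": "image"},
                {"type": "text", "text": "Describe this image."}],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```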
@victor@not-lain There has been a sudden and unusual outbreak of spam postings on the HF Forum that seem to be aimed at relaying online videos and commenting on them. It is also spanning multiple languages for some reason. I've flagged it too, but I'm not sure if the staff will be able to keep up with the manual measures in the future.
We are launching distilabel DataCraft: get started with synthetic data using clicks and natural language!
Workflow:
- Write down your custom GenAI use case
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub (see the sketch after this list)
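For context on that last step, this is roughly what pushing a generated dataset to the Hub looks like if done by hand with the datasets library; the repo name and example rows are hypothetical placeholders, not DataCraft's actual output format.

```python
# Sketch of the final workflow step: pushing a generated dataset to the Hub.
# Repo name and rows are hypothetical; requires `huggingface-cli login` first.
from datasets import Dataset

rows = {
    "instruction": ["Summarize the following support ticket ..."],
    "response": ["The customer reports that ..."],
}
ds = Dataset.from_dict(rows)
ds.push_to_hub("your-username/datacraft-demo")
```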
Powered by Argilla's distilabel and open-source LLMs
Uses free serverless HF Inference Endpoints
Use cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects
I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora. https://arxiv.org/abs/2408.14774 In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".
I am experimenting with Flux and trying to push it to its limits without training (as I am GPU-poor). I found some flaws in the pipelines, which I resolved, and now I can generate images of roughly the same quality as 4-step Flux Schnell in just 1 step. Demo Link: KingNish/Realtime-FLUX
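The modified pipeline itself is in the demo Space, not in this post; purely as a baseline illustration, a single-step call with the stock diffusers FluxPipeline looks like the sketch below. Model id and parameters are the usual public ones, not the author's changes.

```python
# Baseline illustration only: stock FLUX.1-schnell with a single denoising step.
# This is NOT the author's modified pipeline, just the setup it is compared against.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps on GPUs with limited VRAM

image = pipe(
    "a lighthouse on a cliff at sunset",
    num_inference_steps=1,   # the post targets 1-step generation
    guidance_scale=0.0,      # schnell is distilled for guidance-free sampling
).images[0]
image.save("flux_one_step.png")
```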
Hugging Face presents FineVideo! Unlocking the next generation of video understanding
3,400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood, and content descriptions per scene, as well as QA pairs. @mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.
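A quick way to browse the annotations without downloading the full corpus is to stream the dataset; the repo id below is an assumption on my part, and field names may differ.

```python
# Sketch for inspecting FineVideo annotations via streaming.
# "HuggingFaceFV/finevideo" is an assumed repo id; check the Hub for the real one.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # expect video data plus per-scene metadata and QA pairs
```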
This model essentially explores using different experts (MoE) for the image-encoder part of a vision language model. How? The authors concatenate the vision encoders' output tokens together and apply "pre-alignment": essentially fine-tuning the experts with a frozen text encoder.
Then they freeze both the experts and the decoder and train just the projection layer, and finally they unfreeze everything for supervised fine-tuning.
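A hedged PyTorch sketch of that staged schedule, with hypothetical module names standing in for the vision experts, language decoder, and projection layer described above:

```python
# Staged freezing sketch; module names are placeholders, only the pattern matters.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle requires_grad on every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def projection_only_stage(vision_experts, decoder, projection):
    # Freeze experts and decoder; train only the projection layer.
    set_trainable(vision_experts, False)
    set_trainable(decoder, False)
    set_trainable(projection, True)

def full_sft_stage(vision_experts, decoder, projection):
    # Unfreeze everything for supervised fine-tuning.
    for module in (vision_experts, decoder, projection):
        set_trainable(module, True)
```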
In the paper, they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating visual tokens works well. The rest of the architecture is quite similar to LLaVA (see the architecture below).
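To make the fusion idea concrete, here is an illustrative sketch of running several vision experts, concatenating their output tokens (along the sequence dimension here; channel-wise concatenation is another option), and projecting into the LLM embedding space. Names, dimensions, and shapes are placeholders rather than the paper's exact modules.

```python
# Illustrative multi-expert fusion: concatenate visual tokens, then project.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, experts: nn.ModuleList, expert_dim: int, llm_dim: int):
        super().__init__()
        self.experts = experts                      # each maps pixels -> (B, N_i, expert_dim)
        self.projection = nn.Linear(expert_dim, llm_dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        tokens = [expert(pixel_values) for expert in self.experts]
        fused = torch.cat(tokens, dim=1)            # (B, sum(N_i), expert_dim)
        return self.projection(fused)               # (B, sum(N_i), llm_dim)
```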
Our AI guides walk your favorite LLM through complex reasoning problems.
Goals:
1. Reliability. AIs consistently follow reasoning methods.
2. Self-explainability. AIs see reasoning protocols and can explain internal deliberation.
3. Contestability. Users may amend AI reasoning and revise plausibility assessments.
Try out Guided Reasoning with our light demo chatbot, powered by Hugging Face's free Inference API and small LLMs. (Sorry for the poor latency and limited availability -- we are currently searching for compute sponsors to run more powerful models, faster, and to optimize Guided Reasoning performance.)
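For anyone curious what "powered by the free Inference API and small LLMs" means in practice, a minimal call via huggingface_hub looks like the sketch below; the model id is a placeholder for whichever small LLM the demo actually uses.

```python
# Minimal serverless Inference API call; model id is a placeholder.
from huggingface_hub import InferenceClient

client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
reply = client.chat_completion(
    messages=[{"role": "user", "content": "Lay out the pros and cons of remote work."}],
    max_tokens=256,
)
print(reply.choices[0].message.content)
```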
Built on top of Logikon's open-source AI reasoning analytics.
Introducing Hugging Face's Multilingual Speech-to-Speech! Our modular, cross-platform pipeline for running GPT-4o-like experiences on device can now seamlessly switch languages mid-conversation with an imperceptible 100 ms delay.
Building on an amazing early reception with 2,600 stars on GitHub, we are expanding the library to support multiple languages.
Try it out with a flag: --language fr
Or don't set the flag and let the system detect the language.
The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.
The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between two images, and converting a screenshot to code.
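A minimal loading sketch, assuming the collection is hosted as "HuggingFaceM4/the_cauldron" with per-task subsets (the "ai2d" subset name and the field names are assumptions):

```python
# Sketch for loading one of The Cauldron's task subsets with datasets.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
example = ds[0]
print(example["texts"])   # user/assistant turns (field name may differ)
print(example["images"])  # associated image(s)
```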