Daniel van Strien

davanstrien

AI & ML interests

Machine Learning Librarian

davanstrien's activity

Reacted to andito's post with πŸ”₯ about 8 hours ago
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🀯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook! 🚀
- SmolVLM can be fine-tuned in a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!

Check out more!
Demo: HuggingFaceTB/SmolVLM
Blog: https://huggingface.co/blog/smolvlm
Model: HuggingFaceTB/SmolVLM-Instruct
Fine-tuning script: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
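
Not from the post itself, but for orientation, here's a minimal inference sketch with transformers. The class names and processor usage are assumptions based on similar VLMs; the model card has the canonical snippet.

```python
# Hedged sketch: local inference with SmolVLM-Instruct via transformers.
# Class names and processor usage are assumptions; see the model card.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.to("cuda" if torch.cuda.is_available() else "cpu")

image = Image.open("example.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```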
Reacted to nataliaElv's post with πŸ‘€ about 10 hours ago
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6
Reacted to their post with ❀️ 1 day ago
posted an update 1 day ago
First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts πŸ¦‹

πŸ“Š 1M public posts from Bluesky's firehose API
πŸ” Includes text, metadata, and language predictions
πŸ”¬ Perfect to experiment with using ML for Bluesky πŸ€—

Excited to see people build more open tools for a more open social media platform!
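
For anyone wanting to poke at it, a quick hedged sketch with the datasets library. The column names are assumptions based on the description above; the dataset card is authoritative.

```python
# Hedged sketch: loading the 1M-post dataset; column names are assumptions.
from datasets import load_dataset

ds = load_dataset("bluesky-community/one-million-bluesky-posts", split="train")
print(ds)             # features should include text, metadata, language predictions
print(ds[0]["text"])  # "text" column name is an assumption; check ds.features
```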
Reacted to KnutJaegersberg's post with ❀️πŸ”₯ 1 day ago
Reacted to davidberenstein1957's post with πŸš€πŸ”₯ 1 day ago
Let’s build the next generation of amazing image-generation models

The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source so the community cannot train open models on them. Let’s change that!

The community can contribute image preferences to an open-source dataset that could be used for building AI models that convert text to image, like the Flux or Stable Diffusion families. The dataset will be open source so everyone can use it to train models that we can all use.

Blog: https://huggingface.co/blog/burtenshaw/image-preferences
posted an update 2 days ago
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
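
A rough sketch of how a posts-per-second counter like this might consume the stream, here via the Jetstream JSON mirror rather than the raw firehose. The endpoint URL and event shape are assumptions; the raw AT Protocol firehose is CBOR-encoded and better handled with the atproto SDK.

```python
# Hedged sketch: count Bluesky posts per second from a Jetstream endpoint.
# Endpoint and JSON event shape are assumptions, not a documented contract.
import asyncio
import json
import time

import websockets  # pip install websockets

URL = "wss://jetstream2.us-east.bsky.network/subscribe?wantedCollections=app.bsky.feed.post"

async def count_posts() -> None:
    async with websockets.connect(URL) as ws:
        window_start, count = time.monotonic(), 0
        async for raw in ws:
            event = json.loads(raw)
            if event.get("kind") == "commit":  # post commit events (assumed shape)
                count += 1
            if time.monotonic() - window_start >= 1.0:
                print(f"{count} posts/sec")
                window_start, count = time.monotonic(), 0

asyncio.run(count_posts())
```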
Reacted to rwightman's post with πŸ‘ 5 days ago
I'm currently on a push to expand the scope of image-based datasets on the Hub. There's certainly a lot already, but for anyone who's looked closely, there's not a whole lot of standardization. I aim to fix that: datasets under the https://huggingface.co/timm and https://huggingface.co/pixparse orgs will serve as canonical examples for various task/modality combinations and be usable without fuss in libraries like timm, OpenCLIP, and hopefully more.

I just uploaded the first multi-label dataset that I'll support with timm scripts soon: timm/plant-pathology-2021

Next up: object detection & segmentation! I've got an annotation spec sorted out, a lot of datasets ready to rip, and yeah, that means timm support for object detection, and eventually segmentation, is finally under development :O
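
To try the multi-label data before timm script support lands, a hedged peek with the datasets library. Split and column names are assumptions; the dataset card has the authoritative layout.

```python
# Hedged sketch: a quick look at the multi-label dataset.
from datasets import load_dataset

ds = load_dataset("timm/plant-pathology-2021", split="train")
example = ds[0]
print(example.keys())  # expect an image column plus one or more label columns
```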
Reacted to ArthurZ's post with πŸ”₯ 6 days ago
Reacted to jsulz's post with πŸ”₯ 6 days ago
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
πŸš€ Download just the updates, not the whole file.
🧠 Your files are stored as deduplicated chunks.

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
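
To make the idea concrete, here's a toy sketch of content-defined chunking with a naive rolling hash. It is illustrative only: production systems use tuned algorithms like FastCDC with gear hashes and min/max chunk sizes.

```python
# Toy content-defined chunking (CDC) sketch: boundaries depend on content, not
# offsets, so an edit in one region only changes the chunks it touches.
import hashlib

def chunk_boundaries(data: bytes, mask: int = 0x0FFF, min_size: int = 48):
    """Yield (start, end) spans; a boundary fires when the rolling hash
    matches a fixed bit pattern (here: low 12 bits all zero)."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        if i - start >= min_size and (rolling & mask) == 0:
            yield start, i + 1
            start, rolling = i + 1, 0
    if start < len(data):
        yield start, len(data)

def dedup_store(data: bytes) -> dict[str, bytes]:
    """Key chunks by content hash: identical chunks across versions are stored once."""
    return {hashlib.sha256(data[s:e]).hexdigest(): data[s:e]
            for s, e in chunk_boundaries(data)}
```

Two versions of a file that differ in one region share every chunk outside that region, which is where the upload and download savings come from.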
Reacted to takarajordan's post with πŸ‘ 6 days ago
First post here goes!

takarajordan/CineDiffusion

Super excited to announce CineDiffusion 🎥! It creates images up to 4.2 megapixels in cinematic ultrawide formats like:
- 2.39:1 (Modern Widescreen)
- 2.76:1 (Ultra Panavision 70)
- 3.00:1 (Experimental Ultra-wide)
- 4.00:1 (Polyvision)
- 2.55:1 (CinemaScope)
- 2.20:1 (Todd-AO)

More to come soon!!

Thanks to @John6666 and @Resoldjew for your early support <3

And thanks to the team at ShuttleAI for their brand new Shuttle-3 model, what an amazing job.

shuttleai/shuttle-3-diffusion
Reacted to davidberenstein1957's post with πŸ‘€ 6 days ago
πŸ€—πŸ”­ Introducing Observers: A Lightweight SDK for AI Observability πŸ”­πŸ€—

Observers is an open-source Python SDK that provides comprehensive observability for AI applications. Our library makes it easy to:

- Track and record interactions with AI models
- Store observations in multiple backends
- Query and analyse your AI interactions with ease

https://huggingface.co/blog/davidberenstein1957/observers-a-lightweight-sdk-for-ai-observability
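
A hypothetical usage sketch of the pattern (the wrapper and store names below are my assumptions, not necessarily the SDK's documented API; see the blog and repo for the real interface):

```python
# Hypothetical sketch only: wrap_openai and DuckDBStore are assumed names,
# not confirmed Observers API; consult the repo for the documented interface.
from openai import OpenAI
from observers.observers import wrap_openai  # assumed import path
from observers.stores import DuckDBStore    # assumed backend

store = DuckDBStore()                        # record observations locally
client = wrap_openai(OpenAI(), store=store)  # calls pass through, now tracked

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, observability!"}],
)
print(response.choices[0].message.content)
```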
Reacted to AdinaY's post with 😎 6 days ago
Reacted to singhsidhukuldeep's post with πŸ‘€ 6 days ago
Sorry judge, my lawyer hallucinated? πŸ˜‚ If you get an AI lawyer, you would want it to be hallucination-free!

New @Stanford and @Yale research reveals surprising findings about leading AI legal research tools. Here's what you need to know:

>> Key Findings
The study tested LexisNexis (Lexis+ AI), Thomson Reuters (Westlaw AI & Ask Practical Law AI), and GPT-4, finding hallucination rates between 17% and 33% despite claims of being "hallucination-free".

>> Technical Deep Dive
The research evaluated these tools, which are built on a Retrieval-Augmented Generation (RAG) architecture operating in two crucial steps (a toy sketch follows the lists below):

1. Retrieval System:
- Uses neural text embeddings to capture semantic meaning
- Employs both lexical and semantic search mechanisms
- Implements document filtering and extraction
- Retrieves relevant legal documents from vast databases

2. Generation Pipeline:
- Processes retrieved documents alongside original queries
- Synthesizes information from multiple legal sources
- Generates responses based on retrieved context
- Includes citation verification mechanisms
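
To make the two steps concrete, here's a toy RAG sketch. It is illustrative only: the commercial tools studied are far more elaborate, and the embedding model named here is just a common default, not anything these products use.

```python
# Toy RAG sketch: semantic retrieval (step 1) feeding a grounded prompt (step 2).
# Illustrative only; model choice and corpus are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Case A: the court held that an oral contract for land is unenforceable.",
    "Statute B: consumer contracts must disclose all recurring fees.",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 1: embed the query and rank documents by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    return [documents[i] for i in np.argsort(-(doc_vecs @ q))[:k]]

def build_prompt(query: str) -> str:
    """Step 2: hand retrieved context to the generator to reduce hallucination."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Are oral land contracts enforceable?"))
```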

>> Performance Breakdown:
- Lexis+ AI: 65% accuracy rate
- Westlaw AI: 42% accuracy rate
- Ask Practical Law AI: Over 60% incomplete answers

>> Why This Matters
This research exposes critical vulnerabilities in AI legal tools that lawyers increasingly rely on. It's essential for legal professionals to understand these limitations when incorporating AI into their practice.
Reacted to AkimfromParis's post with ❀️ 6 days ago
πŸ‡―πŸ‡΅ The Open Japanese LLM Leaderboard created by LLM-jp 🌸 in partnership with HuggingFace πŸ€— was released today!

Blog: https://huggingface.co/blog/leaderboard-japanese
Space: llm-jp/open-japanese-llm-leaderboard

🌍 The leaderboard is available in both Japanese and English
πŸ“š Based on the evaluation tool, llm-jp-eval with more than 20 datasets for Japanese LLMs
πŸ“Š The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
πŸ’» For the comfort of users, we chose a horizontal UI, and implemented it in a light and dark theme on Gradio
πŸ”¬ The radar chart provides a very interesting visualization of metrics!
🌱 We are using the Japanese research platform, MDX, so please be patient!
⚡ LLMs bigger than 70B will be evaluated soon…

How do you say “GPUs Go Brrr” in Japanese? GPUがブンブン〜! (pronounced "GPU ga bunbun!") 🔥
Reacted to fdaudens's post with πŸ‘€ 6 days ago
πŸš€ DeepSeek just dropped DeepSeek-R1-Lite-Preview with β€œreasoning” capacity.

- Matches OpenAI o1-preview on AIME & MATH benchmarks.
- Transparent process output
- Open-source model to be released

Try it out: https://chat.deepseek.com/
Reacted to rwightman's post with πŸš€ 6 days ago
Want to validate some hparams or figure out what timm model to use before committing to download or training with a large dataset? Try mini-imagenet: timm/mini-imagenet

I had this sitting on my drive and forgot where I pulled it together from. It's 100 classes of ImageNet, 50k train and 10k val images (from the ImageNet-1k train set), and 5k test images (from the ImageNet-1k val set). 7.4GB instead of >100GB for the full ImageNet-1k. This version is not reduced-resolution like some other 'mini' versions. Super easy to use with timm train/val scripts; check out the dataset card.

I often check fine-tuning with even smaller datasets like:
* timm/resisc45
* timm/oxford-iiit-pet
But those are a bit small to train any modest size model w/o starting from pretrained weights.
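
For a quick look before wiring it into timm scripts, a hedged sketch with the datasets library (split names follow the description above; the dataset card has the canonical usage):

```python
# Hedged sketch: sanity-check mini-imagenet sizes before a training run.
from datasets import load_dataset

ds = load_dataset("timm/mini-imagenet")
print({split: ds[split].num_rows for split in ds})  # expect ~50k/10k/5k per the post
```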
Reacted to Symbol-LLM's post with πŸ”₯ 6 days ago
πŸ₯³ Thrilled to introduce our recent efforts on bootstrapping VLMs for multi-modal chain-of-thought reasoning !

πŸ“• Title: Vision-Language Models Can Self-Improve Reasoning via Reflection

πŸ”— Link: Vision-Language Models Can Self-Improve Reasoning via Reflection (2411.00855)

πŸ˜‡Takeaways:

- We found that VLMs can self-improve reasoning performance through a reflection mechanism, and importantly, this approach scales with test-time compute.

- Evaluations on comprehensive and diverse Vision-Language reasoning tasks are included!