Daniel van Strien

davanstrien

AI & ML interests

Machine Learning Librarian

davanstrien's activity

Reacted to andito's post with πŸ”₯ about 8 hours ago
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🀯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook! 🚀
- SmolVLM can be fine-tuned in a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!

Check out more!
Demo: HuggingFaceTB/SmolVLM
Blog: https://huggingface.co/blog/smolvlm
Model: HuggingFaceTB/SmolVLM-Instruct
Fine-tuning script: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
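
Not from the post itself, but for orientation, here's a minimal inference sketch with transformers. The class names and processor usage are assumptions based on similar VLMs; the model card has the canonical snippet.

```python
# Hedged sketch: local inference with SmolVLM-Instruct via transformers.
# Class names and processor usage are assumptions; see the model card.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.to("cuda" if torch.cuda.is_available() else "cpu")

image = Image.open("example.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```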
Reacted to nataliaElv's post with πŸ‘€ about 10 hours ago
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6
Reacted to their post with ❀️ 1 day ago
posted an update 1 day ago
First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts πŸ¦‹

πŸ“Š 1M public posts from Bluesky's firehose API
πŸ” Includes text, metadata, and language predictions
πŸ”¬ Perfect to experiment with using ML for Bluesky πŸ€—

Excited to see people build more open tools for a more open social media platform!
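
For anyone wanting to poke at it, a quick hedged sketch with the datasets library. The column names are assumptions based on the description above; the dataset card is authoritative.

```python
# Hedged sketch: loading the 1M-post dataset; column names are assumptions.
from datasets import load_dataset

ds = load_dataset("bluesky-community/one-million-bluesky-posts", split="train")
print(ds)             # features should include text, metadata, language predictions
print(ds[0]["text"])  # "text" column name is an assumption; check ds.features
```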
Reacted to KnutJaegersberg's post with ❀️πŸ”₯ 1 day ago
Reacted to davidberenstein1957's post with πŸš€πŸ”₯ 1 day ago
Let’s build the next generation of amazing image-generation models

The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source so the community cannot train open models on them. Let’s change that!

The community can contribute image preferences to an open-source dataset that could be used for building AI models that convert text to image, like the Flux or Stable Diffusion families. The dataset will be open source so everyone can use it to train models that we can all use.

Blog: https://huggingface.co/blog/burtenshaw/image-preferences
posted an update 2 days ago
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
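
A rough sketch of how a posts-per-second counter like this might consume the stream, here via the Jetstream JSON mirror rather than the raw firehose. The endpoint URL and event shape are assumptions; the raw AT Protocol firehose is CBOR-encoded and better handled with the atproto SDK.

```python
# Hedged sketch: count Bluesky posts per second from a Jetstream endpoint.
# Endpoint and JSON event shape are assumptions, not a documented contract.
import asyncio
import json
import time

import websockets  # pip install websockets

URL = "wss://jetstream2.us-east.bsky.network/subscribe?wantedCollections=app.bsky.feed.post"

async def count_posts() -> None:
    async with websockets.connect(URL) as ws:
        window_start, count = time.monotonic(), 0
        async for raw in ws:
            event = json.loads(raw)
            if event.get("kind") == "commit":  # post commit events (assumed shape)
                count += 1
            if time.monotonic() - window_start >= 1.0:
                print(f"{count} posts/sec")
                window_start, count = time.monotonic(), 0

asyncio.run(count_posts())
```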
Reacted to rwightman's post with πŸ‘ 5 days ago
I'm currently on a push to expand the scope of image-based datasets on the Hub. There's certainly a lot already, but for anyone who's looked closely, there's not a whole lot of standardization. I aim to fix that: datasets under the https://huggingface.co/timm and https://huggingface.co/pixparse orgs will serve as canonical examples for various task/modality combinations and be usable without fuss in libraries like timm, OpenCLIP, and hopefully more.

I just uploaded the first multi-label dataset that I'll support with timm scripts soon: timm/plant-pathology-2021

Next up: object detection & segmentation! I've got an annotation spec sorted out, a lot of datasets ready to rip, and yeah, that means timm support for object detection, and eventually segmentation, is finally under development :O
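
To try the multi-label data before timm script support lands, a hedged peek with the datasets library. Split and column names are assumptions; the dataset card has the authoritative layout.

```python
# Hedged sketch: a quick look at the multi-label dataset.
from datasets import load_dataset

ds = load_dataset("timm/plant-pathology-2021", split="train")
example = ds[0]
print(example.keys())  # expect an image column plus one or more label columns
```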
Reacted to ArthurZ's post with πŸ”₯ 6 days ago
Reacted to jsulz's post with πŸ”₯ 6 days ago
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
πŸš€ Download just the updates, not the whole file.
🧠 Your files are stored as deduplicated chunks.

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
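
To make the idea concrete, here's a toy sketch of content-defined chunking with a naive rolling hash. It is illustrative only: production systems use tuned algorithms like FastCDC with gear hashes and min/max chunk sizes.

```python
# Toy content-defined chunking (CDC) sketch: boundaries depend on content, not
# offsets, so an edit in one region only changes the chunks it touches.
import hashlib

def chunk_boundaries(data: bytes, mask: int = 0x0FFF, min_size: int = 48):
    """Yield (start, end) spans; a boundary fires when the rolling hash
    matches a fixed bit pattern (here: low 12 bits all zero)."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        if i - start >= min_size and (rolling & mask) == 0:
            yield start, i + 1
            start, rolling = i + 1, 0
    if start < len(data):
        yield start, len(data)

def dedup_store(data: bytes) -> dict[str, bytes]:
    """Key chunks by content hash: identical chunks across versions are stored once."""
    return {hashlib.sha256(data[s:e]).hexdigest(): data[s:e]
            for s, e in chunk_boundaries(data)}
```

Two versions of a file that differ in one region share every chunk outside that region, which is where the upload and download savings come from.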
Reacted to takarajordan's post with πŸ‘ 6 days ago
First post here goes!

takarajordan/CineDiffusion

Super excited to announce CineDiffusion 🎥! It creates images up to 4.2 megapixels in cinematic ultrawide formats like:
- 2.39:1 (Modern Widescreen)
- 2.76:1 (Ultra Panavision 70)
- 3.00:1 (Experimental Ultra-wide)
- 4.00:1 (Polyvision)
- 2.55:1 (CinemaScope)
- 2.20:1 (Todd-AO)

More to come soon!!

Thanks to @John6666 and @Resoldjew for your early support <3

And thanks to the team at ShuttleAI for their brand new Shuttle-3 model, what an amazing job.

shuttleai/shuttle-3-diffusion
Reacted to davidberenstein1957's post with πŸ‘€ 6 days ago
πŸ€—πŸ”­ Introducing Observers: A Lightweight SDK for AI Observability πŸ”­πŸ€—

Observers is an open-source Python SDK that provides comprehensive observability for AI applications. Our library makes it easy to:

- Track and record interactions with AI models
- Store observations in multiple backends
- Query and analyse your AI interactions with ease

https://huggingface.co/blog/davidberenstein1957/observers-a-lightweight-sdk-for-ai-observability
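
A hypothetical usage sketch of the pattern (the wrapper and store names below are my assumptions, not necessarily the SDK's documented API; see the blog and repo for the real interface):

```python
# Hypothetical sketch only: wrap_openai and DuckDBStore are assumed names,
# not confirmed Observers API; consult the repo for the documented interface.
from openai import OpenAI
from observers.observers import wrap_openai  # assumed import path
from observers.stores import DuckDBStore    # assumed backend

store = DuckDBStore()                        # record observations locally
client = wrap_openai(OpenAI(), store=store)  # calls pass through, now tracked

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, observability!"}],
)
print(response.choices[0].message.content)
```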
Reacted to AdinaY's post with 😎 6 days ago
Reacted to singhsidhukuldeep's post with πŸ‘€ 6 days ago
Sorry judge, my lawyer hallucinated? πŸ˜‚ If you get an AI lawyer, you would want it to be hallucination-free!

New @Stanford and @Yale research reveals surprising findings about leading AI legal research tools. Here's what you need to know:

>> Key Findings
The study tested LexisNexis (Lexis+ AI), Thomson Reuters (Westlaw AI & Ask Practical Law AI), and GPT-4, finding hallucination rates between 17% and 33% despite claims of being "hallucination-free".

>> Technical Deep Dive
The research evaluated these tools, which are built on a Retrieval-Augmented Generation (RAG) architecture operating in two crucial steps (a toy sketch follows the lists below):

1. Retrieval System:
- Uses neural text embeddings to capture semantic meaning
- Employs both lexical and semantic search mechanisms
- Implements document filtering and extraction
- Retrieves relevant legal documents from vast databases

2. Generation Pipeline:
- Processes retrieved documents alongside original queries
- Synthesizes information from multiple legal sources
- Generates responses based on retrieved context
- Includes citation verification mechanisms
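
To make the two steps concrete, here's a toy RAG sketch. It is illustrative only: the commercial tools studied are far more elaborate, and the embedding model named here is just a common default, not anything these products use.

```python
# Toy RAG sketch: semantic retrieval (step 1) feeding a grounded prompt (step 2).
# Illustrative only; model choice and corpus are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Case A: the court held that an oral contract for land is unenforceable.",
    "Statute B: consumer contracts must disclose all recurring fees.",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 1: embed the query and rank documents by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    return [documents[i] for i in np.argsort(-(doc_vecs @ q))[:k]]

def build_prompt(query: str) -> str:
    """Step 2: hand retrieved context to the generator to reduce hallucination."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Are oral land contracts enforceable?"))
```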

>> Performance Breakdown:
- Lexis+ AI: 65% accuracy rate
- Westlaw AI: 42% accuracy rate
- Ask Practical Law AI: Over 60% incomplete answers

>> Why This Matters
This research exposes critical vulnerabilities in AI legal tools that lawyers increasingly rely on. It's essential for legal professionals to understand these limitations when incorporating AI into their practice.
Reacted to AkimfromParis's post with ❀️ 6 days ago
πŸ‡―πŸ‡΅ The Open Japanese LLM Leaderboard created by LLM-jp 🌸 in partnership with HuggingFace πŸ€— was released today!

Blog: https://huggingface.co/blog/leaderboard-japanese
Space: llm-jp/open-japanese-llm-leaderboard

🌍 The leaderboard is available in both Japanese and English
πŸ“š Based on the evaluation tool, llm-jp-eval with more than 20 datasets for Japanese LLMs
πŸ“Š The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
πŸ’» For the comfort of users, we chose a horizontal UI, and implemented it in a light and dark theme on Gradio
πŸ”¬ The radar chart provides a very interesting visualization of metrics!
🌱 We are using the Japanese research platform, MDX, so please be patient!
⚡ LLMs bigger than 70B will be evaluated soon…

How do you say “GPUs Go Brrr” in Japanese? GPUがブンブン〜! (pronounced "GPU ga bunbun!") 🔥
Reacted to fdaudens's post with πŸ‘€ 6 days ago
πŸš€ DeepSeek just dropped DeepSeek-R1-Lite-Preview with β€œreasoning” capacity.

- Matches OpenAI o1-preview on AIME & MATH benchmarks.
- Transparent process output
- Open-source model to be released

Try it out: https://chat.deepseek.com/
Reacted to rwightman's post with πŸš€ 6 days ago
Want to validate some hparams or figure out what timm model to use before committing to download or training with a large dataset? Try mini-imagenet: timm/mini-imagenet

I had this sitting on my drive and forgot where I pulled it together from. It's 100 classes of ImageNet, 50k train and 10k val images (from the ImageNet-1k train set), and 5k test images (from the ImageNet-1k val set). 7.4GB instead of >100GB for the full ImageNet-1k. This version is not reduced-resolution like some other 'mini' versions. Super easy to use with timm train/val scripts; check out the dataset card.

I often check fine-tuning with even smaller datasets like:
* timm/resisc45
* timm/oxford-iiit-pet
But those are a bit small to train any modest size model w/o starting from pretrained weights.
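
For a quick look before wiring it into timm scripts, a hedged sketch with the datasets library (split names follow the description above; the dataset card has the canonical usage):

```python
# Hedged sketch: sanity-check mini-imagenet sizes before a training run.
from datasets import load_dataset

ds = load_dataset("timm/mini-imagenet")
print({split: ds[split].num_rows for split in ds})  # expect ~50k/10k/5k per the post
```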
Reacted to Symbol-LLM's post with πŸ”₯ 6 days ago
πŸ₯³ Thrilled to introduce our recent efforts on bootstrapping VLMs for multi-modal chain-of-thought reasoning !

πŸ“• Title: Vision-Language Models Can Self-Improve Reasoning via Reflection

πŸ”— Link: Vision-Language Models Can Self-Improve Reasoning via Reflection (2411.00855)

πŸ˜‡Takeaways:

- We found that VLMs can self-improve reasoning performance through a reflection mechanism, and importantly, this approach scales with test-time compute.

- Evaluations on comprehensive and diverse Vision-Language reasoning tasks are included!