Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Recent Activity

updated a dataset about 19 hours ago

librarian-bots/dataset_cards_with_metadata

liked a model about 23 hours ago

PrimeIntellect/INTELLECT-1-Instruct

upvoted a collection about 23 hours ago

INTELLECT-1 Dataset

Articles

Let’s make a generation of amazing image generation models

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

upvoted a collection about 23 hours ago

INTELLECT-1 Dataset

INTELLECT-1 Training dataset • 5 items • Updated Oct 8 • 18

upvoted a paper 1 day ago

On Limitations of LLM as Annotator for Low Resource Languages

Paper • 2411.17637 • Published 4 days ago • 2

upvoted 2 articles 2 days ago

Use Models from the Hugging Face Hub in LM Studio

Article • 2 days ago • 54

Fine-Tuning 1B LLaMA 3.2: A Comprehensive Step-by-Step Guide with Code

Article • Oct 2 • 34

upvoted an article 4 days ago

Let’s make a generation of amazing image generation models

Article • 4 days ago • 32

upvoted an article 5 days ago

Model2Vec: Distill a Small Fast Model from any Sentence Transformer

Article • Oct 14 • 56

upvoted a collection 8 days ago

Models for dataset curation

8 items • Updated 6 days ago • 17

upvoted 2 papers 8 days ago

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Paper • 2411.14343 • Published 9 days ago • 7

Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published 9 days ago • 38

upvoted 2 collections 9 days ago

Tulu 3 Datasets

All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated 3 days ago • 48

Tulu 3 Models

All models released with Tulu 3 -- state of the art open post-training recipes. • 7 items • Updated 3 days ago • 24

upvoted a paper 9 days ago

Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline

Paper • 2411.12814 • Published 11 days ago • 20

upvoted an article 9 days ago

Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK

Article • 9 days ago • 32

upvoted a collection 10 days ago

OpenScholar_V1

The set of models, index, data associated with the paper "OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs". • 8 items • Updated 9 days ago • 26

upvoted a paper 10 days ago

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published 11 days ago • 47

upvoted a paper 12 days ago

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published 15 days ago • 106

upvoted 4 papers 13 days ago

Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Paper • 2410.23956 • Published about 1 month ago • 1

SWEb: A Large Web Dataset for the Scandinavian Languages

Paper • 2410.04456 • Published Oct 6 • 1

AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model

Paper • 2411.09012 • Published 17 days ago • 1

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Paper • 2309.07462 • Published Sep 14, 2023 • 4