58 115 420

Yacine Jernite

yjernite

https://yjernite.github.io/

AI & ML interests

Technical, community, and regulatory tools of AI governance @HuggingFace

Recent Activity

upvoted a collection 1 day ago

OLMo 2

liked a Space 1 day ago

PR-Puppets/PR-Puppet-Sora

upvoted an article 1 day ago

Let’s make a generation of amazing image generation models

View all activity

Articles

EU Training Data Transparency: A Proposal for a Sufficiently Detailed Summary 📑📚🖼️🇪🇺

Jul 3

• 8

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

Jun 24

• 33

Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️

Mar 27

• 2

AI Watermarking 101: Tools and Techniques

Feb 26

• 15

📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️

Dec 5, 2023

• 1

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 27

AI Policy @🤗: Open ML Considerations in the EU AI Act

Jul 24, 2023

• 2

AI Policy @🤗: Response to the U.S. NTIA's Request for Comment on AI Accountability

Jun 20, 2023

Hugging Face Selected for the French Data Protection Agency Enhanced Support Program

May 15, 2023

Introducing the Data Measurements Tool: an Interactive Tool for Looking at Datasets

Nov 29, 2021

Organizations

yjernite's activity

upvoted a collection 1 day ago

OLMo 2

Collection

Artifacts for the second set of OLMo models. • 17 items • Updated about 2 hours ago • 27

liked a Space 1 day ago

Running

502

👁

Let’s make a generation of amazing image generation models

•

1 day ago

• 29

liked a Space 2 days ago

Running

🔥

Dataset Exploration

Collection

3 items • Updated 17 days ago • 4

Dataset transformation, preparation and edition

Collection

2 items • Updated 5 days ago • 5

Reacted to cfahlgren1's post with ❤️ 7 days ago

Post

2900

You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned

1 reply

Reacted to fdaudens's post with 🔥 14 days ago

Post

1830

Fascinating point from @thomwolf at Web Summit: AI misuse (deepfakes, fake news) is actually easier to make with closed models, not with open-source ones.

This challenges the common narrative that open-source AI is inherently more dangerous. The reality is more nuanced - while we may think open source is technically easier to misuse, closed models' accessibility and product-focused design appear to be driving more actual harm.

Important context for current AI safety discussions and regulation debates.

Do you agree? 👇