Jared Sulzdorf PRO

jsulz

AI & ML interests

NLP + (Law|Medicine) & Ethics

Organizations

Hugging Face, Spaces Examples, Blog-explorers, Journalists on Hugging Face, Hugging Face Discord Community, Xet Team, open/ acc

jsulz's activity

Reacted to prithivMLmods's post with 🤗❤️🔥 3 days ago
HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

🥠 The only thing you need to remember is your username.

🥠 And yeah, thank you, @maxiw, for creating the awesome dataset and sharing it here! 🙌

🥠[ Dataset ] : maxiw/hf-posts

replied to their post 4 days ago

Great question! We've talked about torrents before, actually.

How would you include torrents in your workflows today?

There's nothing stopping us from doing it, but the user/developer experience doesn't quite align with what we're trying to support right now. There are benefits to leveraging CDNs as we do today, and this integrates relatively seamlessly with existing clients (e.g., huggingface_hub) that are used across the Hub.

Maybe if there's enough interest in the future!

posted an update 4 days ago
Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we’re redesigning the architecture that supports uploads and downloads on the Hub.

Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).

Here’s what’s coming:

📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression, reducing redundant transfers and speeding up uploads (a rough sketch of the idea follows this list).
⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
🔒 Enhanced security: Validate uploads before storage to block malicious or invalid data.
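To make the chunk-level idea concrete, here's a minimal, illustrative sketch of deduplicated uploads against a content-addressed store. It is not the Hub's actual protocol or client code; the fixed chunk size, SHA-256 addressing, and in-memory store are simplifying assumptions.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunks for simplicity; real systems use variable sizes


class ContentAddressedStore:
    """Toy CAS: every chunk is stored and looked up by the hash of its bytes."""

    def __init__(self):
        self._chunks: dict[str, bytes] = {}

    def has(self, digest: str) -> bool:
        return digest in self._chunks

    def put(self, chunk: bytes) -> str:
        # The address is derived from the content itself, which is what lets
        # a server validate an upload before accepting it.
        digest = hashlib.sha256(chunk).hexdigest()
        self._chunks.setdefault(digest, chunk)
        return digest

    def get(self, digest: str) -> bytes:
        return self._chunks[digest]


def upload(path: str, store: ContentAddressedStore) -> list[str]:
    """Split a file into chunks and transfer only the ones the store doesn't have."""
    manifest, skipped = [], 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if store.has(digest):
                skipped += 1  # deduplicated: nothing to transfer for this chunk
            else:
                store.put(chunk)
            manifest.append(digest)
    print(f"{len(manifest)} chunks, {skipped} deduplicated")
    return manifest  # a download walks this list and concatenates store.get(d) for each digest
```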

We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs.

The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.

🔗 Read the blog post for the full details: https://huggingface.co/blog/rearchitecting-uploads-and-downloads

🌟 Check out our interactive demo to explore the data yourself!
xet-team/cas-analysis

We’d love to hear your feedback - let us know if you have questions or want to see more.
Reacted to davanstrien's post with 🔥 5 days ago
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
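For anyone curious what drinking from the firehose looks like in code, here is a rough sketch of counting new posts per second with the atproto Python SDK's firehose client. It is not the Space's actual code, and the bare-bones per-second counter is my own simplification.

```python
# Rough sketch (not the Space's code): count new Bluesky posts per second by
# subscribing to the AT Protocol firehose with the `atproto` Python SDK.
import time

from atproto import FirehoseSubscribeReposClient, models, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()
counts: dict[int, int] = {}  # unix second -> number of new posts observed


def on_message(message) -> None:
    commit = parse_subscribe_repos_message(message)
    if not isinstance(commit, models.ComAtprotoSyncSubscribeRepos.Commit):
        return  # ignore non-commit frames
    for op in commit.ops:
        # new posts arrive as "create" ops in the app.bsky.feed.post collection
        if op.action == "create" and op.path.startswith("app.bsky.feed.post/"):
            second = int(time.time())
            counts[second] = counts.get(second, 0) + 1
            print(f"{second}: {counts[second]} posts", end="\r")


client.start(on_message)  # blocks and streams firehose events until interrupted
```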
Reacted to reach-vb's post with ❤️ 6 days ago
Massive week for open AI/ML:

Mistral Pixtral & Instruct Large - ~123B, 128K context, multilingual, JSON + function calling & open weights
mistralai/Pixtral-Large-Instruct-2411
mistralai/Mistral-Large-Instruct-2411

Allen AI Tülu 70B & 8B - competitive with Claude 3.5 Haiku, beats all major open models like Llama 3.1 70B, Qwen 2.5, and Nemotron
allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5
allenai/tulu-3-datasets-673b8df14442393f7213f372

LLaVA-o1 - VLM capable of spontaneous, systematic reasoning, similar to GPT-o1; the 11B model outperforms gemini-1.5-pro, gpt-4o-mini, and llama-3.2-90B-vision
Xkev/Llama-3.2V-11B-cot

Black Forest Labs FLUX.1 Tools - four new state-of-the-art model checkpoints & 2 adapters for fill, depth, canny & redux, open weights
reach-vb/black-forest-labs-flux1-6743847bde9997dd26609817

Jina AI Jina CLIP v2 - general-purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, Matryoshka representations (1024 down to 64)
jinaai/jina-clip-v2

Apple AIM v2 & CoreML MobileCLIP - large-scale vision encoders that outperform CLIP and SigLIP, plus CoreML-optimised MobileCLIP models
apple/aimv2-6720fe1558d94c7805f7688c
apple/coreml-mobileclip

A lot more got released, like OpenScholar ( OpenScholar/openscholar-v1-67376a89f6a80f448da411a6), SmolTalk ( HuggingFaceTB/smoltalk), Hymba ( nvidia/hymba-673c35516c12c4b98b5e845f), the Open ASR Leaderboard ( hf-audio/open_asr_leaderboard), and much more.

Can't wait for next week! 🤗
Reacted to BrigitteTousi's post with 🚀 8 days ago
Reacted to fdaudens's post with ❤️ 8 days ago
🦋 Hug the butterfly! You can now add your Bluesky handle to your Hugging Face profile! ✨
Reacted to elliesleightholm's post with 🤗 9 days ago
posted an update 10 days ago
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your files as deduplicated chunks (a toy chunking sketch follows this list).
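For intuition on why we chunk on content rather than at fixed offsets, here's a toy content-defined chunker. It is not the CDC implementation behind the new backend; the window size, hash, and thresholds are arbitrary choices for illustration.

```python
# Toy content-defined chunking (CDC): chunk boundaries are placed where a hash
# of the trailing bytes matches a pattern, so an edit near the start of a file
# only disturbs nearby chunks instead of shifting every chunk after it.
import hashlib
import os

WINDOW = 48            # how many trailing bytes decide a boundary
MASK = (1 << 12) - 1   # boundaries roughly every 4 KiB on random data
MAX_CHUNK = 64 * 1024  # hard cap so unlucky data can't produce huge chunks


def cdc_chunks(data: bytes) -> list[bytes]:
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        window = data[max(start, i - WINDOW + 1) : i + 1]
        fingerprint = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        if (size >= WINDOW and (fingerprint & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Prepending bytes would shift every fixed-size chunk, but most CDC chunks survive:
original = os.urandom(512 * 1024)
edited = b"new header bytes" + original
before, after = cdc_chunks(original), cdc_chunks(edited)
reusable = set(before) & set(after)
print(f"{len(reusable)} of {len(after)} chunks already stored; only the rest need uploading")
```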

In our benchmarks, we found that using content-defined chunking (CDC) to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
Reacted to reach-vb's post with 🤗🔥 12 days ago
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) code-generation LLMs, with 32B tackling giants like Gemini 1.5 Pro and Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3B by DeepSeek - next iteration of their unified multimodal LLM Janus, now with rectified flow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive, and high-quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot. Can't wait for next week!

Let me know in the comments what I missed! 🤗
Reacted to m-ric's post with 🔥 14 days ago
The next big social network is not 🦋, it's Hub Posts! [INSERT STONKS MEME WITH LASER EYES]

See below: I got 105k impressions since regularly posting Hub Posts, coming close to my 275k on Twitter!

⚙️ Computed with the great dataset maxiw/hf-posts
⚙️ Thanks to Qwen2.5-Coder-32B for showing me how to access dict attributes in a SQL request!
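For anyone wondering what "accessing dict attributes in a SQL request" looks like: DuckDB, which powers the SQL Console, lets you reach into struct columns with dot notation. The snippet below is a toy stand-in rather than the actual query, and the column names are invented, not taken from maxiw/hf-posts.

```python
# Toy example of reading struct ("dict") fields in SQL with DuckDB.
# The schema is a made-up stand-in; the real maxiw/hf-posts columns differ.
import duckdb

con = duckdb.connect()  # in-memory database
con.sql("""
    CREATE TABLE posts AS
    SELECT * FROM (VALUES
        ({'name': 'merve',    'isPro': true},  171783),
        ({'name': 'fdaudens', 'isPro': false}, 135253),
        ({'name': 'merve',    'isPro': true},   42000)
    ) AS t(author, impressions)
""")

# Dot notation reaches into the struct column, much like attribute access in Python.
print(con.sql("""
    SELECT author.name AS username, SUM(impressions) AS total_impressions
    FROM posts
    GROUP BY author.name
    ORDER BY total_impressions DESC
"""))
```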

cc @merve who's far in front of me
Reacted to cfahlgren1's post with 🔥 15 days ago
Why use Google Drive when you can have:

• Free storage with generous limits🆓
• Dataset Viewer (Sorting, Filtering, FTS) 🔍
• Third Party Library Support
• SQL Console 🟧
• Security 🔒
• Community, Reach, and Visibility 📈

It's a no-brainer!

Check out our post on what you get instantly out of the box when you create a dataset.
https://huggingface.co/blog/researcher-dataset-sharing
Reacted to maxiw's post with ❤️ 17 days ago
I was curious to see what people post here on HF, so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
❤️: 7048 times
🔥: 5921 times
👍: 4856 times
🚀: 2549 times
🤗: 2065 times
Reacted to erinys's post with 🚀 about 1 month ago
posted an update about 2 months ago
The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co/xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.

Thanks to @yuchenglow and @port8080 for their analysis covering LFS usage from March 2022 to September 2024, we now have insights into what we're storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating

xet-team/lfs-analysis
Reacted to fdaudens's post with 🔥 about 2 months ago
This is how AI can be useful in journalism: Just tested DataTalk - a tool that lets you dig through campaign finance data with just your words.

It's transforming complex FEC filings and OpenSecrets datasets into actionable insights for journalists.

Key features for newsrooms:
- Natural language queries on FEC data
- Rapid insights on donors, spending, special interests
- SQL access for deep dives

Tested it out:
- Retrieved how much Harris and Trump raised
- Found top donors instantly (#1 is Timothy Mellon—have you heard about him?)
- Uncovered big self-funders like David Trone ($62M)

Pros:
- Saves hours of data wrangling
- Surfaces story leads quickly
- Transparent AI retrieval steps make the tool auditable

Awesome work by Stanford University's Open Virtual Assistant Lab, Big Local News, and Columbia University's Graduate School of Journalism. Expert-guided.

Remember: Always verify. Use for leads, not final copy. But this is gold for finding new leads.

How might this change campaign finance reporting? What other datasets need this treatment?

Try it out: https://www.datatalk.genie.stanford.edu/

#AIJournalism #campaignfinance #datajournalism #election2024
Reacted to clem's post with ❤️ about 2 months ago
Open-source AI creates healthy competition in a field where natural tendencies lead to extreme concentration of power. Imagine a world where only one or two companies could build software. This is the biggest risk and ethical challenge of them all IMO. Let's fight this!