Jared Sulzdorf PRO

jsulz

AI & ML interests

NLP + (Law|Medicine) & Ethics

Recent Activity

Articles

Organizations

Hugging Face · Spaces Examples · Blog-explorers · Journalists on Hugging Face · Hugging Face Discord Community · Xet Team · open/ acc

jsulz's activity

Reacted to prithivMLmods's post with 🤗❤️🔥 3 days ago
HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

• The one thing that needs to be remembered is the 'username'.

• And yeah, thank you, @maxiw, for creating the awesome dataset and sharing it here! 🙌

• [ Dataset ] : maxiw/hf-posts

.
.
.
@prithivMLmods
replied to their post 4 days ago

Great question! We've talked about torrents before, actually.

How would you include torrents in your workflows today?

There's nothing stopping us from doing it, but the user/developer experience doesn't quite align with what we're trying to support right now. There are benefits to leveraging CDNs as we do today, and this integrates relatively seamlessly with existing clients (e.g., huggingface_hub) that are used across the Hub.
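For illustration, this is roughly what that client-side experience looks like today: a single call to huggingface_hub resolves a file and pulls it from the CDN-backed storage. The repo and filename below are hypothetical placeholders, not real artifacts.

```python
# Minimal sketch of a download via huggingface_hub; the repo_id and
# filename are hypothetical placeholders used only for illustration.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="some-user/some-model",   # hypothetical repository
    filename="model.safetensors",     # hypothetical file in that repo
)
print(local_path)  # path to the locally cached copy
```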

Maybe if there's enough interest in the future!

posted an update 4 days ago
Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we're redesigning the architecture that supports uploads and downloads on the Hub.

Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).

Here's what's coming:

📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression and cuts redundant transfers, speeding up uploads (see the sketch after this list).
⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
🔒 Enhanced security: Validate uploads before storage to block malicious or invalid data.
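To make the chunk-level idea concrete, here is a toy sketch of deduplication against a content-addressed store. The fixed chunk size, SHA-256 addressing, and in-memory dict are illustrative assumptions, not the actual protocol or backend.

```python
# Toy sketch of chunk-level deduplication against a content-addressed store
# (CAS): each chunk is keyed by the hash of its bytes, so a chunk the store
# already holds never needs to be transferred again.
import hashlib

CHUNK_SIZE = 64 * 1024          # illustrative fixed chunk size, not the real protocol's
cas: dict[str, bytes] = {}      # in-memory stand-in for the content-addressed store

def upload(data: bytes) -> list[str]:
    """Split data into chunks, store new chunks by hash, return the chunk addresses."""
    addresses = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        address = hashlib.sha256(chunk).hexdigest()
        if address not in cas:   # deduplication: known chunks are skipped
            cas[address] = chunk
        addresses.append(address)
    return addresses

def download(addresses: list[str]) -> bytes:
    """Reassemble the original bytes from a list of chunk addresses."""
    return b"".join(cas[a] for a in addresses)
```

Re-uploading a file that shares most of its bytes with an earlier version then only transfers the handful of chunks whose addresses are new.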

We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs.

The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.

🔗 Read the blog post for the full details: https://huggingface.co/blog/rearchitecting-uploads-and-downloads

🌟 Check out our interactive demo to explore the data yourself!
xet-team/cas-analysis

We'd love to hear your feedback - let us know if you have questions or want to see more.
Reacted to davanstrien's post with 🔥 5 days ago
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
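As a rough illustration of the "posts per second" idea, the sketch below simply counts raw firehose frames over one-second windows. The relay endpoint is my assumption, and a real dashboard would decode the CBOR frames to count actual posts rather than all events.

```python
# Rough sketch: counting events per second from the Bluesky firehose.
# Assumes the public relay endpoint below; counts raw frames, not decoded posts.
import asyncio
import time

import websockets  # pip install websockets

FIREHOSE = "wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos"

async def count_events_per_second() -> None:
    async with websockets.connect(FIREHOSE) as ws:
        count, window_start = 0, time.monotonic()
        async for _frame in ws:          # each frame is one firehose event
            count += 1
            now = time.monotonic()
            if now - window_start >= 1.0:
                print(f"{count} events in the last second")
                count, window_start = 0, now

asyncio.run(count_events_per_second())
```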
Reacted to reach-vb's post with ❤️ 6 days ago
Massive week for Open AI/ML:

Mistral Pixtral & Instruct Large - ~123B, 128K context, multilingual, json + function calling & open weights
mistralai/Pixtral-Large-Instruct-2411
mistralai/Mistral-Large-Instruct-2411

Allen AI Tülu 70B & 8B - competitive with Claude 3.5 Haiku, beats all major open models like Llama 3.1 70B, Qwen 2.5, and Nemotron
allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5
allenai/tulu-3-datasets-673b8df14442393f7213f372

LLaVA-o1 - VLM capable of spontaneous, systematic reasoning, similar to GPT-o1; the 11B model outperforms gemini-1.5-pro, gpt-4o-mini, and llama-3.2-90B-vision
Xkev/Llama-3.2V-11B-cot

Black Forest Labs Flux.1 tools - four new state-of-the-art model checkpoints & 2 adapters for fill, depth, canny & redux, open weights
reach-vb/black-forest-labs-flux1-6743847bde9997dd26609817

Jina AI Jina CLIP v2 - general-purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, matryoshka representations (1024 to 64)
jinaai/jina-clip-v2

Apple AIM v2 & CoreML MobileCLIP - large-scale vision encoders that outperform CLIP and SigLIP, plus CoreML-optimised MobileCLIP models
apple/aimv2-6720fe1558d94c7805f7688c
apple/coreml-mobileclip

A lot more got released, like OpenScholar ( OpenScholar/openscholar-v1-67376a89f6a80f448da411a6), smoltalk ( HuggingFaceTB/smoltalk), Hymba ( nvidia/hymba-673c35516c12c4b98b5e845f), the Open ASR Leaderboard ( hf-audio/open_asr_leaderboard), and much more.

Can't wait for next week! 🤗
Reacted to BrigitteTousi's post with 🚀 8 days ago
Reacted to fdaudens's post with ❤️ 8 days ago
🦋 Hug the butterfly! You can now add your Bluesky handle to your Hugging Face profile! ✨
Reacted to elliesleightholm's post with 🤗 9 days ago
posted an update 10 days ago
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That's where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

โฉ Only upload the chunks that changed.
๐Ÿš€ Download just the updates, not the whole file.
๐Ÿง  We store your file as deduplicated chunks

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub.
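As a toy illustration of the CDC idea (not the actual rolling hash or parameters used on the Hub), the sketch below derives chunk boundaries from the bytes themselves, so an edit only disturbs the chunks around it and everything else deduplicates against the previous version.

```python
# Toy sketch of content-defined chunking (CDC): boundaries come from the
# content, so unchanged regions keep producing identical, deduplicable chunks.
# The hash, mask, and size bounds are illustrative assumptions only.
import hashlib

MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024   # illustrative chunk size bounds
MASK = (1 << 13) - 1                        # targets roughly 8 KiB average chunks

def cdc_chunks(data: bytes):
    """Yield (address, chunk) pairs with content-defined boundaries."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # cheap content-driven hash (illustrative)
        size = i - start + 1
        at_boundary = size >= MIN_SIZE and (h & MASK) == 0
        if at_boundary or size >= MAX_SIZE or i == len(data) - 1:
            chunk = data[start:i + 1]
            yield hashlib.sha256(chunk).hexdigest(), chunk
            start, h = i + 1, 0
```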

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks