Pedro Cuenca

pcuenq

Recent Activity

New activity about 1 hour ago

HuggingFaceTB/README:Remove typo

New activity about 5 hours ago

ggml-org/gguf-my-repo:If generating model cards readmes, consider adding support for these extra authorship parameters

Reacted to jsulz's post with 🔥 about 5 hours ago

Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we’re redesigning the architecture that supports uploads and downloads on the Hub.

Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).

Here’s what’s coming:
- 📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression, and reduces redundant transfers, speeding up uploads.
- ⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
- 🔒 Enhanced security: Uploads are validated before storage to block malicious or invalid data.

We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs. The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.

🔗 Read the blog post for the full details: https://huggingface.co/blog/rearchitecting-uploads-and-downloads
🌟 Check out our interactive demo to explore the data yourself! https://huggingface.co/spaces/xet-team/cas-analysis

We’d love to hear your feedback - let us know if you have questions or want to see more.

Articles

Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints

May 1

• 68

Welcome Llama 3 - Meta's new open LLM

Apr 18

• 279

CodeGemma - an official Google release for code LLMs

Apr 9

• 99

Welcome Gemma - Google's new open LLM

Feb 21

• 18

Welcome Mixtral - a SOTA Mixture of Experts on Hugging Face

Dec 11, 2023

• 11

SDXL in 4 steps with Latent Consistency LoRAs

Nov 9, 2023

• 10

Accelerating Stable Diffusion XL Inference with JAX on Cloud TPU v5e

Oct 3, 2023

• 5

Introducing Würstchen: Fast Diffusion for Image Generation

Sep 13, 2023

• 12

Spread Your Wings: Falcon 180B is here

Sep 6, 2023

• 4

Code Llama: Llama 2 learns to code

Aug 25, 2023

• 8

Releasing Swift Transformers: Run On-Device LLMs in Apple Devices

Aug 8, 2023

• 24

Stable Diffusion XL on Mac with Advanced Core ML Quantization

Jul 27, 2023

• 4

Happy 1st anniversary 🤗 Diffusers!

Jul 20, 2023

• 1

Llama 2 is here - get it on Hugging Face

Jul 18, 2023

• 22

Faster Stable Diffusion with Core ML on iPhone, iPad, and Mac

Jun 15, 2023

• 4

The Falcon has landed in the Hugging Face ecosystem

Jun 5, 2023

• 9

Train your ControlNet with diffusers

Mar 24, 2023

• 17

Swift Diffusers: Fast Stable Diffusion for Mac

Feb 24, 2023

• 4

Using LoRA for Efficient Stable Diffusion Fine-Tuning

Jan 26, 2023

• 39

Using Stable Diffusion with Core ML on Apple Silicon

Dec 1, 2022

• 4

Hugging Face Machine Learning Demos on arXiv

Nov 17, 2022

Training Stable Diffusion with Dreambooth using 🧨 Diffusers

Nov 7, 2022

• 16

Stable Diffusion in JAX/Flax 🚀

Oct 13, 2022

• 2

Stable Diffusion with 🧨 Diffusers

Aug 22, 2022

• 36

Posts 1

OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes ranging from 270M to 3B parameters. According to benchmarks, their quality is similar to that of OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.
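To make the idea of layer-wise scaling concrete, here is a minimal sketch that is not taken from the OpenELM code. It assumes the head count grows linearly with depth; the function name and every parameter value (`num_layers`, `model_dim`, `head_dim`, `alpha_min`, `alpha_max`) are hypothetical and chosen only for illustration.

```python
def heads_per_layer(num_layers, model_dim, head_dim, alpha_min, alpha_max):
    """Return an illustrative attention-head count for each layer.

    A scaling factor is interpolated linearly from alpha_min (first layer)
    to alpha_max (last layer), so deeper layers get more heads.
    All arguments are hypothetical, not OpenELM's real configuration.
    """
    heads = []
    for layer in range(num_layers):
        # Position of this layer in [0, 1]; a single layer maps to 0.
        position = layer / (num_layers - 1) if num_layers > 1 else 0.0
        alpha = alpha_min + (alpha_max - alpha_min) * position
        heads.append(max(1, round(alpha * model_dim / head_dim)))
    return heads


# Hypothetical 8-layer model: shallow layers get fewer heads than deep ones.
layer_heads = heads_per_layer(
    num_layers=8, model_dim=1024, head_dim=64, alpha_min=0.5, alpha_max=1.0
)
print(layer_heads)
```

With these made-up values the first layer ends up with half as many heads as the last one, which is the kind of non-uniform allocation the post refers to.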

I converted these models to Core ML for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community on the Hub for anyone who wants to integrate them into their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.
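A fixed sequence length means every input handed to the model must contain exactly 128 token positions. The sketch below shows one way a caller might pad or truncate a prompt to that shape. It is an illustration only: the pad token id of 0, the attention mask output, and the choice to keep the most recent tokens when truncating are all assumptions, not details of the conversion script or of swift-transformers.

```python
SEQUENCE_LENGTH = 128  # fixed length stated in the post
PAD_TOKEN_ID = 0       # assumption: the real pad id depends on the tokenizer


def fit_to_fixed_length(token_ids, length=SEQUENCE_LENGTH, pad_id=PAD_TOKEN_ID):
    """Return (ids, mask), both exactly `length` items long.

    Prompts longer than `length` keep only their most recent tokens;
    shorter prompts are right-padded. The mask marks real tokens with 1.
    """
    kept = list(token_ids)[-length:]
    padding = length - len(kept)
    ids = kept + [pad_id] * padding
    mask = [1] * len(kept) + [0] * padding
    return ids, mask


short_ids, short_mask = fit_to_fixed_length([5, 6, 7])
print(len(short_ids), sum(short_mask))
```

With right padding, the prediction for the next token would be read at the position of the last real token (`sum(mask) - 1`), not at the end of the padded sequence.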

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M model on my M1 Max, and 6.5 tok/s with the largest (3B) model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95
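For readers less familiar with half precision, the sketch below shows the general kinds of error float16 can introduce, using only Python's standard `struct` module. It is not a diagnosis of the issue linked above, whose cause is still unknown; the helper names are made up for this example.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_in_float16(value):
    """Return False when the value is too large for half precision."""
    try:
        struct.pack("<e", value)
        return True
    except (OverflowError, struct.error):
        return False


# Small values lose digits: float16 keeps only about 3 decimal digits.
print(to_float16(0.1))
# Above 2048, not every integer is representable any more.
print(to_float16(2049.0))
# Values beyond the float16 range (maximum 65504) cannot be stored at all.
print(fits_in_float16(70000.0))
```

Rounding of small values and overflow of large intermediate values are two common ways a float16 pipeline can drift from its float32 counterpart, which is why comparing intermediate outputs between the two precisions is a typical first debugging step.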

I'm also looking at optimizing inference using an experimental KV cache in swift-transformers. It's a bit tricky because the layers have a varying number of attention heads, but I'm curious to see how much this feature can accelerate performance in this model family :)
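The difficulty can be sketched in a few lines: when the head count differs per layer, the cache cannot be one uniform tensor, so each layer needs its own shape. The code below is a hypothetical illustration in Python, not the swift-transformers implementation; the per-layer head counts, the head dimension, and the `(2, heads, tokens, head_dim)` layout are all assumptions.

```python
def kv_cache_shapes(kv_heads_per_layer, max_tokens, head_dim):
    """Return one (2, heads, max_tokens, head_dim) shape per layer.

    The leading 2 stands for the key and value tensors. Because the head
    count varies by layer, every layer gets its own shape.
    """
    return [(2, heads, max_tokens, head_dim) for heads in kv_heads_per_layer]


def cache_bytes(shapes, bytes_per_value=4):
    """Total memory for all layers; 4 bytes per value matches float32."""
    total = 0
    for kv, heads, tokens, dim in shapes:
        total += kv * heads * tokens * dim * bytes_per_value
    return total


# Hypothetical 4-layer model whose deeper layers have more key/value heads.
shapes = kv_cache_shapes([2, 2, 3, 4], max_tokens=128, head_dim=64)
total_bytes = cache_bytes(shapes)
print(shapes)
print(total_bytes)
```

Keeping a list of per-layer shapes, as opposed to a single stacked array, is the simplest way to accommodate the varying head counts, at the cost of handling each layer's buffer separately.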

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but neither the Llama 2 chat template nor the default Alignment Handbook one that was used for training is recognized. Any ideas on this are welcome!