
Garreth Lee PRO

garrethlee


garrethlee's activity

posted an update 2 days ago

Does tokenizing numbers into single digits outperform three-digit or BPE tokenization for arithmetic tasks? We explore various tokenization methods in our upcoming blog (releasing next week 👀)!

🔹 Bringing objectivity to comparisons

Existing comparisons of number tokenization methods often ignore differences in the models’ compute budgets: larger tokenizer vocabularies naturally lead to more parameters, so a bigger model may win simply because it has more capacity to “learn”, which makes the comparison less objective.

We addressed this by keeping architectures consistent but adjusting the number of hidden layers to produce roughly equal parameter counts.
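To make the idea of parameter matching concrete, here is a minimal sketch. It is not the setup from the upcoming blog: the counting formula is a common rough approximation for a decoder-only transformer (tied embeddings, 4x MLP expansion, no biases or layer norms), and the vocabulary sizes, hidden size, and function names are hypothetical.

```python
def transformer_params(vocab_size: int, hidden: int, layers: int) -> int:
    """Rough parameter count: tied embedding matrix plus the transformer blocks."""
    embed = vocab_size * hidden
    # Per block: ~4*h^2 for attention projections + ~8*h^2 for a 4x-expansion MLP.
    per_layer = 12 * hidden * hidden
    return embed + layers * per_layer


def layers_to_match(target_params: int, vocab_size: int, hidden: int) -> int:
    """Layer count that brings a model with this vocabulary closest to target_params."""
    embed = vocab_size * hidden
    per_layer = 12 * hidden * hidden
    return max(1, round((target_params - embed) / per_layer))


# Hypothetical example: a baseline with a 32k vocabulary and 8 layers ...
baseline = transformer_params(vocab_size=32_000, hidden=512, layers=8)
# ... versus a tokenizer with a 50k vocabulary, which needs fewer layers
# to land on roughly the same total.
matched_layers = layers_to_match(baseline, vocab_size=50_000, hidden=512)
matched = transformer_params(vocab_size=50_000, hidden=512, layers=matched_layers)
print(baseline, matched_layers, matched)
```

Because layers come in whole units, the match is only approximate, which is consistent with the "roughly equal" wording above.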

🔹 Key results

We trained models on the same data mix and evaluated their performance on various arithmetic tasks (digits, operations, floats vs. ints):

- When splitting evals based on operators, single-digit tokenization consistently outperformed other methods.
- Right-to-left tokenization (which I covered in a previous post) matched or exceeded left-to-right approaches in all tasks.
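For readers unfamiliar with the terms, the sketch below shows how the digit-grouping schemes differ on a single number. It only illustrates the grouping itself; BPE is not shown because its merges are learned from data, and the helper name `chunk_digits` is hypothetical. Applying right-to-left grouping to three-digit chunks is an assumption made here for illustration.

```python
def chunk_digits(number: str, size: int, right_to_left: bool = False) -> list[str]:
    """Split a digit string into groups of `size` digits.

    Left-to-right grouping starts at the most significant digit, so any
    short group ends up at the end. Right-to-left grouping starts at the
    least significant digit, so any short group ends up at the front.
    """
    if right_to_left:
        head = len(number) % size  # length of the short leading group, if any
        chunks = [number[:head]] if head else []
        chunks += [number[i:i + size] for i in range(head, len(number), size)]
        return chunks
    return [number[i:i + size] for i in range(0, len(number), size)]


print(chunk_digits("1234567", 1))                      # ['1', '2', '3', '4', '5', '6', '7']
print(chunk_digits("1234567", 3))                      # ['123', '456', '7']
print(chunk_digits("1234567", 3, right_to_left=True))  # ['1', '234', '567']
```

Note how the right-to-left groups line up with place value (thousands, millions), so the same digits land in the same kind of chunk regardless of the number's length.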

All in all, single-digit tokenization outperforms the other methods, and, in line with our previous post’s finding, R2L works better than L2R tokenization, although that gap is smaller than the one between single-digit tokenization and the rest!

The wait is almost over 🤗: the full report is coming next week - stay tuned!

liked a Space 5 days ago


Synthetic Data Generator (argilla/synthetic-data-generator)

Build datasets using natural language

Reacted to jsulz's post with 🔥 10 days ago

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks.

In our benchmarks, we found that using content-defined chunking (CDC) to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.
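To illustrate why chunk-level versioning helps, here is a minimal sketch of content-defined chunking with deduplication. It is a generic gear-style rolling hash, not the Hub's actual implementation; the mask, size limits, and the names `cdc_chunks` and `store` are all assumptions chosen for the example.

```python
import hashlib
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # fixed random value per byte
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, mask: int = 0x3FF, min_size: int = 64, max_size: int = 8192):
    """Cut `data` where a rolling hash of the recent bytes hits a bit pattern.

    Boundaries depend on local content rather than byte offsets, so an
    insertion only disturbs the chunks near the edit.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store(chunks, cas):
    """Add chunks to a content-addressed dict; return the file's key list and the number of new chunks."""
    keys, new = [], 0
    for chunk in chunks:
        key = hashlib.sha256(chunk).hexdigest()
        if key not in cas:
            cas[key] = chunk
            new += 1
        keys.append(key)
    return keys, new


# Two "versions" of a file: v2 is v1 with a few bytes inserted near the start.
rng = random.Random(1)
v1 = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
v2 = v1[:1000] + b"a small edit" + v1[1000:]

cas = {}
keys1, new1 = store(cdc_chunks(v1), cas)
keys2, new2 = store(cdc_chunks(v2), cas)
print(f"v1 stored {new1} chunks; v2 added only {new2} new chunks")
```

With fixed-size chunks, the same insertion would shift every later boundary and almost every chunk after the edit would look new. Here the second version reuses most of the stored chunks, which is the property behind "only upload the chunks that changed".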

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks