Loubna Ben Allal

loubnabnl

AI & ML interests

LLMs, ML for code, Synthetic data

Recent Activity

updated a collection about 10 hours ago
SmolLM2

loubnabnl's activity

Reacted to merve's post with 🔥 1 day ago
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠

We release three checkpoints under Apache 2.0 (SmolVLM-Instruct, SmolVLM-Synthetic, and SmolVLM-Base): HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39

Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO 💝
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning Recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
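For context, here's a minimal transformers inference sketch; the image path and prompt are placeholders, and the exact preprocessing may differ slightly from the model card:

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

# Load SmolVLM-Instruct; bfloat16 keeps the memory footprint small
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

# Build a chat-style prompt around a single image (local path is a placeholder)
image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])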
Reacted to thomwolf's post with 🔥 2 days ago
Reacted to openfree's post with 👀🔥 2 days ago
🤗 HuggingFace Trending TOP 300 Board - Featuring AI Rating System
📊 Service Introduction
A comprehensive dashboard that provides at-a-glance access to the real-time TOP 300 trending Spaces, Models, and Datasets on HuggingFace.
Our specially developed AI rating system evaluates the practical value and growth potential of each item.
⭐ Key Features
1. AI Rising Rate

Growth-potential evaluation based on creation date and ranking
5-tier star rating system (★★★★★)
Evaluation criteria:
- Recency: higher relative weight for recently created items
- Ranking impact: higher relative weight for top rankings
- Overall assessment via statistical/analytical models applied by the AI

2. AI Popularity Score

Comprehensive evaluation combining objective popularity and the Rising Rate
18-tier grading system from AAA+ to B-
Evaluation elements:
- Base score: a benchmark based on likes, downloads, comments, etc.
- Additional score: the Rising Rate applied as a weighted factor
- Overall assessment via statistical/analytical models applied by the AI

(A toy sketch of how these signals might combine follows below.)

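A purely illustrative sketch with made-up weights and thresholds (the board's actual formula isn't public), showing how recency and rank could fold into a 5-star Rising Rate and an 18-tier grade:

from datetime import datetime, timezone

# 18 tiers from AAA+ down to B-
TIERS = [f"{letters}{sign}"
         for letters in ("AAA", "AA", "A", "BBB", "BB", "B")
         for sign in ("+", "", "-")]

def rising_rate(created_at: datetime, rank: int, top_n: int = 300) -> int:
    """1-5 stars: newer items and better ranks score higher (toy weights)."""
    age_days = (datetime.now(timezone.utc) - created_at).days
    recency = max(0.0, 1.0 - age_days / 365)   # 1.0 for brand-new, 0.0 after a year
    rank_score = 1.0 - (rank - 1) / top_n      # 1.0 for rank 1
    return max(1, round(5 * (0.5 * recency + 0.5 * rank_score)))

def popularity_grade(likes: int, downloads: int, stars: int) -> str:
    """Map a base popularity score plus the Rising Rate bonus onto the 18 tiers."""
    base = likes / 100 + downloads / 10_000    # crude popularity proxy
    score = base + stars                       # Rising Rate as an additive bonus
    idx = len(TIERS) - 1 - min(int(score), len(TIERS) - 1)
    return TIERS[idx]

print(rising_rate(datetime(2024, 11, 25, tzinfo=timezone.utc), rank=3))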
3. Visualization Features

Real-time screenshot capture with caching
Intuitive card-based UI
Responsive grid layout
Pastel gradient design

🎯 Applications

AI/ML Project Trend Analysis
Early Discovery of Promising Models/Datasets
Community Activity Monitoring
Research/Development Direction Reference

💡 Key Advantages

Real-time TOP 300 ranking
AI-based objective evaluation system
Fast loading with caching system
Intuitive and modern UI/UX
Integrated dashboard for 3 categories

🔄 Update Cycle

Real-time data reflection
Manual refresh option
Minimized server load through screenshot caching

🎁 Future Plans

Addition of detailed analysis report feature
Custom filtering options
Time-series trend analysis
Category-specific detailed statistics

🌐 How to Access
openfree/trending-board

#HuggingFace #AI #MachineLearning #TrendingBoard #DataScience
posted an update 3 days ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

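As a quick start, here's a sketch of pulling the released pieces from the Hub (the "all" config name for SmolTalk is an assumption; check the dataset card):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stream the SmolTalk SFT dataset (config name assumed)
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
print(next(iter(smoltalk)))

# Quick generation test with the instruct model
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
inputs = tok.apply_chat_template(
    [{"role": "user", "content": "Give me one tip for writing concise summaries."}],
    add_generation_prompt=True, return_tensors="pt",
)
print(tok.decode(model.generate(inputs, max_new_tokens=64)[0]))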
Which other tools should we add next?
Reacted to prithivMLmods's post with 🔥 4 days ago
Weekend Dribble 📦🍺

Adapters for Product Ad Backdrops, Smooth Polaroids, Minimalist Sketch cards, Super Blends!!

🤏Demo on: prithivMLmods/FLUX-LoRA-DLC

Stranger Zones:
👉🏼{ Super Blend } : strangerzonehf/Flux-Super-Blend-LoRA

πŸ‘‰πŸΌ{ Product Concept Ad } : prithivMLmods/Flux-Product-Ad-Backdrop
πŸ‘‰πŸΌ{ Frosted Mock-ups } : prithivMLmods/Flux.1-Dev-Frosted-Container-LoRA
πŸ‘‰πŸΌ{ Polaroid Plus } : prithivMLmods/Flux-Polaroid-Plus
πŸ‘‰πŸΌ{Sketch Cards} : prithivMLmods/Flux.1-Dev-Sketch-Card-LoRA

👉Stranger Zone: https://huggingface.co/strangerzonehf

👉Flux LoRA Collections: prithivMLmods/flux-lora-collections-66dd5908be2206cfaa8519be
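If you'd rather script it than use the demo, here's a diffusers sketch; the prompt is a placeholder, and each adapter's trigger words live on its model card:

import torch
from diffusers import FluxPipeline

# Load the Flux base model, then attach one of the LoRA adapters above
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("prithivMLmods/Flux-Product-Ad-Backdrop")

# Placeholder prompt; swap in the adapter's trigger words from its card
image = pipe("Product Ad Backdrop, studio lighting, minimal set",
             num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("backdrop.png")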

@prithivMLmods 🤗
Reacted to merve's post with ❤️🚀 4 days ago
Your Hugging Face profile now has your recent activities 🤗
Reacted to merve's post with 🔥 5 days ago
What a week! A recap for everything you missed ❄️
merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released: a multimodal reproduction of the o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina released Jina-CLIP-v2: 0.98B multilingual multimodal embeddings
> Apple released new SotA vision encoders AIMv2

LLMs 🦙
> AllenAI dropped a huge release of models, datasets, and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO, and a new technique they developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs

Image Generation 🖼️
> Black Forest Labs released Flux.1 Tools: four new models for different image modifications and two LoRAs for image conditioning and better steering of generations

Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them 📚
$ pip install observers
Reacted to ArthurZ's post with 🔥 8 days ago
Reacted to Xenova's post with 🔥 3 months ago
I can't believe this... Phi-3.5-mini (3.8B) running in-browser at ~90 tokens/second on WebGPU w/ Transformers.js and ONNX Runtime Web! 🤯 Since everything runs 100% locally, no messages are sent to a server: a huge win for privacy!
- 🤗 Demo: webml-community/phi-3.5-webgpu
- 🧑‍💻 Source code: https://github.com/huggingface/transformers.js-examples/tree/main/phi-3.5-webgpu
Reacted to dvilasuero's post with 🚀🔥 6 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we've been collaborating with Hugging Face on countless projects: partnering on the launch of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and of a larger team, but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
posted an update 6 months ago
🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
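To peek at the data without downloading trillions of tokens, you can stream one of the sample subsets (the "sample-10BT" config name is from the dataset card; double-check it there):

from datasets import load_dataset

# Stream the ~10B-token sample of FineWeb-Edu instead of downloading it all
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)
doc = next(iter(fw))
print(doc["text"][:200])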

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
Reacted to thomwolf's post with 🚀🔥 6 months ago
[New crazy blog post alert] We are releasing an extensive blog post on the science of creating high quality web-scale datasets, detailing all the steps and learnings that came in our recent 15 trillion tokens 🍷FineWeb release

Inspired by the distill.pub interactive-graphics papers, we set out to write the most extensive, enjoyable, and in-depth tech report we could draft, so prepare for a 45-min read with interactive graphics and all.

And that's not all: in this article we also introduce 📚FineWeb-Edu, a filtered subset of Common Crawl with 1.3T tokens containing only web pages with very high educational content. To our knowledge, FineWeb-Edu outperforms all openly released web-scale datasets by a significant margin on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA.

We also make a number of surprising observations on the "quality" of the internet itself, which may challenge some of the general assumptions on web data (not saying more, I'll let you draw your own conclusions ;))

HuggingFaceFW/blogpost-fineweb-v1
Reacted to clefourrier's post with 🔥 7 months ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means you can get model scores averaged only over new problems that weren't in the training data. This means... contamination-free code evals! 🚀
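The date-based selection boils down to a simple filter; here's a toy illustration (the records and cutoff are made up, not LiveCodeBench's actual schema):

from datetime import date

# Hypothetical problem records with publication dates
problems = [
    {"id": "lc-101", "published": date(2023, 9, 1)},
    {"id": "lc-205", "published": date(2024, 2, 15)},
]

# Keep only problems published after the model's training cutoff,
# so scores can't be inflated by memorized training data
cutoff = date(2024, 1, 1)
fresh = [p for p in problems if p["published"] > cutoff]
print([p["id"] for p in fresh])  # ['lc-205']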

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
Reacted to Molbap's post with 🤗🚀🔥 8 months ago
🚀🚀 Exciting times for the document AI community!

We're thrilled to announce the release of some of the largest OCR datasets available to the public.
🔥 With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.

Here's how to access these datasets quickly:

from datasets import load_dataset

# Stream both OCR datasets instead of downloading them in full
pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)

This enables you to stream them directly, integrating seamlessly with your projects using the Hugging Face datasets library. On the hub, you can find them here:

pixparse/pdfa-eng-wds
pixparse/idl-wds

For lean data loading, the new chug library (https://github.com/huggingface/chug) offers a solution with PDF decoding:


import chug

# Read every page of each document (rather than sampling a subset)
task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
# Stream the train split through the Hugging Face datasets backend ('hfids')
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))

We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable. 🤗

Looking Ahead:

We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.

For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.