Omar Sanseviero

osanseviero

AI & ML interests

Llamas, model merging, massive ASR for data collection, 3D ML, on-device ML, quantization, model judging, ML in browser, healthcare applications, education, intersection of art and ML.🦙

osanseviero's activity

reacted to tomaarsen's post with 🚀🔥 29 days ago
📣 Sentence Transformers v3.2.0 is out, marking the biggest release for inference in 2 years! Two new backends for embedding models: ONNX (+ optimization & quantization) and OpenVINO, allowing speedups of up to 2x-3x, AND Static Embeddings for 500x speedups at a 10-20% accuracy cost.

1️⃣ ONNX Backend: This backend uses the ONNX Runtime to accelerate model inference on both CPU and GPU, reaching up to 1.4x-3x speedup depending on the precision. We also introduce 2 helper methods for optimizing and quantizing models for (much) faster inference.
2️⃣ OpenVINO Backend: This backend uses Intel's OpenVINO instead, outperforming ONNX in some situations on CPU.

Usage is as simple as SentenceTransformer("all-MiniLM-L6-v2", backend="onnx"). Does your model not have an ONNX or OpenVINO file yet? No worries - it'll be auto-exported for you. Thank me later 😉
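
A minimal sketch of the new backend argument in practice (same model as above; the ONNX file is exported automatically on first load if the repo doesn't ship one):

```python
from sentence_transformers import SentenceTransformer

# Load with the ONNX backend; sentence-transformers auto-exports an
# ONNX file if the repository doesn't have one yet.
model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
])
print(embeddings.shape)  # (2, 384) for this model
```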

🔒 Another major new feature is Static Embeddings: think word embeddings like GloVe and word2vec, but modernized. Static Embeddings are bags of token embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings that don't require any neural networks. They're initialized in one of two ways:

1️⃣ via Model2Vec, a new technique for distilling any Sentence Transformer model into static embeddings. Either use a pre-distilled model with from_model2vec or do the distillation yourself with from_distillation. It'll only take about 5 seconds on GPU & 2 minutes on CPU, no dataset needed.
2️⃣ Random initialization. This requires finetuning, but finetuning is extremely quick (e.g. I trained with 3 million pairs in 7 minutes). My final model was 6.6% worse than bge-base-en-v1.5, but 500x faster on CPU.
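
A rough sketch of both initialization paths (assuming the model2vec package is installed; the pre-distilled repo name is illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# 1) Distill an existing Sentence Transformer yourself
#    (needs the `model2vec` package; seconds on GPU, minutes on CPU).
static = StaticEmbedding.from_distillation("all-MiniLM-L6-v2", device="cpu")

#    ... or load a pre-distilled Model2Vec model from the Hub
#    (repo name is illustrative):
# static = StaticEmbedding.from_model2vec("minishlab/M2V_base_output")

# 2) Wrap it as a regular SentenceTransformer and encode as usual.
model = SentenceTransformer(modules=[static])
embeddings = model.encode(["Lightning-fast embeddings, no neural network needed."])
```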

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.2.0
Documentation on Speeding up Inference: https://sbert.net/docs/sentence_transformer/usage/efficiency.html
reacted to nyuuzyou's post with 👀 30 days ago
🎓 Introducing Doc4web.ru Documents Dataset - nyuuzyou/doc4web

Dataset highlights:
- 223,739 documents from doc4web.ru, a document hosting platform for students and teachers
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: URL, title, download link, file path, and content (where available)
- Contains original document files in addition to metadata
- Data reflects a wide range of educational topics and materials
- Licensed under Creative Commons Zero (CC0) for unrestricted use

The dataset can be used for analyzing educational content in Russian, text classification tasks, and information retrieval systems. It's also valuable for examining trends in educational materials and document sharing practices in the Russian-speaking academic community. The inclusion of original files allows for in-depth analysis of various document formats and structures.
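
If you want to poke at it, here's a quick hedged sketch with 🤗 Datasets (streaming so you don't pull all 223k documents at once; the field names follow the listing above but the exact schema may differ from the card):

```python
from datasets import load_dataset

# Stream instead of downloading the full dataset;
# "url" and "title" fields are assumed from the description above.
ds = load_dataset("nyuuzyou/doc4web", split="train", streaming=True)

for row in ds.take(3):
    print(row["url"], row["title"])
```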
reacted to merve's post with 🔥 about 1 month ago
Meta AI vision has been cooking @facebook
They shipped multiple models and demos for their papers at @ECCV 🤗

Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation, and more; all models have open weights and demos 👏

All models have demos and even TorchScript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art model for consistent 3D generation from images

Model: facebook/vfusion3d
Demo: facebook/VFusion3D

- CoTracker is the state-of-the-art point (pixel) tracking model

Demo: facebook/cotracker
Model: facebook/cotracker
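
For a taste of CoTracker, a hedged sketch based on the project's torch.hub entry point (the entry-point name and I/O shapes may vary between releases; the dummy video is just for illustration):

```python
import torch

# Load CoTracker from torch.hub (entry-point name may differ by release).
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")

# Dummy clip: (batch, time, channels, height, width)
video = torch.randn(1, 8, 3, 224, 224)

# Track a regular grid of points across the clip.
pred_tracks, pred_visibility = cotracker(video, grid_size=10)
print(pred_tracks.shape)  # roughly (batch, time, num_points, 2)
```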
reacted to fdaudens's post with 🧠🤗👀🔥 about 1 month ago
The Nobel Prize background for Hopfield and Hinton's work on neural networks is pure gold. It's a masterclass in explaining AI basics.

Key takeaways from the conclusion:
- ML applications are expanding rapidly. We're still figuring out which will stick.
- Ethical discussions are crucial as the tech develops.
- Physics 🤝 AI: A two-way street of innovation.

Some mind-blowing AI applications in physics:
- Discovering the Higgs particle
- Cleaning up gravitational wave data
- Hunting exoplanets
- Predicting molecular structures
- Designing better solar cells

We're just scratching the surface. The interplay between AI and physics is reshaping both fields.

Bonus: The illustrations accompanying the background document are really neat. (Credit: Johan Jarnestad/The Royal Swedish Academy of Sciences)

#AI #MachineLearning #Physics #Ethics #Innovation
reacted to reach-vb's post with 🔥👍 about 1 month ago
The on-device AI framework ecosystem is blooming these days:

1. llama.cpp - All things Whisper, LLMs & VLMs - runs across Metal, CUDA, and other backends (AMD, NPU, etc.)
https://github.com/ggerganov/llama.cpp

2. MLC - Deploy LLMs across platforms especially WebGPU (fastest WebGPU LLM implementation out there)
https://github.com/mlc-ai/web-llm

3. MLX - Arguably the fastest general-purpose framework (Mac only) - supports all major image generation (Flux, SDXL, etc.), transcription (Whisper), and LLMs (see the sketch at the end of this post)
https://github.com/ml-explore/mlx-examples

4. Candle - Cross-platform general purpose framework written in Rust - wide coverage across model categories
https://github.com/huggingface/candle

Honorable mentions:

1. Transformers.js - JavaScript (WebGPU) implementation built on top of ONNX Runtime Web
https://github.com/xenova/transformers.js

2. Mistral rs - Rust implementation for LLMs & VLMs, built on top of Candle
https://github.com/EricLBuehler/mistral.rs

3. Ratchet - Cross platform, rust based WebGPU framework built for battle-tested deployments
https://github.com/huggingface/ratchet

4. Zml - Cross platform, Zig based ML framework
https://github.com/zml/zml

Looking forward to how the ecosystem will look a year from now - quite bullish on the top 4 atm - but the open-source ecosystem changes quite a bit! 🤗

Also, which frameworks did I miss?
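
To make this concrete, here's a hedged sketch of the MLX route mentioned above, via the mlx-lm Python package (Apple Silicon only; the model repo is illustrative):

```python
# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Any MLX-format model from the Hub works; this repo is illustrative.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(model, tokenizer, prompt="Why does on-device AI matter?", max_tokens=100)
print(text)
```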
reacted to alielfilali01's post with 🔥 about 1 month ago
Why is nobody talking about the new training corpus released by MBZUAI today?

TxT360 is a 15+ trillion token corpus that outperforms FineWeb on several metrics. Ablation studies were done up to 1T tokens.

Read blog here : LLM360/TxT360
Dataset : LLM360/TxT360
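
Given the size, streaming is the sane way to peek at it. A hedged sketch (the split/config layout is assumed; check the dataset card for the real one):

```python
from datasets import load_dataset

# TxT360 is 15T+ tokens, so stream rather than download;
# the split name here is an assumption.
ds = load_dataset("LLM360/TxT360", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())
```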
reacted to lucifertrj's post with 👀 about 1 month ago
AI Agents with LlamaIndex in 40 minutes

The video covers code and workflow explanations for:

- Function Calling
- Function Calling Agents + Agent Runner
- Agentic RAG
- ReAct Agent: build your own Search Assistant Agent

Watch: https://youtu.be/bHn4dLJYIqE
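
As a companion to the video, a minimal ReAct-style agent sketch with LlamaIndex (module paths follow llama-index v0.10-style packaging and may differ; the OpenAI model name is illustrative):

```python
# pip install llama-index llama-index-llms-openai
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# Wrap a plain Python function as a tool the agent can call.
tool = FunctionTool.from_defaults(fn=multiply)

# ReAct agent: reasons step by step, picks the tool, observes the result.
agent = ReActAgent.from_tools([tool], llm=OpenAI(model="gpt-4o-mini"), verbose=True)
print(agent.chat("What is 12.3 times 4.5?"))
```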
reacted to fdaudens's post with 🔥 about 1 month ago
This is how AI can be useful in journalism: Just tested DataTalk - a tool that lets you dig through campaign finance data with just your words.

It's transforming complex FEC filings and OpenSecrets datasets into actionable insights for journalists.

Key features for newsrooms:
- Natural language queries on FEC data
- Rapid insights on donors, spending, special interests
- SQL access for deep dives

Tested it out:
- Retrieved how much Harris and Trump raised
- Found top donors instantly (#1 is Timothy Mellon—have you heard about him?)
- Uncovered big self-funders like David Trone ($62M)

Pros:
- Saves hours of data wrangling
- Surfaces story leads quickly
- Transparent retrieval steps make the tool auditable

Awesome work by Stanford University Open Virtual Assistant Lab, Big Local News, and Columbia University - Graduate School of Journalism. Expert-guided.

Remember: Always verify. Use for leads, not final copy. But this is gold for finding new leads.

How might this change campaign finance reporting? What other datasets need this treatment?

Try it out: https://www.datatalk.genie.stanford.edu/

#AIJournalism #campaignfinance #datajournalism #election2024
reacted to philipp-zettl's post with 👀 about 1 month ago
🚀 Finishing up the prototype of my weekend project called ChessPT 🚀

- The game state is now being rendered, which makes it easier to come up with your own moves
- The model space philipp-zettl/ChessPT was updated to provide an interactive mode.
- The space is currently running v0.4 of philipp-zettl/chessPT
- New updates will come this week.
- Training runs will be logged under https://wandb.ai/philipp-zettl/chessPT/

**Note**: The model is still not performing at the level I want. It too frequently predicts invalid moves (given the game state). On top of that, the post-processing step is a little faulty, so you may end up in a state where the model didn't provide a next move.
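
For the curious, a heavily hedged sketch of trying the model with 🤗 Transformers. This assumes chessPT is a causal LM over PGN-style move text; the prompt format is a guess, so check the model card or the Space for the real one:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: chessPT is a GPT-style causal LM trained on PGN move
# sequences; the prompt format below is illustrative only.
tokenizer = AutoTokenizer.from_pretrained("philipp-zettl/chessPT")
model = AutoModelForCausalLM.from_pretrained("philipp-zettl/chessPT")

prompt = "1. e4 e5 2. Nf3"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```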
reacted to alielfilali01's post with 👍 about 1 month ago
Don't you think we should add an "Evaluation" tag for datasets that are meant to be benchmarks and not for training?

At the very least, anyone collecting a group of datasets from an organization - or, say, across the whole Hub - could filter on that tag and avoid contaminating their training data.
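
If such a tag existed, filtering would be a one-liner with huggingface_hub (the "evaluation" tag here is the proposal itself, not a tag that exists today):

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical: filter datasets by an "evaluation" tag, if it existed.
for ds in api.list_datasets(filter="evaluation", limit=10):
    print(ds.id)
```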
reacted to nyuuzyou's post with 👀 about 1 month ago
🌐 Subdomain Dataset Update: September 2024 Data Now Available

I have updated the nyuuzyou/subdomains dataset with fresh data for September 2024. This addition further expands the largest collection of subdomain statistics currently available, providing researchers and analysts with even more valuable insights into web infrastructure and domain patterns.

Latest Update Highlights:
- New File: subdomains_2024_09.csv
- Unique Subdomains: 19,191,867
- Total Occurrences: 170,792,927
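
A quick hedged sketch for loading just the new monthly file (the column names are assumptions; check the dataset card):

```python
from datasets import load_dataset

# Pull only the September 2024 CSV rather than the whole dataset.
ds = load_dataset(
    "nyuuzyou/subdomains",
    data_files="subdomains_2024_09.csv",
    split="train",
)
print(ds[0])  # column names depend on the dataset card
```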
reacted to davidberenstein1957's post with 👀 about 1 month ago
On Thursday, 10 October at 17:00 CEST, I will show a good way to get started with a text classification project on the Hugging Face Hub with Argilla and SetFit.

Signup here: https://lu.ma/31mecp34
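
If you want a head start before the session, here's a rough SetFit sketch (toy data; in the session the labels would come from Argilla-annotated records, and the base model is just one sensible choice):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Toy training data; with Argilla this would be your annotated records.
train_ds = Dataset.from_dict({
    "text": ["I loved it!", "Terrible experience.", "Great value.", "Not good at all."],
    "label": [1, 0, 1, 0],
})

# Few-shot classifier on top of a small sentence-transformer.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
trainer = Trainer(model=model, args=TrainingArguments(num_epochs=1), train_dataset=train_ds)
trainer.train()

print(model.predict(["This was fantastic!"]))
```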
reacted to mmhamdy's post with 👀 about 1 month ago
🔗 Evaluating Long Context #1: Long Range Arena (LRA)

Accurately evaluating how well language models handle long contexts is crucial, but it's also quite challenging to do well. In this series of posts, we're going to examine the various benchmarks that were proposed to assess long context understanding, starting with Long Range Arena (LRA).

Introduced in 2020, Long Range Arena (LRA) is one of the earliest benchmarks designed to tackle the challenge of long context evaluation.

📌 Key Features of LRA

1️⃣ Diverse Tasks: The LRA benchmark consists of a suite of tasks designed to evaluate model performance on long sequences ranging from 1,000 to 16,000 tokens. These tasks encompass different data types and modalities: Text, Natural and Synthetic Images, and Mathematical Expressions.

2️⃣ Synthetic and Real-world Tasks: LRA comprises both synthetic probing tasks and real-world tasks.

3️⃣ Open-Source and Extensible: Implemented in Python using JAX and Flax, the LRA benchmark code is publicly available, making it easy to extend.

📌 Tasks

1️⃣ Long ListOps

2️⃣ Byte-level Text Classification and Document Retrieval

3️⃣ Image Classification

4️⃣ Pathfinder and Pathfinder-X (Long-range spatial dependency)

👨‍💻 Long Range Arena (LRA) Github Repository: https://github.com/google-research/long-range-arena

📄 Long Range Arena (LRA) paper: Long Range Arena: A Benchmark for Efficient Transformers (arXiv:2011.04006)
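
To make the first task concrete, here's a tiny illustrative evaluator for ListOps-style expressions (the op set mirrors the paper: MAX, MIN, MED, and SM for sum mod 10; LRA's actual sequences run to thousands of tokens):

```python
def eval_listops(tokens):
    """Tiny recursive evaluator for ListOps-style prefix expressions."""
    ops = {
        "MAX": max,
        "MIN": min,
        "MED": lambda xs: sorted(xs)[len(xs) // 2],
        "SM": lambda xs: sum(xs) % 10,  # sum modulo 10
    }

    def parse(i):
        if tokens[i] == "[":
            op, args, i = tokens[i + 1], [], i + 2
            while tokens[i] != "]":
                val, i = parse(i)
                args.append(val)
            return ops[op](args), i + 1
        return int(tokens[i]), i + 1

    return parse(0)[0]

expr = "[ MAX 4 3 [ MIN 2 3 ] 1 0 ]".split()
print(eval_listops(expr))  # -> 4
```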
reacted to clem's post with 👀 about 1 month ago
reacted to datatab's post with 👀 about 1 month ago
Explore the Serbian LLM Evaluation Dataset 🚀

datatab/serbian-llm-benchmark

Excited to share the launch of the Serbian LLM Evaluation Dataset - a specialized resource for rigorously testing Serbian language models. Perfect for developers, researchers, and tech enthusiasts eager to push the boundaries of AI in language understanding.

Features include:
- Extensive question sets across multiple domains such as science, history, and more
- Integration with Lighteval for streamlined, efficient evaluation
- Diverse challenge levels to thoroughly assess model performance in Serbian

Start enhancing your language models today! Dive into a universe of data crafted to refine and perfect AI interactions in Serbian.