Llava o1 - vlm capable of spontaneous, systematic reasoning, similar to GPT-o1, 11B model outperforms gemini-1.5-pro, gpt-4o-mini, and llama-3.2-90B-vision Xkev/Llama-3.2V-11B-cot
Jina AI Jina CLIP v2 - general purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, matroyoshka representations (1024 to 64) jinaai/jina-clip-v2
Llava o1 - vlm capable of spontaneous, systematic reasoning, similar to GPT-o1, 11B model outperforms gemini-1.5-pro, gpt-4o-mini, and llama-3.2-90B-vision Xkev/Llama-3.2V-11B-cot
Jina AI Jina CLIP v2 - general purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, matroyoshka representations (1024 to 64) jinaai/jina-clip-v2
- Pre-training code with nanotron - Evaluation suite with lighteval - Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk) - Post-training scripts with TRL & the alignment handbook - On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents Nexusflow/athene-v2-6735b85e505981a794fb02cc
Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed microsoft/orca-agentinstruct-1M-v1
Smol TTS models are here! OuteTTS-0.1-350M - Zero shot voice cloning, built on LLaMa architecture, CC-BY license! 🔥
> Pure language modeling approach to TTS > Zero-shot voice cloning > LLaMa architecture w/ Audio tokens (WavTokenizer) > BONUS: Works on-device w/ llama.cpp ⚡
Three-step approach to TTS:
> Audio tokenization using WavTokenizer (75 tok per second) > CTC forced alignment for word-to-audio token mapping > Structured prompt creation w/ transcription, duration, audio tokens
The model is extremely impressive for 350M parameters! Kudos to the OuteAI team on such a brilliant feat - I'd love to see this be applied on larger data and smarter backbones like SmolLM 🤗
> Trained with 1.3 trillion (dolma 1.7) tokens on 16 nodes, each with 4 MI250 GPUs
> Three checkpoints:
- AMD OLMo 1B: Pre-trained model - AMD OLMo 1B SFT: Supervised fine-tuned on Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets - AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on UltraFeedback dataset
Key Insights: > Pre-trained with less than half the tokens of OLMo-1B > Post-training steps include two-phase SFT and DPO alignment > Data for SFT: - Phase 1: Tulu V2 - Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback
> Model checkpoints on the Hub & Integrated with Transformers ⚡️
Congratulations & kudos to AMD on a brilliant smol model release! 🤗
Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?
What a great day for Open Science! @AIatMeta released models, datasets, and code for many of its research artefacts! 🔥
1. Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. A new developer suite will be added to make it easier for developers to build with SAM 2.
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️
New player entered the game! Rhymes AI has just been announced, and unveiled Aria – a multimodal powerhouse that's punching above its weight.
Key insights:
🧠 Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.
🌈 Multimodal: text/image/video → text.
📚 Novel training approach: “multimodal-native” where multimodal training starts directly during pre-training, not just tacked on later
📏 Long 64K token context window
🔓 Apache 2.0 license, with weights, code, and demos all open
⚡️ On the benchmark side, Aria leaves some big names in the dust.
- It beats Pixtral 12B or Llama-3.2-12B on several vision benchmarks like MMMU or MathVista. - It even overcomes the much bigger GPT-4o on long video tasks and even outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.
But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called “Beago”. It’s handling even recent events with great accuracy!
And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.
A 'small' MobileNet-V4 update, I just pushed weights for the smallest model I've trained in the series, a 0.5 width multiplier version of the MobileNet-V4 Conv Small.
Now you may look at this and say hey, why is this impressive? 64.8% top-1 and 2.2M params? MobileNetV3-Small 0.75, and MobileNet-V2 0.5 are both fewer params (at ~2M) and over 65% top-1, what gives? Well this is where MobileNet-V4 differs from the previous versions of the model family, it trades off (gives up) a little parameter efficiency for some computational efficiency.
Less than two days ago Kyutai Labs open sourced Moshi - an ~7.6B on-device Speech to Speech foundation model and Mimi - SoTA streaming speech codec! 🔥
The release includes:
1. Moshiko & Moshika - Moshi finetuned on synthetic data (CC-BY license) (kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd) 2. Mimi - Streaiming Audio Codec, processes 24 kHz audio, down to a 12.5 Hz representation with a bandwidth of 1.1 kbps (CC-BY license) (kyutai/mimi) 3. Model checkpoints & Inference codebase written in Rust (Candle), PyTorch & MLX (Apache license) (https://github.com/kyutai-labs/moshi)
How does Moshi work?
1. Moshi processes two audio streams: one for itself and one for the user, with the user's stream coming from audio input and Moshi's stream generated by the model.
2. Along with these audio streams, Moshi predicts text tokens for its speech, enhancing its generation quality.
3. The model uses a small Depth Transformer for codebook dependencies and a large 7B parameter Temporal Transformer for temporal dependencies.
4. The theoretical latency is 160ms, with a practical latency of around 200ms on an L4 GPU.
Model size & inference:
Moshiko/ka are 7.69B param models
bf16 ~16GB VRAM 8-bit ~8GB VRAM 4-bit ~4GB VRAM
You can run inference via Candle 🦀, PyTorch and MLX - based on your hardware.
The Kyutai team, @adefossez@lmz and team are cracked AF, they're bringing some serious firepower to the open source/ science AI scene, looking forward to what's next! 🐐
🤯 Ghost 8B Beta emerges as a clear leader, surpassing even proprietary models like xAI Grok 1, OpenAI GPT 3.5, and Mistral Mixtral 8x7B. This dominance extends to its parity with Mistral Medium, further solidifying its position as a top-tier language model. Furthermore, Ghost 8B Beta stands out as one of only three models employing the zero-shot method for evaluation, alongside Claude 2 and Claude 3, showcasing its unique capabilities and potential for groundbreaking applications. --- 💬 Chat with the model here: - Playground with Ghost 8B Beta (β, 8k): lamhieu/ghost-8b-beta-8k - Playground with Ghost 8B Beta (β, 128k): lamhieu/ghost-8b-beta-128k - Official website: https://ghost-x.org/docs/models/ghost-8b-beta/