SivilTaram (Qian Liu)

posted an update 5 months ago

Post

2470

Still following your human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how - and why🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we've proposed an automatic data mixture method RegMix that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark - and it only needs a 2% extra training FLOPs! 📈

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co/spaces/sail/RegMix

Reacted to victor's post with 🚀 6 months ago

Post

1539

✨ Tools are now available in HuggingChat (https://hf.co/chat)

In short, Tools allow HuggingChat to plug any ZeroGPU Space as a tool HuggingChat can use, offering limitless possibilities.

For the release we plugged 6 tools that you can use right now on command-R+, we plan to expand to more models.

We'll also allow you to add your own tools (any ZeroGPU space is compatible). For more info check out this discussion: huggingchat/chat-ui#470

Kudos to @nsarrazin @Saghen and @mishig for the release <3

7 replies

·

replied to their post 7 months ago

Thanks the Chief Llama Officer @osanseviero !

posted an update 7 months ago

Post

2266

Introducing Sailor-14B Model and Sailor2 Project 🚢

We're thrilled to announce the release of the Sailor-14B models, including the Base and the Chat versions!

✅Built upon the Qwen1.5-14B model, the Base version follows a similar procedure as our Sailor-7B model.
✅The Chat version is optimized using DPO on our in-house human preference dataset, yielding a better experience than our previous Chat models.

🏠Home: https://sailorllm.github.io
🤗Model: sail/Sailor-14B-Chat
💻Demo: sail/Sailor-14B-Chat

We're also excited to introduce the Sailor2 project, ✨ an open collaboration opportunity for the entire community! ✨

🌐 The Sailor2 project aims to build a LLM with ~30B parameters, optimized for multiple South-East Asian languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.

🎯The model will undergo continual pre-training from a base model proficient in both Chinese and English using nearly 800B SEA tokens, with an expected performance comparable to the most advanced business models for the above SEA languages.

🤝 Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.

🌍 Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by scanning the QR code! 🔍

Let's sail together and enjoy the journey!⚓

2 replies

·

posted an update 7 months ago

Post

1717

✨ Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀

💻Code: https://github.com/sail-sg/sailcraft
🤗Model: sail/sailor-language-models-65e19a749f978976f1959825
📜Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🌐Homepage: https://sailorllm.github.io

# Overview 🔍

The pipeline consists of 4 stages🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning

A special focus was given to the data cleaning part of South-East Asian (SEA) languages🌍

# Use Case ✨

With this codebase, you can clean your own dataset with:

✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, Malay)
✅ Investigate what data was removed at each processing stage

# Acknowledgement 🙏

The main credit goes to @dreamerdeo , the first author of our Sailor paper ❤️! He put in tremendous effort on the data processing pipeline, enabling the model's great performance. We believe the mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉

Sharing the recipe openly aligns with our commitment to open language model development. 💪 And this repo would not have been possible without the contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao , and the deduplication project from Google. 🧠

# What's Next 🚀

Share your thoughts or leave any comments on what you'd like the Sailor models to do! We also have some exciting news coming soon, and please stay tuned. 🚄

Reacted to thomwolf's post with 🔥 8 months ago

Post

4837

Is is time for the open-source AI robots revolution 🚀?

With @haixuantao and @Leyo we’ve been playing with a low-cost DJI robot controlled by three local open-source AI models (Whisper, Idefics2, Parler-TTS - all Apache2) and orchestrated by Dora-cs.

Links to find all the hardware/software we used in the demo:
- robot control framework – dora-rs: https://github.com/dora-rs/dora
- speech-to-text model – whisper: openai/whisper-base
- vision-text model – Idefics2: HuggingFaceM4/idefics2-8b-AWQ
- text-to-speech model – ParlerTTS mini: parler-tts/parler_tts_mini_v0.1
- robot: https://dji.com/robomaster-s1
- code gist: https://gist.github.com/haixuanTao/860e1740245dc2c8dd85b496150a9320
- Larger codebase: dora-rs/dora-idefics2
- laptop/pc: any with a recent GPU card (our has a RTX 4090)

Enjoy!

4 replies

·

Reacted to akhaliq's post with 🔥 8 months ago

Post

3232

LLM2Vec

Large Language Models Are Secretly Powerful Text Encoders

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (2404.05961)

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.

Reacted to qnguyen3's post with 🚀❤️🔥 8 months ago

Post

5192

🎉 Introducing nanoLLaVA, a powerful multimodal AI model that packs the capabilities of a 1B parameter vision language model into just 5GB of VRAM. 🚀 This makes it an ideal choice for edge devices, bringing cutting-edge visual understanding and generation to your devices like never before. 📱💻

Model: qnguyen3/nanoLLaVA 🔍
Spaces: qnguyen3/nanoLLaVA (thanks to @merve )

Under the hood, nanoLLaVA is based on the powerful vilm/Quyen-SE-v0.1 (my Qwen1.5-0.5B finetune) and Google's impressive google/siglip-so400m-patch14-384. 🧠 The model is trained using a data-centric approach to ensure optimal performance. 📊

In the spirit of transparency and collaboration, all code and model weights are open-sourced under the Apache 2.0 license. 🤝

1 reply

·

Reacted to clefourrier's post with ❤️ 8 months ago

Post

2214

Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from Prompt question? to Question: prompt question?\nChoices: enumeration of all choices\nAnswer: ), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format on the x axis, all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...

Reacted to their post with 🚀🔥 8 months ago

Post

2420

⚓️ Sailor: A New Multilingual Open LLM for South-East Asia 🌏

Last month we have released a new family of multilingual language models called **Sailor**, ranging from 0.5B to 7B parameters, continually pre-trained from the Qwen1.5 models. Based on our extensive benchmarking, the Sailor models demonstrate exceptional performance on South-East Asian languages, taking us one step closer to multilingual LLMs that can serve the diverse needs of the region and beyond.

Today, we're more than excited to share the key technical details behind the Sailor models! 💪

**Key highlights**:
🔍 Data curation: Merging short examples, document-level code-switching, aggressive data cleaning and deduplication.
🤖 Tokenization Robustness: We find that BPE dropout is really effective to deal with prompt variations.
🔍 Optimizing Data Mixture: We propose a new approach to automatically balance capabilities across different languages!
🌟 Recipe in Continual Pre-training: We discover a powerful metric that can help predict how well the Sailor models will perform on the original domain (e.g., English) after continual pre-training.

We are thrilled to share these technical details with the community and invite you to explore the Sailor models. We hope Sailor models take us one step closer to multilingual LLMs in the world! 🌍✨

To learn more, please access our research paper or reach out to our team.
🔗 Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🧩 Model: sail/sailor-language-models-65e19a749f978976f1959825
💻 Code: https://github.com/sail-sg/sailor-llm

posted an update 8 months ago

Post

2420

⚓️ Sailor: A New Multilingual Open LLM for South-East Asia 🌏

Last month we have released a new family of multilingual language models called **Sailor**, ranging from 0.5B to 7B parameters, continually pre-trained from the Qwen1.5 models. Based on our extensive benchmarking, the Sailor models demonstrate exceptional performance on South-East Asian languages, taking us one step closer to multilingual LLMs that can serve the diverse needs of the region and beyond.

Today, we're more than excited to share the key technical details behind the Sailor models! 💪

**Key highlights**:
🔍 Data curation: Merging short examples, document-level code-switching, aggressive data cleaning and deduplication.
🤖 Tokenization Robustness: We find that BPE dropout is really effective to deal with prompt variations.
🔍 Optimizing Data Mixture: We propose a new approach to automatically balance capabilities across different languages!
🌟 Recipe in Continual Pre-training: We discover a powerful metric that can help predict how well the Sailor models will perform on the original domain (e.g., English) after continual pre-training.

We are thrilled to share these technical details with the community and invite you to explore the Sailor models. We hope Sailor models take us one step closer to multilingual LLMs in the world! 🌍✨

To learn more, please access our research paper or reach out to our team.
🔗 Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🧩 Model: sail/sailor-language-models-65e19a749f978976f1959825
💻 Code: https://github.com/sail-sg/sailor-llm

Reacted to osanseviero's post with ❤️👍 9 months ago

Post

Diaries of Open Source. Part 2. Open Source is going brrrrr

🚀The European Space Agency releases MajorTOM, a dataset of earth observation covering half the earth. The dataset has 2.5 trillion pixels! Congrats @aliFrancis and @mikonvergence !
Dataset: Major-TOM/Core-S2L2A
Viewer: Major-TOM/MajorTOM-Core-Viewer

🍞Re-ranking models by MixedBreadAI, with very high quality, Apache 2 license, and easy to use!
Models: https://huggingface.co/models?other=reranker&sort=trending&search=mixedbread-ai
Blog: https://www.mixedbread.ai/blog/mxbai-rerank-v1

🧊StabilityAI and TripoAI release TripoSR, a super-fast MIT-licensed image-to-3D model!
Model: stabilityai/TripoSR
Demo: stabilityai/TripoSR

🤝Together AI and HazyResearch release Based
Models and datasets: hazyresearch/based-65d77fb76f9c813c8b94339c
GH repo: https://github.com/HazyResearch/based

🌊LaVague: an open-source pipeline to turn natural language into browser actions! It can run locally with HuggingFaceH4/zephyr-7b-gemma-v0.1
Read more about it at https://huggingface.co/posts/dhuynh95/717319217106504

🏆Berkeley Function-Calling Leaderboard
Read about it: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html

🐬Sailor-Chat: chat models built on top of OpenOrca and @sarahooker CohereForAI Aya project. They can be used for South-East Asia languages such as Indonesian, Thai, Vietnamese, Malay and Lao!
Models: sail/sailor-language-models-65e19a749f978976f1959825
Demo: sail/Sailor-7B-Chat

🤗Arabic-OpenHermes-2.5: OpenHermes dataset translated to Arabic 2A2I/Arabic-OpenHermes-2.5

See the previous part here https://huggingface.co/posts/osanseviero/622788932781684

3 replies

·

Reacted to loubnabnl's post with 🤯❤️🤗 9 months ago

Post

⭐ Today we’re releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similar sized models.
- The Stack v2 a 4x larger dataset than the Stack v1, resulting in 900B unique code tokens 🚀
As always, we released everything from models and datasets to curation code. Enjoy!

🔗 StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
🔗 Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
🔗 BlogPost: https://huggingface.co/blog/starcoder2
🔗 Code Leaderboard: bigcode/bigcode-models-leaderboard

Reacted to clem's post with ❤️ 10 months ago

Post

With the Google announcement last week, I think we're now officially the only AI startup out there who has commercial collaborations with all the major cloud providers (AWS, GCP, Azure) and hardware providers (Nvidia, AMD, Intel, Qualcomm,...), making our vision of being the independent and agnostic platform for all AI builders truer than ever!

Let's go!

Qian Liu PRO

AI & ML interests

Articles

RegMix: Data Mixture as Regression for Language Model Pre-training

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Efficient Table Pre-training without Real Data: An Introduction to TAPEX

Organizations

SivilTaram's activity