Open LLM Leaderboard 2
Track, rank and evaluate open LLMs and chatbots
Gathering benchmark spaces on the hub (beyond the Open LLM Leaderboard)
Note 🏆 The 🤗 Open LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots. 🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
Note Massive Text Embedding Benchmark (MTEB) Leaderboard.
Note 🏆 This leaderboard is based on the following three benchmarks: Chatbot Arena - a crowdsourced, randomized battle platform, where we use 70K+ user votes to compute Elo ratings; MT-Bench - a set of challenging multi-turn questions, where we use GPT-4 to grade the model responses; MMLU (5-shot) - a test to measure a model's multitask accuracy on 57 tasks.
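For intuition, here is a minimal sketch of how Elo ratings can be derived from pairwise battle votes. The K-factor, initial rating and example votes are assumptions for illustration only, not the leaderboard's actual pipeline or data.

```python
# Illustrative Elo updates from pairwise battle votes (hypothetical data and constants).
from collections import defaultdict

K = 4        # update step size (assumed value)
INIT = 1000  # initial rating (assumed value)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def compute_elo(battles):
    """battles: iterable of (model_a, model_b, winner) with winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: float(INIT))
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
        ratings[model_a] += K * (s_a - e_a)
        ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Example with made-up votes:
votes = [("vicuna-13b", "alpaca-13b", "a"), ("vicuna-13b", "koala-13b", "tie")]
print(compute_elo(votes))
```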
Note The 🤗 LLM-Perf Leaderboard 🏋️ aims to benchmark the performance (latency, throughput & memory) of Large Language Models (LLMs) across different hardware, backends and optimizations using Optimum-Benchmark and Optimum flavors. Anyone from the community can request a model or a hardware/backend/optimization configuration for automated benchmarking.
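As a rough illustration of what such a benchmark measures, the sketch below times greedy generation and reads peak GPU memory with plain transformers and torch. It is not the Optimum-Benchmark harness the leaderboard uses; the model name, prompt and token budget are placeholders.

```python
# Rough sketch of latency, throughput and peak-memory measurement for generation.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

inputs = tokenizer("Benchmarking prompt", return_tensors="pt").to(device)
max_new_tokens = 128

if device == "cuda":
    torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

generated = output.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {elapsed:.2f} s")
print(f"throughput: {generated / elapsed:.1f} tokens/s")
if device == "cuda":
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```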
Note Compare the performance of base multilingual code generation models on the HumanEval benchmark and MultiPL-E. We also measure throughput and provide information about the models. We only compare open pre-trained multilingual code models that people can start from as base models for their own training.
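HumanEval-style code benchmarks are usually reported as pass@k. Below is a sketch of the standard unbiased pass@k estimator; the sample counts in the usage example are made up.

```python
# Unbiased pass@k estimator commonly used for HumanEval-style evaluation
# (sample counts below are made up for illustration).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: generated samples per problem, c: samples passing the tests, k: budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 37 passing, reported at k=1 and k=10:
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```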
Note The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. We report the Average WER (⬇️ lower is better) and RTF (⬇️ lower is better). Models are ranked based on their Average WER, from lowest to highest.
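As a minimal sketch of these two metrics, the snippet below computes WER with the jiwer package and RTF as transcription time over audio duration. The transcripts and timings are placeholders, not taken from the leaderboard.

```python
# WER via jiwer and RTF as processing time / audio length (placeholder data).
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

word_error_rate = wer(reference, hypothesis)   # fraction of word-level errors

audio_duration_s = 10.0        # length of the audio clip (placeholder)
transcription_time_s = 1.3     # wall-clock time the model took (placeholder)
rtf = transcription_time_s / audio_duration_s  # lower is better

print(f"WER: {word_error_rate:.2%}, RTF: {rtf:.2f}")
```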
Note The MT-Bench Browser (see Chatbot Arena).