benchmark - a zzfive Collection

benchmark

updated 6 days ago

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Paper • 2411.18499 • Published 29 days ago • 18

VLSBench: Unveiling Visual Leakage in Multimodal Safety

Paper • 2411.19939 • Published 27 days ago • 9

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Paper • 2412.02611 • Published 23 days ago • 22

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Paper • 2412.03205 • Published 22 days ago • 15

ProcessBench: Identifying Process Errors in Mathematical Reasoning

Paper • 2412.06559 • Published 17 days ago • 68

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Paper • 2412.07626 • Published 16 days ago • 20

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

Paper • 2412.08687 • Published 15 days ago • 13

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

Paper • 2412.10319 • Published 13 days ago • 8

Are Your LLMs Capable of Stable Reasoning?

Paper • 2412.13147 • Published 9 days ago • 88

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

Paper • 2412.12606 • Published 10 days ago • 41

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published 8 days ago • 43

RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

Paper • 2412.13746 • Published 8 days ago • 9