DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models • Paper • 2410.07331 • Published Oct 9
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • Paper • 2404.07972 • Published Apr 11
S3Eval • Collection • S3Eval: A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models • 0 items • Updated Jan 19