Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution Paper • 2310.16834 • Published Oct 25, 2023 • 4
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11 • 32
From screenshots to HTML Collection • WebSight is a dataset of 823,000 HTML/CSS codes representing synthetically generated English websites, each accompanied by a corresponding screenshot. • 4 items • Updated Apr 15 • 18
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11 • 46
MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems Paper • 2404.09486 • Published Apr 15 • 1
GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse Paper • 2401.01523 • Published Jan 3 • 1
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval Paper • 2302.02908 • Published Feb 6, 2023 • 1