CCMat's Collections: toread
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training • Paper • 2311.17049 • Published • 1
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model • Paper • 2405.04434 • Published • 14
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision • Paper • 2303.17376 • Published
Sigmoid Loss for Language Image Pre-Training • Paper • 2303.15343 • Published • 4
Better & Faster Large Language Models via Multi-token Prediction • Paper • 2404.19737 • Published • 73
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads • Paper • 2401.10774 • Published • 54
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation • Paper • 2404.19427 • Published • 71
CogVLM: Visual Expert for Pretrained Language Models • Paper • 2311.03079 • Published • 23
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD • Paper • 2404.06512 • Published • 29
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model • Paper • 2401.16420 • Published • 55
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation • Paper • 2404.02733 • Published • 20
Demonstration-Regularized RL • Paper • 2310.17303 • Published
Vision Transformers Need Registers • Paper • 2309.16588 • Published • 77
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation • Paper • 2405.01434 • Published • 52
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation • Paper • 2404.19752 • Published • 22
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models • Paper • 2405.01535 • Published • 118
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report • Paper • 2405.00732 • Published • 118
RLHF Workflow: From Reward Modeling to Online RLHF • Paper • 2405.07863 • Published • 67
What matters when building vision-language models? • Paper • 2405.02246 • Published • 100
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding • Paper • 2405.08748 • Published • 19
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection • Paper • 2405.10300 • Published • 26
Many-Shot In-Context Learning in Multimodal Foundation Models • Paper • 2405.09798 • Published • 26
CAT3D: Create Anything in 3D with Multi-View Diffusion Models • Paper • 2405.10314 • Published • 44
LoRA Learns Less and Forgets Less • Paper • 2405.09673 • Published • 87
Chameleon: Mixed-Modal Early-Fusion Foundation Models • Paper • 2405.09818 • Published • 126
Layer-Condensed KV Cache for Efficient Inference of Large Language Models • Paper • 2405.10637 • Published • 19
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework • Paper • 2405.11143 • Published • 34
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning • Paper • 2405.12130 • Published • 46
FIFO-Diffusion: Generating Infinite Videos from Text without Training • Paper • 2405.11473 • Published • 53
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control • Paper • 2405.12970 • Published • 22
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention • Paper • 2405.12981 • Published • 28
Diffusion for World Modeling: Visual Details Matter in Atari • Paper • 2405.12399 • Published • 27
Your Transformer is Secretly Linear • Paper • 2405.12250 • Published • 150
ReVideo: Remake a Video with Motion and Content Control • Paper • 2405.13865 • Published • 23
Matryoshka Multimodal Models • Paper • 2405.17430 • Published • 31
An Introduction to Vision-Language Modeling • Paper • 2405.17247 • Published • 86
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models • Paper • 2405.15738 • Published • 43
Improving the Training of Rectified Flows • Paper • 2405.20320 • Published • 1
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis • Paper • 2403.03206 • Published • 59
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model • Paper • 2406.04333 • Published • 36
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions • Paper • 2406.04325 • Published • 72
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step • Paper • 2406.04314 • Published • 27
Block Transformer: Global-to-Local Language Modeling for Fast Inference • Paper • 2406.02657 • Published • 37
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution • Paper • 2307.06304 • Published • 27
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework • Paper • 2404.14619 • Published • 126
Multi-Head Mixture-of-Experts • Paper • 2404.15045 • Published • 59
Pegasus-v1 Technical Report • Paper • 2404.14687 • Published • 30
Towards Modular LLMs by Building and Reusing a Library of LoRAs • Paper • 2405.11157 • Published • 26
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models • Paper • 2405.15574 • Published • 53
Paper • 2405.18407 • Published • 46
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality • Paper • 2405.21060 • Published • 63
CRAG -- Comprehensive RAG Benchmark • Paper • 2406.04744 • Published • 42
DiTFastAttn: Attention Compression for Diffusion Transformer Models • Paper • 2406.08552 • Published • 23
Paper • 2406.09414 • Published • 92
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • Paper • 2406.09415 • Published • 50
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing • Paper • 2406.10601 • Published • 65
EvTexture: Event-driven Texture Enhancement for Video Super-Resolution • Paper • 2406.13457 • Published • 16
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation • Paper • 2406.12849 • Published • 49
Adam-mini: Use Fewer Learning Rates To Gain More • Paper • 2406.16793 • Published • 67
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation • Paper • 2406.16855 • Published • 54
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion • Paper • 2407.01392 • Published • 39
No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models • Paper • 2407.02687 • Published • 22
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output • Paper • 2407.03320 • Published • 92
Video Diffusion Alignment via Reward Gradients • Paper • 2407.08737 • Published • 47
Paper • 2407.10671 • Published • 157
Theia: Distilling Diverse Vision Foundation Models for Robot Learning • Paper • 2407.20179 • Published • 46
Gemma 2: Improving Open Language Models at a Practical Size • Paper • 2408.00118 • Published • 75
The Llama 3 Herd of Models • Paper • 2407.21783 • Published • 107
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement • Paper • 2408.00653 • Published • 28
SAM 2: Segment Anything in Images and Videos • Paper • 2408.00714 • Published • 108
MiniCPM-V: A GPT-4V Level MLLM on Your Phone • Paper • 2408.01800 • Published • 78
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining • Paper • 2408.02657 • Published • 32
IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts • Paper • 2408.03209 • Published • 21
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models • Paper • 2408.02718 • Published • 60
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI • Paper • 2408.03361 • Published • 85
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion • Paper • 2408.03178 • Published • 36
LLaVA-OneVision: Easy Visual Task Transfer • Paper • 2408.03326 • Published • 59
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches • Paper • 2408.04567 • Published • 24
Transformer Explainer: Interactive Learning of Text-Generative Models • Paper • 2408.04619 • Published • 155
ControlNeXt: Powerful and Efficient Control for Image and Video Generation • Paper • 2408.06070 • Published • 52
Qwen2-Audio Technical Report • Paper • 2407.10759 • Published • 54
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression • Paper • 2407.12077 • Published • 54
Compact Language Models via Pruning and Knowledge Distillation • Paper • 2407.14679 • Published • 38
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models • Paper • 2407.15841 • Published • 39
KAN or MLP: A Fairer Comparison • Paper • 2407.16674 • Published • 42
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence • Paper • 2407.16655 • Published • 28
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person • Paper • 2407.16224 • Published • 25
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization • Paper • 2408.02555 • Published • 28
Mixture of Nested Experts: Adaptive Processing of Visual Tokens • Paper • 2407.19985 • Published • 35
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model • Paper • 2407.16982 • Published • 40
VILA^2: VILA Augmented VILA • Paper • 2407.17453 • Published • 39
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer • Paper • 2408.06072 • Published • 35
Paper • 2408.07009 • Published • 61
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models • Paper • 2408.08872 • Published • 97
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model • Paper • 2408.10198 • Published • 32
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model • Paper • 2408.11039 • Published • 56
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning • Paper • 2408.11001 • Published • 11
Sapiens: Foundation for Human Vision Models • Paper • 2408.12569 • Published • 89
DreamCinema: Cinematic Transfer with Free Camera and 3D Character • Paper • 2408.12601 • Published • 28
Scalable Autoregressive Image Generation with Mamba • Paper • 2408.12245 • Published • 25
Building and better understanding vision-language models: insights and future directions • Paper • 2408.12637 • Published • 121
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation • Paper • 2408.13252 • Published • 23
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher • Paper • 2408.14176 • Published • 60
Foundation Models for Music: A Survey • Paper • 2408.14340 • Published • 43
Diffusion Models Are Real-Time Game Engines • Paper • 2408.14837 • Published • 121
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders • Paper • 2408.15998 • Published • 83
CogVLM2: Visual Language Models for Image and Video Understanding • Paper • 2408.16500 • Published • 56
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling • Paper • 2408.16532 • Published • 47
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model • Paper • 2408.16767 • Published • 29
CSGO: Content-Style Composition in Text-to-Image Generation • Paper • 2408.16766 • Published • 17
CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization • Paper • 2408.15914 • Published • 22
LinFusion: 1 GPU, 1 Minute, 16K Image • Paper • 2409.02097 • Published • 32
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency • Paper • 2409.02634 • Published • 90
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing • Paper • 2409.01322 • Published • 94
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation • Paper • 2409.03718 • Published • 25
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models • Paper • 2404.12387 • Published • 38
Dynamic Typography: Bringing Words to Life • Paper • 2404.11614 • Published • 44
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone • Paper • 2404.14219 • Published • 253
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding • Paper • 2404.16710 • Published • 74
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites • Paper • 2404.16821 • Published • 53
Iterative Reasoning Preference Optimization • Paper • 2404.19733 • Published • 47
KAN: Kolmogorov-Arnold Networks • Paper • 2404.19756 • Published • 108
OmniGen: Unified Image Generation • Paper • 2409.11340 • Published • 108
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation • Paper • 2409.08240 • Published • 18
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models • Paper • 2409.07452 • Published • 19
Towards a Unified View of Preference Learning for Large Language Models: A Survey • Paper • 2409.02795 • Published • 72
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think • Paper • 2409.11355 • Published • 28
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion • Paper • 2409.11406 • Published • 25
Qwen2.5-Coder Technical Report • Paper • 2409.12186 • Published • 136
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published • 74
VideoPoet: A Large Language Model for Zero-Shot Video Generation • Paper • 2312.14125 • Published • 44
Training Language Models to Self-Correct via Reinforcement Learning • Paper • 2409.12917 • Published • 135
Imagine yourself: Tuning-Free Personalized Image Generation • Paper • 2409.13346 • Published • 67
Colorful Diffuse Intrinsic Image Decomposition in the Wild • Paper • 2409.13690 • Published • 12
Emu3: Next-Token Prediction is All You Need • Paper • 2409.18869 • Published • 91
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning • Paper • 2409.20566 • Published • 52
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos • Paper • 2409.19603 • Published • 18
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis • Paper • 2410.01804 • Published • 5
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models • Paper • 2410.02740 • Published • 52
Loong: Generating Minute-level Long Videos with Autoregressive Language Models • Paper • 2410.02757 • Published • 36
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning • Paper • 2410.01044 • Published • 34
PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation • Paper • 2410.01680 • Published • 32
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second • Paper • 2410.02073 • Published • 40
Baichuan-Omni Technical Report • Paper • 2410.08565 • Published • 84
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation • Paper • 2410.08159 • Published • 24
Animate-X: Universal Character Image Animation with Enhanced Motion Representation • Paper • 2410.10306 • Published • 53
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices • Paper • 2410.11795 • Published • 16
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree • Paper • 2410.16268 • Published • 65
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes • Paper • 2410.17249 • Published • 39
Movie Gen: A Cast of Media Foundation Models • Paper • 2410.13720 • Published • 89
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens • Paper • 2410.13863 • Published • 35
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors • Paper • 2410.16271 • Published • 80
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation • Paper • 2410.13861 • Published • 53
Unbounded: A Generative Infinite Game of Character Life Simulation • Paper • 2410.18975 • Published • 34
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss • Paper • 2410.17243 • Published • 88
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think • Paper • 2410.06940 • Published • 6
Addition is All You Need for Energy-efficient Language Models • Paper • 2410.00907 • Published • 144
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation • Paper • 2410.13848 • Published • 29
Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations • Paper • 2410.10792 • Published • 26
CLEAR: Character Unlearning in Textual and Visual Modalities • Paper • 2410.18057 • Published • 200
In-Context LoRA for Diffusion Transformers • Paper • 2410.23775 • Published • 10