Cognition
Perception and abstraction. Each modality is tokenized and embedded into vectors for the model to comprehend.
Paper • 2407.17453 • Published • 38
Note: A general model is not great at specialized tasks. A narrow-domain fine-tuned checkpoint becomes better at specific tasks, and such local improvements can feed back into the full training dataset, enabling self-augmentation-based improvement. This is an interesting idea.
Octopus v4: Graph of language models
Paper • 2404.19296 • Published • 118
Note: Uses a small language model to search the graph and route queries to the domain expert.
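The routing idea in the note above can be sketched as follows. Everything here is illustrative (`small_router`, `route_query`, the `EXPERTS` registry, and the keyword rules are hypothetical stand-ins, not Octopus v4's actual models or API); the point is only the shape: a small model picks a domain, and the query is forwarded to that domain's expert.

```python
# Hypothetical sketch of graph-of-models routing: a small "router" model
# maps a query to a domain label, then the query is handed to that
# domain's expert model. All names are illustrative, not the paper's API.

EXPERTS = {
    "math": lambda q: f"[math expert answers: {q}]",
    "code": lambda q: f"[code expert answers: {q}]",
    "general": lambda q: f"[general expert answers: {q}]",
}

def small_router(query: str) -> str:
    """Stand-in for a small LM: keyword rules replace learned routing."""
    lowered = query.lower()
    if any(tok in lowered for tok in ("integral", "sum", "prove")):
        return "math"
    if any(tok in lowered for tok in ("python", "bug", "function")):
        return "code"
    return "general"

def route_query(query: str) -> str:
    # Route to the chosen expert; fall back to the general model.
    expert = EXPERTS.get(small_router(query), EXPERTS["general"])
    return expert(query)

print(route_query("Fix this Python function"))
```

The design point is that the router only needs to emit a short domain label, so it can be a much smaller model than any of the experts it dispatches to.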
Octo-planner: On-device Language Model for Planner-Action Agents
Paper • 2406.18082 • Published • 47
Note: Automatic flow engineering done by a fine-tuned 3B LLM, grounded in a selective set of API-based functions. The planning model performs task decomposition but does not make specific calls, effectively doing flow (prompt) engineering. Topology in the plans is lacking, and the static plan-ahead approach is less robust (although it scores well on their curated 1k-example test dataset).
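The plan-ahead split described in the note can be sketched like this. The function names and keyword rules are hypothetical, not Octo-planner's actual API; what the sketch shows is that the planner emits only a static sequence of step names and never fills in arguments or executes calls.

```python
# Hypothetical sketch of a plan-ahead planner: decompose a task into a
# static list of function names drawn from a fixed set, without making
# any actual calls. Names and rules are illustrative, not the paper's.

AVAILABLE_FUNCTIONS = ["take_photo", "set_alarm", "send_message", "search_web"]

def plan(task: str) -> list[str]:
    """Toy keyword-based decomposition standing in for the fine-tuned 3B planner."""
    lowered = task.lower()
    steps = []
    if "photo" in lowered:
        steps.append("take_photo")
    if "send" in lowered or "share" in lowered:
        steps.append("send_message")
    # Fall back to a generic step when nothing matched.
    return steps or ["search_web"]

print(plan("take a photo and send it to Alice"))  # a static plan, no execution
```

Because the whole plan is fixed up front, a failed or mis-ordered step cannot be revised mid-execution, which is the robustness limitation the note raises.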
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Paper • 2407.18219 • Published • 3
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
Paper • 2409.08596 • Published • 1
What Makes a Maze Look Like a Maze?
Paper • 2409.08202 • Published • 1
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Paper • 2408.15518 • Published • 41
One missing piece in Vision and Language: A Survey on Comics Understanding
Paper • 2409.09502 • Published • 23
Iterative Graph Alignment
Paper • 2408.16667 • Published • 2
Do Pre-trained Vision-Language Models Encode Object States?
Paper • 2409.10488 • Published • 1
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 49
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 123
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 39
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 30
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper • 2404.03413 • Published • 25
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 53
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 92
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 46
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 74
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 47