Cognition
Perception and abstraction. Each modality is tokenized and embedded into vectors for the model to comprehend.
Paper • 2407.17453 • Published • 38
Note: A general model is not great at specialized tasks. A narrow-domain fine-tuned checkpoint becomes better at specific tasks, and such local improvements can feed back into the full training dataset, enabling self-augmentation-based improvement. This is an interesting idea.
Octopus v4: Graph of language models
Paper • 2404.19296 • Published • 118
Note: Uses a small language model to search the graph and route queries to the domain expert.
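The routing idea in the note above can be sketched as follows. Everything here is illustrative (`small_router`, `route_query`, the `EXPERTS` registry, and the keyword rules are hypothetical stand-ins, not Octopus v4's actual models or API); the point is only the shape: a small model picks a domain, and the query is forwarded to that domain's expert.

```python
# Hypothetical sketch of graph-of-models routing: a small "router" model
# maps a query to a domain label, then the query is handed to that
# domain's expert model. All names are illustrative, not the paper's API.

EXPERTS = {
    "math": lambda q: f"[math expert answers: {q}]",
    "code": lambda q: f"[code expert answers: {q}]",
    "general": lambda q: f"[general expert answers: {q}]",
}

def small_router(query: str) -> str:
    """Stand-in for a small LM: keyword rules replace learned routing."""
    lowered = query.lower()
    if any(tok in lowered for tok in ("integral", "sum", "prove")):
        return "math"
    if any(tok in lowered for tok in ("python", "bug", "function")):
        return "code"
    return "general"

def route_query(query: str) -> str:
    # Route to the chosen expert; fall back to the general model.
    expert = EXPERTS.get(small_router(query), EXPERTS["general"])
    return expert(query)

print(route_query("Fix this Python function"))
```

The design point is that the router only needs to emit a short domain label, so it can be a much smaller model than any of the experts it dispatches to.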
Octo-planner: On-device Language Model for Planner-Action Agents
Paper • 2406.18082 • Published • 47
Note: Automatic flow engineering done by a fine-tuned 3B LLM, grounded in a selective set of API-based functions. The planning model performs task decomposition but does not make specific calls, effectively doing flow (prompt) engineering. Topology in the plans is lacking, and the static plan-ahead approach is less robust (although it scores well on their curated 1k-example test dataset).
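The plan-ahead split described in the note can be sketched like this. The function names and keyword rules are hypothetical, not Octo-planner's actual API; what the sketch shows is that the planner emits only a static sequence of step names and never fills in arguments or executes calls.

```python
# Hypothetical sketch of a plan-ahead planner: decompose a task into a
# static list of function names drawn from a fixed set, without making
# any actual calls. Names and rules are illustrative, not the paper's.

AVAILABLE_FUNCTIONS = ["take_photo", "set_alarm", "send_message", "search_web"]

def plan(task: str) -> list[str]:
    """Toy keyword-based decomposition standing in for the fine-tuned 3B planner."""
    lowered = task.lower()
    steps = []
    if "photo" in lowered:
        steps.append("take_photo")
    if "send" in lowered or "share" in lowered:
        steps.append("send_message")
    # Fall back to a generic step when nothing matched.
    return steps or ["search_web"]

print(plan("take a photo and send it to Alice"))  # a static plan, no execution
```

Because the whole plan is fixed up front, a failed or mis-ordered step cannot be revised mid-execution, which is the robustness limitation the note raises.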
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Paper • 2407.18219 • Published • 3
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
Paper • 2409.08596 • Published • 1
What Makes a Maze Look Like a Maze?
Paper • 2409.08202 • Published • 1
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Paper • 2408.15518 • Published • 41
One missing piece in Vision and Language: A Survey on Comics Understanding
Paper • 2409.09502 • Published • 23
Iterative Graph Alignment
Paper • 2408.16667 • Published • 2
Do Pre-trained Vision-Language Models Encode Object States?
Paper • 2409.10488 • Published • 1
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 49
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 123
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 39
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 30
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper • 2404.03413 • Published • 25
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 53
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 92
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 46
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 74
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 47