- XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
  Paper • 2404.15420 • Published • 7
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
  Paper • 2404.14619 • Published • 124
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Paper • 2404.14219 • Published • 251
- How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
  Paper • 2404.14047 • Published • 44

Collections including the paper arxiv:2403.09919 (Recurrent Drafter for Fast Speculative Decoding in Large Language Models):
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
  Paper • 2403.10301 • Published • 51
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
  Paper • 2403.09919 • Published • 20
- RAFT: Adapting Language Model to Domain Specific RAG
  Paper • 2403.10131 • Published • 67
- Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
  Paper • 2403.09704 • Published • 31

- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
  Paper • 2403.09919 • Published • 20
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
  Paper • 2305.09781 • Published • 4
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
  Paper • 2408.04093 • Published • 4

- Measuring the Effects of Data Parallelism on Neural Network Training
  Paper • 1811.03600 • Published • 2
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
  Paper • 1804.04235 • Published • 2
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
  Paper • 1905.11946 • Published • 3
- Yi: Open Foundation Models by 01.AI
  Paper • 2403.04652 • Published • 62

- AtP*: An efficient and scalable method for localizing LLM behaviour to components
  Paper • 2403.00745 • Published • 11
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 602
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
  Paper • 2402.16840 • Published • 23
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
  Paper • 2402.13753 • Published • 111

- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
  Paper • 2310.04406 • Published • 8
- Chain-of-Thought Reasoning Without Prompting
  Paper • 2402.10200 • Published • 99
- ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
  Paper • 2402.09320 • Published • 6
- Self-Discover: Large Language Models Self-Compose Reasoning Structures
  Paper • 2402.03620 • Published • 109

- Speculative Streaming: Fast LLM Inference without Auxiliary Models
  Paper • 2402.11131 • Published • 41
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
  Paper • 2402.13720 • Published • 5
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
  Paper • 2403.09919 • Published • 20
- On Speculative Decoding for Multimodal Large Language Models
  Paper • 2404.08856 • Published • 13

- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 78
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5

- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
  Paper • 2401.12522 • Published • 11
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
  Paper • 2402.05099 • Published • 18
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
  Paper • 2402.04291 • Published • 48
- Shortened LLaMA: A Simple Depth Pruning for Large Language Models
  Paper • 2402.02834 • Published • 14