- Neural Machine Translation of Rare Words with Subword Units
  Paper • 1508.07909 • Published • 4
- A Formal Perspective on Byte-Pair Encoding
  Paper • 2306.16837 • Published • 2
- Byte-Pair Encoding for Text-to-SQL Generation
  Paper • 1910.08962 • Published • 2
- Pattern Discovery in Time Series with Byte Pair Encoding
  Paper • 2106.00614 • Published • 2

Collections including paper arxiv:2412.09871

- Functional Interpolation for Relative Positions Improves Long Context Transformers
  Paper • 2310.04418 • Published • 4
- SPBERT: An Efficient Pre-training BERT on SPARQL Queries for Question Answering over Knowledge Graphs
  Paper • 2106.09997 • Published • 2
- Neural Machine Translation of Rare Words with Subword Units
  Paper • 1508.07909 • Published • 4
- A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
  Paper • 2403.14438 • Published • 2

- SELF: Language-Driven Self-Evolution for Large Language Model
  Paper • 2310.00533 • Published • 2
- GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
  Paper • 2310.00576 • Published • 2
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
  Paper • 2305.13169 • Published • 3
- Transformers Can Achieve Length Generalization But Not Robustly
  Paper • 2402.09371 • Published • 13

- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 16
- Transformers Can Achieve Length Generalization But Not Robustly
  Paper • 2402.09371 • Published • 13
- A Thorough Examination of Decoding Methods in the Era of LLMs
  Paper • 2402.06925 • Published • 1
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 76

- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 79
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5

- Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
  Paper • 2401.02994 • Published • 49
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 52
- Repeat After Me: Transformers are Better than State Space Models at Copying
  Paper • 2402.01032 • Published • 22
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 23

- Chain-of-Verification Reduces Hallucination in Large Language Models
  Paper • 2309.11495 • Published • 37
- Adapting Large Language Models via Reading Comprehension
  Paper • 2309.09530 • Published • 77
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
  Paper • 2309.09400 • Published • 84
- Language Modeling Is Compression
  Paper • 2309.10668 • Published • 83