The TokenFormer is a fully attention-based architecture that unifies the computations of token-token and token-parameter interactions by entirely employing the attention mechanism, maximizes the flexibility of neural network.(see paper). It contains four models of sizes 150M, 450M, 900M, 1.5B. For each size, it's trained based on gpt-neox code base and uses Pile with 300B tokens. All 4 model sizes are trained on the exact same data, in the exact same order.

TokenFormer-150M

Model Details

Developed by: Haiyang Wang
Model type: TokenFormer-based Language Model
Language: English
Learn more: TokenFormer's GitHub repository for training procedure, config files, and details on how to use. See paper for more evals and implementation details.
Library: GPT-NeoX
License: Apache 2.0
Contact: to ask questions about this model, please email Haiyang Wang.

TokenFormer model	Layers	#QKV Param Tokens	#Output Param Tokens	#FFN Param Tokens	Model Dim	Heads	Batch Size	Learning Rate	Training Iterations
150M	12	768	768	3072	768	12	2M	6.0 x 10^-4	143000
450M	24	1024	1024	4096	1024	16	2M	6.0 x 10^-4	143000
900M	32	1280	1280	5120	1280	16	2M	6.0 x 10^-4	143000
1.5B	40	1536	1536	6144	1536	16	2M	6.0 x 10^-4	143000

Engineering details for the TokenFormer.

Training

Training data

The Pile is a 825GiB general-purpose dataset in English. It was created by EleutherAI specifically for training large language models. It contains texts from 22 diverse sources, roughly broken down into five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub, Enron Emails). See the Pile paper for a breakdown of all data sources, methodology, and a discussion of ethical implications. Consult the datasheet for more detailed documentation about the Pile and its component datasets. The Pile can be downloaded from the official website, or from a community mirror.

Training procedure

We follow the default training strategy of Pythia in gpt-neox, including the dataset processing, hyper-parameter and code base. All models were trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 tokens during training.

All TokenFormer models trained for 143000 steps at a batch size of 2M (2,097,152 tokens).
See GitHub for more details on training procedure.
TokenFormer uses the same tokenizer as GPT-NeoX- 20B.

Evaluations

All TokenFormer models were evaluated using the LM Evaluation Harness. You can run the evaluation with our instruction.
Expand the sections below to see plots of evaluation results for all TokenFormer compared with Opensource Transformer-based LLMs.

Model	#Param	LAMBADA	HellaSwag	PIQA	Arc-E	Arc-C	WinoGrande	Average
Pythia	150M	35.4	30.3	62.3	43.6	23.6	51.3	40.1
TokenFormer	150M	45.0	35.5	64.9	47.3	24.9	50.4	44.7
Pythia	410M	51.4	40.6	66.9	52.1	24.6	53.8	48.2
TokenFormer	450M	57.3	47.5	69.5	56.2	26.7	54.6	52.0
Pythia	1B	56.1	47.2	70.7	57.0	27.1	53.5	51.9
TokenFormer	900M	64.0	55.3	72.4	59.9	30.6	56.4	56.4
GPT-Neo	1.3B	57.2	48.9	71.1	56.2	25.9	54.9	52.4
OPT	1.3B	58.0	53.7	72.4	56.7	29.6	59.5	55.0
Pythia	1.3B	61.7	52.1	71.0	60.5	28.5	57.2	55.2
GPT-Neo	2.7B	62.2	55.8	71.1	61.1	30.2	57.6	56.5
OPT	2.7B	63.6	60.6	74.8	60.8	31.3	61.0	58.7
Pythia	2.8B	64.7	59.3	74.0	64.1	32.9	59.7	59.1
TokenFormer	1.5B	64.7	60.0	74.8	64.8	32.0	59.7	59.3

Zero-shot evaluation of Language Modeling.