I am not sure whether anyone has previously demonstrated a GPT model with 8-float embeddings and two 4-float attention heads (approximately 9,526 parameters, about one megabyte on disk) that produced coherent text (the Gettysburg Address) with correct punctuation in response to a one-word prompt.
This result is reproducible: I have achieved it in practically all training runs with 'n_embd': 8, 'n_layer': 1, 'n_head': 2, 'n_inner': 128 (with some undisclosed modifications to model.py).
The model is believed to be capable of correctly reciting the entire Gettysburg Address; to save training time, training was limited to the beginning of that famous speech.
This last training run was repeated to generate a PyTorch parameter checkpoint on disk for parameter-measurement purposes.
model_checkpoint_epoch_30000_Nano_Gettysburg_GPT2_v1.5_loss_DisplayPerBatch.py_2024-11-26_00-48-36.pth
Size on disk: 1.08 MB (1,138,688 bytes)
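As a cross-check, the parameter count and raw tensor size can be read directly from such a checkpoint. The following is a minimal sketch (not the author's script), assuming the .pth file holds a state_dict either directly or under a 'model_state_dict' key:

import torch

ckpt = torch.load(
    "model_checkpoint_epoch_30000_Nano_Gettysburg_GPT2_v1.5_loss_DisplayPerBatch.py_2024-11-26_00-48-36.pth",
    map_location="cpu",
)
# Accept either a bare state_dict or a wrapper dict (assumed key name).
state_dict = ckpt["model_state_dict"] if isinstance(ckpt, dict) and "model_state_dict" in ckpt else ckpt

tensors = [t for t in state_dict.values() if torch.is_tensor(t)]
num_params = sum(t.numel() for t in tensors)
raw_bytes = sum(t.numel() * t.element_size() for t in tensors)
print(f"parameters: {num_params:,}  raw tensor bytes: {raw_bytes:,}")
# Note: the 1.08 MB file size can exceed the raw tensor bytes because the
# checkpoint may also contain optimizer state, metadata, and pickle framing.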
Google Gemini estimates that this nano LLM has about 9,526 parameters (I have a different estimate, based on my undisclosed modifications).
Essential hyperparameters (plus undisclosed modifications): 'n_embd': 8, 'n_layer': 1, 'n_head': 2, 'n_inner': 128

########################## OUTPUT PRINTED TO CMD CONSOLE ON A WINDOWS 10 LAPTOP ##############################
Epoch 29901/30000, Loss: 0.5203
Current learning rate: 0.000245
...
[Last] Batch Losses: 0.4611
HyperParameters = {'vocab_size': [withheld], 'special_tokens': ['<[withheld]'], 'n_embd': 8, 'n_layer': 1, 'n_head': 2, 'n_inner': 128, 'max_sequence_len': [withheld], 'epochs': 30000, 'learning_rate': 0.001, 'batch_size': [withheld], 'dropout': 0.2}
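For orientation only, here is roughly what that configuration looks like when plugged into a stock Hugging Face GPT-2. This is a sketch, not the author's modified model.py, and vocab_size=174 / n_positions=512 are placeholders taken from the figures quoted in the parameter breakdown further down (the actual values are withheld):

from transformers import GPT2Config, GPT2LMHeadModel

# Stock GPT-2 built from the disclosed hyperparameters; placeholders marked below.
config = GPT2Config(
    vocab_size=174,    # placeholder taken from the breakdown below; actual value withheld
    n_positions=512,   # placeholder taken from the breakdown below; actual value withheld
    n_embd=8,
    n_layer=1,
    n_head=2,
    n_inner=128,
    embd_pdrop=0.2,
    attn_pdrop=0.2,
    resid_pdrop=0.2,
)
model = GPT2LMHeadModel(config)

# The stock implementation ties the LM head to the token embeddings and adds
# attention/FFN biases plus a final LayerNorm, so this count will differ from
# the 9,526 estimate discussed below.
print(sum(p.numel() for p in model.parameters()))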
--- Inference Examples --- at script line 507
Example 1: Recite the Gettysburg Address at script line 511
Prompt: Four
Total number of words and punctuation tokenized: 1
Response: four score and seven years ago our fathers brought forth , on this continent , a new nation , conceived in liberty , and dedicated to the proposition that all men are created equal . now we are engaged in a great civil war , testing whether that nation , or any nation so dedicated , or any nation so dedicated , can long endure . we have come to dedicate a new . we are met on a great battle - field of that sense , and so to dedicate a portion of that field , as a final resting - place for us that cause for us that those who here gave their lives , in final task remaining before us - that sense . that sense . it is altogether will us - that sense . but that sense . it is altogether for us - that that sense . it can never forget what they gave the world hatter bewildered badger baked blueberry strawberry biscuits popcorn fought of devotion - that that sense . it died that - that sense . the from that sense . the us - that sense . but that that - that that from . the from that cause for us to that - that that - that nation , for us to that that that that that that that - that - that that cause for us to that nation , and that - that - that that we of devotion - that that that that that that - that this ground . the us to that that that that this nation , to that that nation , to that that that that this war of the us to that ground . their lives , under god , under god , and that that nation , for - that that that that that we gave their lives , by the us the us to that this ground . apple score , by that this nation , under god , under god
1562 Words in the Response
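The inference step itself is ordinary autoregressive decoding from the single prompt word. Below is a generic sketch with hypothetical names throughout: word_to_id/id_to_word stand in for the withheld word-and-punctuation vocabulary, nano_gpt for the trained model, and greedy argmax decoding is used for simplicity (the author's script may sample instead):

import torch

def recite(nano_gpt, word_to_id, id_to_word, prompt="four", max_new_tokens=300):
    """Greedy autoregressive decoding from a one-word prompt (illustrative only)."""
    nano_gpt.eval()
    ids = [word_to_id[prompt]]
    with torch.no_grad():
        for _ in range(max_new_tokens):
            x = torch.tensor([ids])      # shape (1, seq_len)
            logits = nano_gpt(x)         # assumed output shape (1, seq_len, vocab_size)
            next_id = int(logits[0, -1].argmax())
            ids.append(next_id)
    return " ".join(id_to_word[i] for i in ids)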
#####################################################
Google Gemini estimates the parameters within this model as follows (the same arithmetic is repeated as a short Python check after the breakdown):
Let's break down the parameter count for your Nano GPT-2 model based on its hyperparameters:
- Token Embeddings:
Vocabulary Size: [withheld]
Embedding Dimension (n_embd): 8
Parameters: 174 * 8 = 1392
- Positional Embeddings:
Max Sequence Length: [withheld]
Embedding Dimension (n_embd): 8
Parameters: 512 * 8 = 4096
- Transformer Layers (n_layer = 1):
For each layer, we have the following components:
a) Self-Attention:
Query (Q), Key (K), and Value (V) matrices: each head's Q, K, and V matrices have dimensions (n_embd, n_embd / n_head) = (8, 8/2) = (8, 4), i.e. 8 * 4 = 32 parameters each. With n_head * 3 = 2 * 3 = 6 such matrices per layer, that is 6 * 32 = 192 parameters.
Output Projection: (n_embd, n_embd) = (8, 8) = 64 parameters
Total Self-Attention Parameters per layer: 192 + 64 = 256 parameters
b) Feed-Forward Network:
Layer 1: (n_embd, n_inner) = (8, 128) = 1024 parameters
Layer 2: (n_inner, n_embd) = (128, 8) = 1024 parameters
Bias vectors: n_inner + n_embd = 128 + 8 = 136 parameters
Total Feed-Forward Parameters per layer: 1024 + 1024 + 136 = 2184 parameters
c) Layer Normalization:
Two per layer (one before the attention block and one before the feed-forward block), assumed to have affine=True (learnable scale and bias).
2 * n_embd (scale) + 2 * n_embd (bias) = 4 * n_embd = 4 * 8 = 32 parameters
Total Layer Norm Parameters per layer: 32 parameters
Total Parameters per Transformer Layer: 256 + 2184 + 32 = 2472 parameters
- Output Layer (Language Model Head):
Final Linear Layer: (n_embd, vocab_size) = (8, [withheld]) = [withheld] parameters
Bias vector: vocab_size = 174 parameters
Total Output Layer Parameters: [] = 1566 parameters
- Total Parameters in the model:
Token Embeddings + Positional Embeddings + (n_layer * Layer Parameters) + Output Layer parameters
1392 + 4096 + [] = 9526 parameters
Therefore, your Nano GPT-2 model has approximately 9,526 parameters.
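The same arithmetic, written out as a short Python check (using the 174 and 512 figures quoted above for the withheld vocabulary size and maximum sequence length):

# Gemini's parameter-count breakdown, reproduced as plain arithmetic.
n_embd, n_head, n_layer, n_inner = 8, 2, 1, 128
vocab_size, max_seq_len = 174, 512   # figures quoted in the breakdown above

tok_emb   = vocab_size * n_embd                                     # 1,392
pos_emb   = max_seq_len * n_embd                                    # 4,096
attn      = 3 * n_embd * n_embd + n_embd * n_embd                   # Q,K,V + output projection = 256
ffn       = n_embd * n_inner + n_inner * n_embd + n_inner + n_embd  # weights + biases = 2,184
layernorm = 2 * (2 * n_embd)                                        # two LayerNorms, scale + bias = 32
per_layer = attn + ffn + layernorm                                  # 2,472
lm_head   = n_embd * vocab_size + vocab_size                        # 1,566

total = tok_emb + pos_emb + n_layer * per_layer + lm_head
print(total)  # 9526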