
A bit about me.

Growing up, I cared about computers, but I was even more fascinated by language. In college I studied philosophy and math. I cared about the structure of reasoning, the limits of knowledge, and the nature of language.

ChatGPT took the world by storm during my senior year. Before 2022, I thought philosophy was the best subject for studying the nature of language and reason. But philosophy hasn't seen much progress over the past 40 years, while transformer-based architectures have reshaped the study of language almost overnight.

The world of tech is trying to teach computers how to reason. I want to contribute to that process.

This is my first contribution: a small language model, about 3 million parameters, trained from scratch. It was pretrained on the Western Canon (or at least on 70 million tokens of it), then finetuned on the essays I wrote in college.

The dataset is also on Hugging Face (linked below). It is still a work in progress, but I believe it is the best open-source repository of its kind.

To set up the environment, ensure you have Conda installed. Then, run the following command to create the environment with all necessary dependencies:

```bash
conda env create -f environment.yml
```
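Then activate the environment before running any of the scripts. The name below is just a placeholder; use whatever name environment.yml defines:

```bash
# Replace plato_v1 with the environment name set in environment.yml.
conda activate plato_v1
```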

To test the model, just run sample.py. The weights are in test_safetensors/finetuned_plato_v1_step_880.safetensors. The GPT module is in gpt_layer.py, and the code I used to train the model is in train_on_canon.py.
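If you want to poke at the checkpoint directly, here is a minimal sketch using the safetensors library. It assumes only that the file is a standard PyTorch safetensors state dict; sample.py remains the supported way to actually run the model.

```python
# Minimal sketch: inspect the finetuned checkpoint directly.
# Assumes a standard PyTorch safetensors state dict.
from safetensors.torch import load_file

state_dict = load_file("test_safetensors/finetuned_plato_v1_step_880.safetensors")

# Print each tensor name and shape, plus a rough parameter count.
total = 0
for name, tensor in state_dict.items():
    print(f"{name}: {tuple(tensor.shape)}")
    total += tensor.numel()
print(f"~{total / 1e6:.1f}M parameters")
```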

You can skip trimmingcanon.py. It is a collection of scripts I used to remove noise from the dataset.

The dataset for this model is on Hugging Face.

https://huggingface.co/datasets/wordgrammer/The_Entire_Western_Canon
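If you want to explore the corpus yourself, here is a minimal sketch using the Hugging Face datasets library. The split and column layout are assumptions; check the dataset card for the actual structure.

```python
# Minimal sketch: load the pretraining corpus from the Hub.
# Split and column names are not guaranteed; see the dataset card.
from datasets import load_dataset

canon = load_dataset("wordgrammer/The_Entire_Western_Canon")
print(canon)  # shows the available splits and columns

# Peek at the first record of the first split.
first_split = next(iter(canon.values()))
print(first_split[0])
```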

This model is not very good. A 3 million parameter model is not large enough to understand philosophy. In fact, most of the frontier models struggle to understand philosophy. I made this primarily as a way to understand the inner workings of a transformer from first principles. I would only recommend building a language model from scratch for this reason. It is much more cost-effective to finetune an open source model. But as an educational experience, training a model from scratch is invaluable.

I will not update this repository further.

Many thanks to the following resources: