Abstract
We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
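For context on the headline numbers: pass@1 on HumanEval and MBPP is conventionally estimated with the unbiased pass@k estimator of Chen et al. (2021). The sketch below shows that standard estimator; the function name and the sample counts in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples is correct, given n generated samples per
    problem of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative only: 200 samples for one problem, 101 of them passing
# gives pass@1 = c / n = 0.505.
print(pass_at_k(200, 101, 1))
```

For k = 1 the estimator reduces to the fraction of samples that pass, so a reported pass@1 of 50.6% corresponds to roughly half of sampled completions passing the benchmark's unit tests on average.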
Community
Will the source code and dataset be published?
Any chance of the models being published?
I have been saying this from the start. I don't know why we've been training LLMs on garbage conversations from humans.
We could train LLMs to be experts in kung fu and then install them in our brains with a 'neuralink', like in the Matrix.
Just get every book off of Z-Library (including academic papers) and shove that through an NLP model.
Very interesting project. I still wonder: will the source code and dataset be public?
I see the model being deployed as an Azure OpenAI service. I don't think it will be public.
This idea has been around for a long time. I think the paper should have cited this: https://youtu.be/WnTKllDbu5o?t=41
Are the code or model weights released anywhere? I could not find them on the Internet.
Found this dataset that is inspired by the paper, but it is not clear how it was created:
https://huggingface.co/datasets/nampdn-ai/tiny-codes
teleprint-me/phi-1, a small snippet of phi-1.
HF code and dataset?
Where can we get the synthetic textbook datasets that were used to train phi-1?