Plans to release the training dataset?

#44
by monology - opened

At the time of writing, all community efforts to create synthetic datasets like the one used for Phi-1.5 fall short, either in the quality of the synthetic generations or in the sheer size of the synthetic corpus.
Releasing the data used to train Phi-1.5 would be greatly beneficial for further research into the impact of synthetic datasets on large language models.
Would love to hear a response from one of the authors of the Phi-1.5 technical report about whether the community can expect to see the dataset or a subset of it released under any license or usage conditions.

Microsoft org

Hello @monology!

Unfortunately, we are not able to release the dataset at the moment. However, there are some amazing attempts to create public versions, such as https://huggingface.co/datasets/nampdn-ai/tiny-textbooks and https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-textbooks-are-all-you-need.

gugarosa changed discussion status to closed

any updates on that topic?

any updates on this topic?
