its5Q 
posted an update Aug 29
Continuing my streak by releasing the Wikireading dataset: a large collection of scraped non-fiction books, predominantly in Russian.
its5Q/wikireading

Here are the highlights:
- ~7B tokens, or ~28B characters, making it a great candidate for use in pretraining
- Contains non-fiction works from many knowledge domains
- Includes both the original HTML and extracted text of book chapters
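
For a rough sense of scale, the quoted totals work out to about 4 characters per token. Here is a minimal sketch of that arithmetic, plus a way to peek at a few records. The helper names (`chars_per_token`, `estimate_tokens`, `peek`) are made up for illustration, and the `"train"` split is an assumption: check the dataset card for the actual splits, configuration names, and columns before relying on it.

```python
from itertools import islice

# Approximate figures quoted in the post.
TOTAL_TOKENS = 7e9
TOTAL_CHARS = 28e9


def chars_per_token(total_chars=TOTAL_CHARS, total_tokens=TOTAL_TOKENS):
    """Average characters per token implied by the quoted totals."""
    return total_chars / total_tokens


def estimate_tokens(text, ratio=None):
    """Very rough token estimate for a string, using the corpus-wide ratio."""
    if ratio is None:
        ratio = chars_per_token()
    return round(len(text) / ratio)


def peek(n=3, split="train"):
    """Stream the first n records instead of downloading everything.

    Assumption: a "train" split exists and no configuration name is needed.
    Requires the `datasets` library, imported lazily so the helpers above
    work without it.
    """
    from datasets import load_dataset

    ds = load_dataset("its5Q/wikireading", split=split, streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    print(chars_per_token())
    print(estimate_tokens("a" * 40))
```

Streaming is probably the safer default here given the size; the real ratio will depend on the tokenizer used.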