2 Million Bluesky Posts
This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.
The with-language-predictions config contains the same data as the default config but with language predictions added using the glotlid model.
Dataset Details
Dataset Description
This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.
- Curated by: Alpin Dale
- Language(s) (NLP): Multiple (primarily English)
- License: Dataset usage is subject to Bluesky's Terms of Service
Uses
This dataset could be used for:
- Training and testing language models on social media content
- Analyzing social media posting patterns
- Studying conversation structures and reply networks
- Research on social media content moderation
- Natural language processing tasks using social media data
Dataset Structure
The dataset is available in two configurations:
Default Configuration
Contains the following fields for each post:
- text: The main content of the post
- created_at: Timestamp of post creation
- author: The Bluesky handle of the post author
- uri: Unique identifier for the post
- has_images: Boolean indicating if the post contains images
- reply_to: URI of the parent post if this is a reply (null otherwise)
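As a rough illustration of how the uri and reply_to fields relate, the sketch below groups replies under their parent post. It assumes each record is available as a plain Python dict with the fields listed above. The sample records, their URIs, handles, and timestamp strings are hypothetical placeholders, not rows from the dataset, and build_reply_index is an illustrative helper name rather than part of any library.

```python
from collections import defaultdict
from typing import Dict, List, Optional, TypedDict


class Post(TypedDict):
    """One record of the default configuration (field names from the list above)."""
    text: str
    created_at: str  # assumed here to be a string; the exact format is not specified
    author: str
    uri: str
    has_images: bool
    reply_to: Optional[str]


def build_reply_index(posts: List[Post]) -> Dict[str, List[str]]:
    """Map each parent URI to the URIs of its direct replies."""
    children: Dict[str, List[str]] = defaultdict(list)
    for post in posts:
        parent = post["reply_to"]
        if parent is not None:  # top-level posts have reply_to set to null/None
            children[parent].append(post["uri"])
    return dict(children)


# Hypothetical sample records, for illustration only.
sample_posts: List[Post] = [
    {"text": "hello", "created_at": "2024-01-01T00:00:00Z", "author": "alice.example",
     "uri": "post/root", "has_images": False, "reply_to": None},
    {"text": "hi!", "created_at": "2024-01-01T00:01:00Z", "author": "bob.example",
     "uri": "post/a", "has_images": False, "reply_to": "post/root"},
    {"text": "welcome", "created_at": "2024-01-01T00:02:00Z", "author": "carol.example",
     "uri": "post/b", "has_images": True, "reply_to": "post/root"},
    {"text": "thanks", "created_at": "2024-01-01T00:03:00Z", "author": "alice.example",
     "uri": "post/c", "has_images": False, "reply_to": "post/a"},
]

reply_index = build_reply_index(sample_posts)
```

Note that a reply's parent may fall outside the 2 million collected posts, so a key in the index is not guaranteed to correspond to a record in the dataset.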
With Language Predictions Configuration
Contains all fields from the default configuration plus:
- predicted_language: The predicted language code (e.g., eng_Latn, deu_Latn)
- language_confidence: Confidence score for the language prediction (0-1)
Language predictions were added using the glotlid model via fasttext.
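One possible use of the two extra fields is to keep only posts whose predicted language matches a target code with sufficient confidence. The sketch below assumes records are plain Python dicts. The function name filter_by_language, the 0.8 default threshold, and the sample rows are illustrative assumptions, not values recommended by the dataset author.

```python
from typing import Any, Dict, List


def filter_by_language(
    posts: List[Dict[str, Any]],
    language: str,
    min_confidence: float = 0.8,  # arbitrary illustrative threshold
) -> List[Dict[str, Any]]:
    """Keep posts predicted as `language` with confidence >= min_confidence."""
    return [
        post
        for post in posts
        if post["predicted_language"] == language
        and post["language_confidence"] >= min_confidence
    ]


# Hypothetical rows showing only the fields relevant to the filter.
sample_rows = [
    {"text": "good morning", "predicted_language": "eng_Latn", "language_confidence": 0.97},
    {"text": "guten Morgen", "predicted_language": "deu_Latn", "language_confidence": 0.97},
    {"text": "ok", "predicted_language": "eng_Latn", "language_confidence": 0.41},
]

english_rows = filter_by_language(sample_rows, "eng_Latn")
```

Very short posts tend to receive low-confidence predictions from language identifiers in general, so the threshold trades coverage against label noise.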
Bias, Risks, and Limitations
The goal of this dataset is for you to have fun :)