---
license: apache-2.0
language:
- en
---

## 2 Million Bluesky Posts

This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

The `with-language-predictions` config contains the same data as the default config, with language predictions added using the glotlid model.

## Dataset Details

### Dataset Description

This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.

- **Curated by**: Alpin Dale
- **Language(s) (NLP)**: Multiple (primarily English)
- **License**: Dataset usage is subject to Bluesky's Terms of Service

## Uses

This dataset could be used for:

- Training and testing language models on social media content
- Analyzing social media posting patterns
- Studying conversation structures and reply networks
- Research on social media content moderation
- Natural language processing tasks using social media data

## Dataset Structure

The dataset is available in two configurations:

### Default Configuration

Contains the following fields for each post:

- **text**: The main content of the post
- **created_at**: Timestamp of post creation
- **author**: The Bluesky handle of the post author
- **uri**: Unique identifier for the post
- **has_images**: Boolean indicating if the post contains images
- **reply_to**: URI of the parent post if this is a reply (null otherwise)

### With Language Predictions Configuration

Contains all fields from the default configuration plus:

- **predicted_language**: The predicted language code (e.g., eng_Latn, deu_Latn)
- **language_confidence**: Confidence score for the language prediction (0-1)

Language predictions were added using the [glotlid](https://huggingface.co/cis-lmu/glotlid) model via fasttext.
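
As a rough illustration of how the fields above fit together, the following sketch filters posts by predicted language and groups replies under their parent post. It is not part of the dataset tooling: the helper names `filter_by_language` and `build_reply_index`, the `0.8` confidence threshold, and every value in `SAMPLE_POSTS` (including the timestamp format and the placeholder URIs) are assumptions made up for the example. Only the field names come from the description above.

```python
from collections import defaultdict

# Hypothetical records that mirror the documented field names.
# All values are invented placeholders, not rows from the dataset.
SAMPLE_POSTS = [
    {"text": "Hello Bluesky!", "created_at": "2024-11-27T10:00:00Z",
     "author": "alice.example", "uri": "post-1", "has_images": False,
     "reply_to": None,
     "predicted_language": "eng_Latn", "language_confidence": 0.98},
    {"text": "Hallo zusammen!", "created_at": "2024-11-27T10:01:00Z",
     "author": "bob.example", "uri": "post-2", "has_images": False,
     "reply_to": "post-1",
     "predicted_language": "deu_Latn", "language_confidence": 0.95},
    {"text": "lol", "created_at": "2024-11-27T10:02:00Z",
     "author": "carol.example", "uri": "post-3", "has_images": True,
     "reply_to": "post-1",
     "predicted_language": "eng_Latn", "language_confidence": 0.42},
    {"text": "Nice to meet you both.", "created_at": "2024-11-27T10:03:00Z",
     "author": "alice.example", "uri": "post-4", "has_images": False,
     "reply_to": "post-2",
     "predicted_language": "eng_Latn", "language_confidence": 0.91},
]


def filter_by_language(posts, language, min_confidence=0.8):
    """Keep posts whose predicted language matches and whose confidence
    is at least min_confidence (the threshold is an arbitrary choice)."""
    return [
        post for post in posts
        if post["predicted_language"] == language
        and post["language_confidence"] >= min_confidence
    ]


def build_reply_index(posts):
    """Map each parent URI to the URIs of its direct replies.
    Posts with reply_to set to None are treated as top-level posts."""
    replies = defaultdict(list)
    for post in posts:
        parent = post["reply_to"]
        if parent is not None:
            replies[parent].append(post["uri"])
    return dict(replies)


if __name__ == "__main__":
    english = filter_by_language(SAMPLE_POSTS, "eng_Latn")
    print([post["uri"] for post in english])
    print(build_reply_index(SAMPLE_POSTS))
```

Short posts tend to receive low confidence scores from language identifiers, so a threshold like this can drop a lot of very short text; it may be worth inspecting the distribution of `language_confidence` before picking a cutoff.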

## Bias, Risks, and Limitations

The goal of this dataset is for you to have fun :)