nyuuzyou (nyuuzyou)

posted an update 1 day ago

Post

801

I had nothing to do, so I fine-tuned Qwen2.5-0.5B on the alpindale/two-million-bluesky-posts dataset.

Instruct and base models:
nyuuzyou/Qwen2.5-0.5B-Bluesky-Instruct
nyuuzyou/Qwen2.5-0.5B-Bluesky
Use cases? There really aren't any. Have fun!
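
If you want to poke at it, here is a minimal sketch, assuming the `transformers` library is installed and that the checkpoint works with the standard text-generation pipeline. The `generate_post` helper and its defaults are made up for illustration and are not part of the model card.

```python
MODEL_ID = "nyuuzyou/Qwen2.5-0.5B-Bluesky"  # base model named in the post


def generate_post(prompt: str, max_new_tokens: int = 60) -> str:
    """Continue `prompt` with the fine-tuned model and return the full text.

    Hypothetical helper: the name and default length are illustrative only.
    """
    # Imported lazily so this file still loads where transformers is absent.
    from transformers import pipeline

    # Downloads the checkpoint from the Hub on first use.
    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

Calling `generate_post("Good morning, Bluesky!")` should return the prompt followed by a sampled continuation; since sampling is on, the output differs on every run.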

posted an update 3 days ago

Post

429

The word of the day is definitely "bluesky".

P.S. Consider contributing something to https://huggingface.co/bluesky-community

Reacted to davanstrien's post with ❤️ 3 days ago

Post

2210

First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

Excited to see people build more open tools for a more open social media platform!

posted an update 4 days ago

Post

902

Hugging Face recently added Bluesky to profile links, which is cool. It would be great to also support links to alternative Git services like Codeberg, GitLab, and Gitea. Many developers use platforms beyond GitHub, and showcasing repositories from these sites would be a great feature.

replied to LukeNeumann's post 4 days ago

I've published almost 70 datasets, and from what I've seen, a combination of downloads and likes seems to be the way to go. My dataset nyuuzyou/subdomains has a few likes, but at its peak it had over 4,000 downloads in a month, and it never showed up in trending at all.

replied to davanstrien's post 4 days ago

I really like that people on Hugging Face are so interested in Bluesky 🦋

Reacted to davanstrien's post with 🔥 4 days ago

Post

1319

The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
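
The per-second counting behind a dashboard like this can be sketched in a few lines. This is only an illustration, assuming each post arrives with an ISO 8601 creation timestamp; the `posts_per_second` function and the sample timestamps are hypothetical and not taken from the actual Space.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_second(timestamps):
    """Count posts per whole UTC second from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # Older Python versions cannot parse a trailing "Z", so normalise it.
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        # Drop sub-second precision so posts in the same second share a key.
        bucket = dt.astimezone(timezone.utc).replace(microsecond=0)
        counts[bucket] += 1
    # Return the buckets in chronological order.
    return dict(sorted(counts.items()))


# Made-up sample data standing in for events read from the firehose.
sample = [
    "2024-11-27T12:00:00.120Z",
    "2024-11-27T12:00:00.870Z",
    "2024-11-27T12:00:01.050Z",
]
for second, count in posts_per_second(sample).items():
    print(second.isoformat(), count)
```

A live version would feed timestamps from the firehose stream into the same counter instead of a fixed list.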

1 reply

Reacted to prithivMLmods's post with ❤️ 4 days ago

Post

2972

HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

🥠The one thing that needs to be remembered is the 'username'.

🥠And yeah, thank you, @maxiw, for creating the awesome dataset and sharing it here! 🙌

🥠[ Dataset ] : maxiw/hf-posts

.
.
.
@prithivMLmods

Reacted to fdaudens's post with ❤️ 6 days ago

Post

1861

🦋 Hug the butterfly! You can now add your Bluesky handle to your Hugging Face profile! ✨

Reacted to AkimfromParis's post with ❤️ 10 days ago

Post

1413

🇯🇵 The Open Japanese LLM Leaderboard created by LLM-jp 🌸 in partnership with HuggingFace 🤗 was released today!

Blog: https://huggingface.co/blog/leaderboard-japanese
Space: llm-jp/open-japanese-llm-leaderboard

🌍 The leaderboard is available in both Japanese and English
📚 Based on the evaluation tool llm-jp-eval, with more than 20 datasets for Japanese LLMs
📊 The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
💻 For the comfort of users, we chose a horizontal UI, and implemented it in a light and dark theme on Gradio
🔬 The radar chart provides a very interesting visualization of metrics!
🌱 We are using the Japanese research platform, MDX, so please be patient!
⚡ LLMs larger than 70B will be evaluated soon…

How do you say “GPUs Go Brrr” in Japanese? -> GPUがブンブン～! (Pronounced "GPU ga bunbun!") 🔥

4 replies

posted an update 10 days ago

Post

304

🎵 Introducing Tamago Music Dataset - nyuuzyou/tamago

A collection of 1,567 music tracks featuring:

- Complete metadata with audio files and cover artwork
- Rich track information including titles, descriptions, and genres
- User engagement metrics like play counts and reactions
- English language content from independent artists
- Released under Creative Commons Zero (CC0) license

Dataset structure includes:
- Track metadata (titles, descriptions, genres, tags)
- Associated media (audio files, cover images)
- Artist information and engagement metrics

Particularly valuable for:
- Music generation model training
- Cross-modal analysis
- Audio classification tasks
- Music style and genre analysis

replied to their post 11 days ago

Thanks! I license almost all of my datasets under CC0, with different modalities and tasks. Maybe somebody can find something else interesting for them in my profile 😉

posted an update 12 days ago

Post

946

🖼️ Introducing Public Domain Pictures Dataset - nyuuzyou/publicdomainpictures

Dataset highlights:
- 644,412 public domain images with comprehensive metadata from publicdomainpictures.net
- English language metadata including titles, descriptions, and keywords
- Each entry contains rich metadata including:
  - Unique image ID and full-size image URLs
  - Detailed titles and descriptions
  - Keyword/tag collections
  - Creator attribution
- Released to the public domain under Creative Commons Zero (CC0) license

2 replies

posted an update 19 days ago

Post

2156

🎵 Introducing Suno Music Generation Dataset - nyuuzyou/suno

Dataset highlights:

- 659,788 AI-generated music samples with comprehensive metadata from suno.com
- Multilingual content with English as primary language, including Japanese and other languages
- Each entry contains rich metadata including:
  - Unique song ID, audio/video URLs, and thumbnail images
  - AI model version and generation parameters
  - Song metadata (tags, prompts, duration)
  - Creator information and engagement metrics
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset structure includes detailed information about each generated piece, from technical parameters to user engagement metrics, making it particularly valuable for:
- Music generation model training
- Cross-modal analysis (text-to-audio relationships)
- User engagement studies
- Audio classification tasks
- Music style and genre analysis

posted an update 26 days ago

Post

1423

🎓 Introducing Kompy.info Uzbek Educational Dataset - nyuuzyou/kompy

Dataset highlights:
- 584,648 pages of educational content extracted from kompy.info, a comprehensive educational resource website
- Content exclusively in the Uzbek language, focusing on technical and scientific topics
- Each entry contains: URL, page title, and extracted main text content
- Data extracted using the trafilatura HTML extraction tool
- Covers a wide range of academic and educational materials
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset presents a valuable resource for natural language processing tasks in the Uzbek language, particularly in educational and technical domains. It can be used for text classification, topic modeling, and content analysis of educational materials. The large-scale collection of Uzbek-language academic content makes it especially useful for developing educational technology applications and studying pedagogical approaches in Uzbek-language instruction. The dataset's monolingual nature provides a focused corpus for understanding technical and scientific terminology in Uzbek educational contexts.

Reacted to m-ric's post with 🔥 28 days ago

Post

2367

> Oasis: First Real-Time Video Game Without a Game Engine! 🎮

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

⚡️ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of device I had to play LoL back in the day, thus my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

⚙️ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
⚡️ Generates 20 FPS, vs 0.2 FPS for other DiT-based video models
‣ Sohu, the specialized hardware developed by Etched, can handle 10x more players than an H100

🎮 Features real game mechanics
‣ Movement, jumping, item management
‣ Physics and lighting
‣ Procedurally generated worlds

⚠️ Current limitations
‣ Blurry graphics at a distance
‣ Objects sometimes change appearance
‣ Memory issues in long sessions

Try it yourself, the playable demo is impressive! 👉 https://oasis.decart.ai/welcome
Code 👉 https://github.com/etched-ai/open-oasis
Read it in full 👉 https://oasis-model.github.io/

Reacted to Muhammadreza's post with ❤️ 28 days ago

Post

2580

Hey guys.
This is my first post here on huggingface. I'm glad to be a part of this amazing community!

2 replies

posted an update about 1 month ago

Post

2741

🎓 Introducing PPT4Web Educational Materials Dataset - nyuuzyou/ppt4web

Dataset highlights:
- 182,405 presentations from ppt4web.ru, a platform for storing and viewing presentations covering a wide range of educational materials
- Primarily in Russian, with content in English, Kazakh, Ukrainian, and Belarusian
- Each entry includes: URL, title, download URL, and filepath
- Contains original PPTX files (converted from PPT for consistency) in addition to metadata
- Data covers a broad spectrum of educational topics and subjects
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in multiple languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings, providing insights into the diverse range of subjects and teaching approaches.

posted an update about 1 month ago

Post

1397

🌐 Introducing Websim.ai User Projects Dataset - nyuuzyou/websim

Dataset highlights:
- 137,452 user projects from Websim.ai, a service for creating small sites using Large Language Models (LLMs)
- Primarily in English, with potential for multilingual content in generated websites
- Each entry includes: project metadata, user information, and generated HTML content
- Contains detailed information about project revisions, site generation, and user interactions
- Data covers a wide range of user-generated website projects created through AI assistance
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing AI-assisted web development trends, studying user behavior in LLM-powered creative tools, and exploring the capabilities of language models in web design.

posted an update about 1 month ago

Post

428

🎓 Introducing Ukr-lit.com.ua Presentations Dataset - nyuuzyou/ukr-lit

Dataset highlights:
- 18,001 presentations from ukr-lit.com.ua, a platform for storing and viewing presentations covering a wide range of subjects in Ukrainian school education
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a broad spectrum of educational topics and subjects taught in Ukrainian schools
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in Ukrainian school education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings, providing insights into the diverse range of subjects and teaching approaches in the Ukrainian school system.
