Capybara and SystemChat-1.1 Preferences with SOTA LLMs - a distilabel-internal-testing Collection

distilabel-internal-testing 's Collections

Capybara and SystemChat-1.1 Preferences with SOTA LLMs

updated May 2

This collection contains the preference versions of both `LDJnr/Capybara` and `abacusai/SystemChat-1.1`, in collaboration with Hugging Face and LDJnr

Upvote

LDJnr/Capybara

Viewer • Updated Jun 7 • 16k • 383 • 226

Note Starting `LDJnr/Capybara` dataset
abacusai/SystemChat-1.1

Viewer • Updated Apr 11 • 20.2k • 81 • 29

Note Starting `abacusai/SystemChat-1.1` dataset
distilabel-internal-testing/Capybara-and-SystemChat-1.1

Viewer • Updated Apr 18 • 36.2k • 35

Note Dataset that combines both `LDJnr/Capybara` and `abacusai/SystemChat-1.1` but sharing the same format for the conversations (OpenAI-style), and defining the same columns while keeping source for Capybara, and adding `dataset` as the identifier of the origin dataset
distilabel-internal-testing/Capybara-and-SystemChat-1.1-Text

Viewer • Updated Apr 18 • 36.2k • 38

Note Adds a new column on top of `distilabel-internal-testing/Capybara-and-SystemChat-1.1` which is `text` and contains the values for the column `messages` with the chat template applied using the ChatML format
distilabel-internal-testing/Capybara-and-SystemChat-1.1-MinHash

Viewer • Updated Apr 18 • 35.6k • 35

Note Runs MinHash deduplication (threshold=0.95) on top of `distilabel-internal-testing/Capybara-and-SystemChat-1.1-Text` to remove 588 near duplicates from the dataset, before starting off with the generation
distilabel-internal-testing/Capybara-and-SystemChat-1.1-Filtered

Viewer • Updated Apr 18 • 35.2k • 34

Note Runs URL filtering on the assistant responses, and also filters out the instances with ChatGPT-ish terms, as @LDJnr kindly provided a list of common ChatGPT-like terms that tend to appear within the generated responses that we want to avoid; on top of `distilabel-internal-testing/Capybara-and-SystemChat-1.1-MinHash`

Upvote