hexgrad/Kokoro-82M · [DATA] Synthetic Data Trade Offer

Pinned discussion for parties interested in the trade offer proposed here: https://hf.co/posts/hexgrad/418806998707773

Discord is the best place to discuss this (https://discord.gg/QuGxSWBfQy), but for those unable or unwilling to use Discord, this thread is the next best option.

I am seeking synthetic audio (multiple speakers & languages) from the following:

OpenAI: GPT-4o AVM, Realtime API, HD TTS
Gemini 2.0 Flash: Native Audio
ElevenLabs: Full, not Flash or Turbo

To qualify for Voicepack(s):

I need to approve both the audio & text data before it enters the training mix. I may refuse some or all of your data for quality reasons. Please describe the quantity/quality/taxonomy of your data first. If that clears, send some samples. Only if those also check out should you send the next or whole payload.
At least 1 approved hour per voice/tone. If you have speaker X whispering for 40 minutes and another 20 minutes of them shouting, that is not enough, because neither tone reaches an hour on its own. This threshold may go up later, but a previously approved contribution will still get Voicepacks delivered in return.
Audio should be clean with minimal artifacts. Text labels are expected to be perfect or near perfect, since you are generating the audio over an API and should know what text you put in.
Each text should be aligned to its corresponding audio segment as obtained over the API. Don't concatenate the segments into a giant multi-hour file and then dump the entire transcript.
You send the data directly to me under an Apache license. Likewise, you will directly receive a corresponding Apache-licensed Voicepack in return, if/when the model finishes training.
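
For contributors who want to sanity-check a dataset before describing it, the segment-level alignment and one-hour-per-voice/tone requirements above could be checked with something like the sketch below. The manifest layout here (one record per API call, with `voice`, `tone`, `audio_path`, `text`, and `duration_s` fields) is purely an assumption for illustration, not a format this post requires, and the 3600-second threshold simply mirrors the current "1 approved hour" rule, which may change.

```python
from collections import defaultdict
from dataclasses import dataclass

# Mirrors "at least 1 approved hour per voice/tone"; the post says this may go up later.
MIN_SECONDS_PER_VOICE_TONE = 3600


@dataclass(frozen=True)
class Segment:
    """One audio clip from one API call, paired with the exact text sent.

    Field names are hypothetical; any equivalent per-segment labeling works.
    """
    voice: str
    tone: str
    audio_path: str
    text: str
    duration_s: float


def total_seconds_by_voice_tone(segments):
    """Sum clip durations separately for each (voice, tone) pair."""
    totals = defaultdict(float)
    for seg in segments:
        totals[(seg.voice, seg.tone)] += seg.duration_s
    return dict(totals)


def find_problems(segments, min_seconds=MIN_SECONDS_PER_VOICE_TONE):
    """Return human-readable issues: unlabeled clips and short voice/tone groups."""
    problems = []
    for seg in segments:
        if not seg.text.strip():
            problems.append(f"{seg.audio_path}: missing text label")
        if seg.duration_s <= 0:
            problems.append(f"{seg.audio_path}: non-positive duration")
    totals = total_seconds_by_voice_tone(segments)
    for (voice, tone), total in sorted(totals.items()):
        if total < min_seconds:
            problems.append(
                f"{voice}/{tone}: {total / 60:.0f} min is below "
                f"the {min_seconds / 60:.0f} min threshold"
            )
    return problems


# The example from the post: speaker X with 40 min of whispering and
# 20 min of shouting. Each tone is counted on its own, so both fall short.
example = [
    Segment("X", "whisper", "x_whisper_0001.wav", "Keep your voice down.", 2400.0),
    Segment("X", "shout", "x_shout_0001.wav", "Watch out!", 1200.0),
]

for problem in find_problems(example):
    print(problem)
```

Keeping one record per API call is what makes the text labels trustworthy: the label is the text that was sent, and nothing has to be re-aligned after the fact.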

Other providers / small quantity / lower quality / unlabeled / unsegmented data can still be contributed, but the likelihood of inclusion in the training mix plummets, and I cannot promise any delivered voicepacks resulting from those.

In addition to the above, here is a disclaimer (adapted from the OpenAI investment disclaimer):

IMPORTANT
Contributing data to the Kokoro training mix is not a guaranteed investment
Contributors could deliver their data and not see any return
It would be wise to view any contribution in the spirit of a donation, with the understanding that there are risk factors that would delay or prevent the return of a trained voicepack, including but not limited to:

GPU access could be disrupted

The model could fail to converge

The model trainer could be given the Boeing whistleblower treatment

Notwithstanding the above, the model trainer will, to the best of his ability, deliver the promised artifacts.

Potential contributors should understand that they are free to pursue the following options instead:

Keep the data to themselves

Train their own models

Continue using vendors

This post may be edited later, but approved contributions will stand unless otherwise notified.