[FAQ] Alternatives to Finetuning Kokoro

#19
by hexgrad - opened

A very frequently asked question is how to finetune Kokoro, i.e. continue training the checkpoint uploaded in this repository. This is currently not feasible: for a number of reasons, the additional models required to facilitate this have not yet been open-sourced.

However, there are a few alternatives available to you in the meantime (staying in the realm of open source speech models). Please be aware that there are varying degrees of difficulty to each option, and some could be unsuitable depending on your technical abilities and/or requirements.

1. Use a Speech-to-Speech model like RVC

First, generate the speech using Kokoro. Then pipe the TTS output into RVC-Project/Retrieval-based-Voice-Conversion-WebUI or a similar speech-to-speech model (Beatrice v2, see also w-okada/voice-changer).

Pros: There are many pre-trained RVC models readily available (search "rvc models"), and you can also train your own RVC model.
Cons: You have to run a separate model after TTS, which impacts latency and increases inference-time compute footprint. Results will also probably fall short of a true base model finetune.

2. Train your own StyleTTS 2 Model

Kokoro v0.19 was trained on relatively little data and transparently uses a StyleTTS 2 architecture, so if you have proficiency in training models and the required compute, you can train your own from the public checkpoints. Here are some resources:

It is also possible to train StyleTTS 2 models in other languages, although this is can be more difficult than English for tokenization & g2p and/or data procurement:

Pros: Full customizability and ownership over your trained model.
Cons: Requires compute, data, and technical skills.

3. Zero-shot or train a different TTS architecture

In no particular order, here are some links to other open-source TTS models (although not all are permissive):

Pros: Training may not be required, or in some cases if the base models were trained on more data, less training may be required to finetune.
Cons: Licenses, parameter counts, and resulting output quality vary.

“Hi, the styleTTS requires diffusion, while Kokoros does not. Does this indicate that there are some differences in their architectures?”

Sign up or log in to comment