Nice ~90x real-time generation on a 3090 Ti. Quickstart provided.
I first tried an ONNX implementation, but the PyTorch implementation seems much faster for my homelab setup.
kokoro-tts pytorch quickstart
Here is how I got the PyTorch implementation running on CUDA for benchmarking and testing against this particular ONNX implementation repo.
```bash
# grab hf repo code but not the large files (or use git lfs or `huggingface-cli` etc)
git clone https://huggingface.co/hexgrad/Kokoro-82M
cd Kokoro-82M

# put the model and at least one voice file into place manually, overwriting the LFS placeholders
wget -O kokoro-v0_19.pth 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/kokoro-v0_19.pth?download=true'
wget -O voices/af_sky.pt 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sky.pt?download=true'
```
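If you'd rather not hand-build the wget URLs, the `huggingface_hub` Python API can fetch the same two files. A minimal sketch, assuming `huggingface_hub` is installed (e.g. `pip install huggingface_hub`):

```python
# fetch the model weights and one voicepack without needing git-lfs
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="hexgrad/Kokoro-82M", filename="kokoro-v0_19.pth", local_dir=".")
hf_hub_download(repo_id="hexgrad/Kokoro-82M", filename="voices/af_sky.pt", local_dir=".")
```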
```bash
# setup venv
python -m venv ./venv
source ./venv/bin/activate

# install deps (can use `uv pip` instead)
pip install phonemizer torch transformers scipy munch soundfile
```
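Before benchmarking, it's worth a quick sanity check that the torch build in this venv actually sees the GPU (plain PyTorch calls, nothing Kokoro-specific):

```python
# confirm a CUDA-enabled torch is installed and the GPU is visible
import torch

print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```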
```bash
# install OS level required binaries
# on debian / ubuntu flavors
sudo apt-get install espeak-ng
# or on ARCH btw...
sudo pacman -Sy extra/espeak-ng

# confirm it is working and in path
espeak-ng --version
# eSpeak NG text-to-speech: 1.52.0  Data at: /usr/share/espeak-ng-data
```
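Optionally, also confirm that phonemizer can drive the espeak-ng backend from inside the venv; a small check (the exact IPA output may differ by version):

```python
# sanity check that phonemizer finds the espeak-ng backend
from phonemizer import phonemize

print(phonemize("hello world", language="en-us", backend="espeak"))
# expect an IPA-ish string, something like: həlˈoʊ wˈɜːld
```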
```bash
# now run the main.py example like so and note the "real" time (wall-clock time)
time python main.py
```
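The ~90x figure above is just seconds of generated audio divided by that wall-clock "real" time. A quick way to compute it once `main.py` (shown below) has written `output.wav`; the wall time here is a placeholder to replace with your own measurement:

```python
# rough real-time factor: generated audio seconds per wall-clock second
import soundfile as sf

audio, sample_rate = sf.read("output.wav")
audio_seconds = len(audio) / sample_rate
wall_seconds = 4.2  # placeholder: use the "real" value reported by `time`
print(f"{audio_seconds:.1f}s of audio -> ~{audio_seconds / wall_seconds:.0f}x real-time")
```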
Here are the contents of the `main.py` example file, including naive chunking of the input text on `.` punctuation. A better chunking implementation is needed to avoid the `Truncated to 510 tokens` error; a sketch of one follows after the script.
```python
from models import build_model
import torch
import soundfile as sf
from kokoro import generate

SAMPLE_RATE = 24000
OUTPUT_FILE = "output.wav"
TEXT = """
Input a long text here. As long as it has an occasional period.
Then it won't overflow and truncate.
You can do better chunking than this with a little effort.
But this is enough to see how fast it can go!
Are there parallel batching options if you have enough VRAM? Or max tokens options?
I haven't measured latency of time to first generation or tried keeping the model loaded.
"""

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on device: {device}")

MODEL = build_model("kokoro-v0_19.pth", device)
VOICE_NAME = "af_sky"
VOICEPACK = torch.load(f"voices/{VOICE_NAME}.pt", weights_only=True).to(device)
print(f"Loaded voice: {VOICE_NAME}")

audio = []
for chunk in TEXT.split("."):
    print(chunk)
    if len(chunk) < 2:
        # a try/except block for non-verbalizable text is probably better than this hack
        continue
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)

sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)
```
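As noted above, splitting on `.` alone is naive: it drops the punctuation and does nothing to keep a chunk under the model's token limit. Here is a sketch of a slightly better chunker that packs whole sentences and caps chunk length, using a character cap as a rough proxy for the ~510-token limit (the cap and regex are my assumptions, not anything from the Kokoro repo):

```python
import re

def chunk_text(text, max_chars=400):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Characters are only a proxy for the model's ~510 phoneme-token limit,
    so leave plenty of headroom.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# drop-in replacement for the TEXT.split(".") loop above
audio = []
for chunk in chunk_text(TEXT):
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)
sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)
```

Keeping the sentence-ending punctuation also gives the model more natural pause cues than the bare fragments the `split(".")` version passes in.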
@ubergarm Fantastic stuff.
That reminds me, my Reddit account https://www.reddit.com/user/rzvzn/ has been shadowbanned from r/LocalLLaMA for at least a month now. I have messaged the moderators, but it's been crickets so far. If I need more karma in order to post, then how am I supposed to obtain karma without being able to post?
Naturally, you'd assume that the shadowban is a result of doing really sus things or self-promoting, but 1) I do not think I've been doing such things and 2) All my posts and comments evaporate immediately, regardless of their content.
To the moderators of r/LocalLLaMA: If you could turn off friendly fire and take me out of the sunken place, I'd really appreciate it.
I have also seen posts & comments by others get shadowbanned for the mere mention of rzvzn, which is also my Discord handle. One guy even had a post—which was gaining a good number of upvotes and views—get instantly removed by moderation because he edited it to tag me. Is this moderation run by Reddit, LocalLLaMA, or both? I'm obviously biased, but I think whoever is running moderation (including auto-mod) needs to seriously reevaluate what's going on over there.