Nice ~90x real-time generation on 3090TI. Quickstart provided.

#20 opened by ubergarm

I first tried an ONNX implementation, but the PyTorch implementation seems much faster for my homelab setup.

kokoro-tts pytorch quickstart

Here is how I got the PyTorch implementation running on CUDA to benchmark and test against this particular ONNX implementation repo.

# grab hf repo code but not large files (or use git lfs or `huggingface-cli` etc)
git clone https://huggingface.co/hexgrad/Kokoro-82M
cd Kokoro-82M

# put the model and at least one voice file into place, manually overwriting the LFS placeholders
wget -O kokoro-v0_19.pth 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/kokoro-v0_19.pth?download=true'
wget -O voices/af_sky.pt 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sky.pt?download=true'
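
If you'd rather stay in Python, the same two files can be fetched with the huggingface_hub library instead of wget. This is just a sketch and assumes huggingface_hub is available (it gets pulled in as a dependency of transformers below):

from huggingface_hub import hf_hub_download

# download the model weights and one voice pack into the current directory
hf_hub_download(repo_id="hexgrad/Kokoro-82M", filename="kokoro-v0_19.pth", local_dir=".")
hf_hub_download(repo_id="hexgrad/Kokoro-82M", filename="voices/af_sky.pt", local_dir=".")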

# setup venv
python -m venv ./venv
source ./venv/bin/activate

# install deps (can use `uv pip` instead)
pip install phonemizer torch transformers scipy munch soundfile

# install OS level required binaries
# on debian / ubuntu flavors
sudo apt-get install espeak-ng
# or on ARCH btw...
sudo pacman -Sy extra/espeak-ng
# confirm it is working and in path
espeak-ng --version
eSpeak NG text-to-speech: 1.52.0  Data at: /usr/share/espeak-ng-data

# now run the main.py example like so and note the "real" time (wall-clock time)
time python main.py
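
To turn that wall-clock time into a real-time factor (the ~90x figure in the title comes from this kind of back-of-the-envelope math), divide the duration of the generated audio by the measured time. A minimal sketch, with the wall-clock seconds filled in by hand from the `time` output:

import soundfile as sf

# duration of the generated audio in seconds
info = sf.info("output.wav")
audio_seconds = info.frames / info.samplerate

# wall-clock seconds from `time python main.py` (replace with your own measurement)
wall_seconds = 5.0

print(f"real-time factor: {audio_seconds / wall_seconds:.1f}x")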

Here are the contents of the main.py example file, including naive chunking of the input text on `.` punctuation. A better chunking implementation is needed to avoid the "Truncated to 510 tokens" error; a sketch of one follows the listing.

from models import build_model
import torch
import soundfile as sf
from kokoro import generate

SAMPLE_RATE = 24000
OUTPUT_FILE = "output.wav"

TEXT = """
Input a long text here. As long as it has an occasional period.
Then it won't overflow and truncate.
You can do better chunking than this with a little effort.
But this is enough to see how fast it can go!

Are there parallel batching options if you have enough VRAM? Or max tokens options?
I haven't measured latency of time to first generation or tried keeping the model loaded.
"""

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Runnin on device: {device}")
MODEL = build_model("kokoro-v0_19.pth", device)
VOICE_NAME = "af_sky"
VOICEPACK = torch.load(f"voices/{VOICE_NAME}.pt", weights_only=True).to(device)
print(f"Loaded voice: {VOICE_NAME}")

audio = []
for chunk in TEXT.split("."):
    print(chunk)
    if len(chunk) < 2:
        # skip empty/whitespace-only chunks; wrapping generate() in try/except
        # for non-verbalizable text would be more robust than this hack
        continue
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)

sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)
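
One way to do smarter chunking is to split on sentence-ending punctuation and then greedily pack sentences into chunks under a character budget, so the phonemized input stays below the 510-token limit. This is an untested sketch, not something from the repo: chunk_text is a hypothetical helper and the 400-character budget is a guess.

import re

def chunk_text(text, max_chars=400):
    # split on sentence-ending punctuation, keeping the punctuation attached
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # start a new chunk if adding this sentence would blow the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# drop-in replacement for the TEXT.split(".") loop above
audio = []
for chunk in chunk_text(TEXT):
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)
sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)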


@ubergarm Fantastic stuff.

That reminds me, my Reddit account https://www.reddit.com/user/rzvzn/ has been shadowbanned from r/LocalLLaMA for at least a month now. I have messaged the moderators, but it's been crickets so far. If I need more karma in order to post, then how am I supposed to obtain karma without being able to post?

Naturally, you'd assume that the shadowban is a result of doing really sus things or self-promoting, but 1) I do not think I've been doing such things, and 2) all my posts and comments evaporate immediately, regardless of their content.

To the moderators of r/LocalLLaMA: If you could turn off friendly fire and take me out of the sunken place, I'd really appreciate it.

I have also seen posts & comments by others get shadowbanned by the mere mention of rzvzn, which is also my Discord handle. One guy even had a post, which was gaining a good number of upvotes and views, get instantly removed by moderation because he edited it to tag me. Is this moderation run by Reddit, LocalLLaMA, or both? I'm obviously biased, but I think whoever is running moderation (including auto-mod) needs to seriously reevaluate what's going on over there.
