Nice ~90x real-time generation on 3090TI. Quickstart provided.

#20 opened by ubergarm

I first tried an ONNX implementation, but the PyTorch implementation seems much faster for my homelab setup.

kokoro-tts pytorch quickstart

Here is how I got the PyTorch implementation running on CUDA to benchmark and test against this particular ONNX implementation repo.

# grab hf repo code but not large files (or use git lfs or `huggingface-cli` etc)
git clone https://huggingface.co/hexgrad/Kokoro-82M
cd Kokoro-82M

# put the model and at least one voice file into place, manually overwriting the LFS placeholders
wget -O kokoro-v0_19.pth 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/kokoro-v0_19.pth?download=true'
wget -O voices/af_sky.pt 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sky.pt?download=true'
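
If you'd rather stay in Python, the same two files can be fetched with the huggingface_hub library instead of wget. This is just a sketch and assumes huggingface_hub is available (it gets pulled in as a dependency of transformers below):

from huggingface_hub import hf_hub_download

# download the model weights and one voice pack into the current directory
hf_hub_download(repo_id="hexgrad/Kokoro-82M", filename="kokoro-v0_19.pth", local_dir=".")
hf_hub_download(repo_id="hexgrad/Kokoro-82M", filename="voices/af_sky.pt", local_dir=".")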

# setup venv
python -m venv ./venv
source ./venv/bin/activate

# install deps (can use `uv pip` instead)
pip install phonemizer torch transformers scipy munch soundfile

# install OS level required binaries
# on debian / ubuntu flavors
sudo apt-get install espeak-ng
# or on ARCH btw...
sudo pacman -Sy extra/espeak-ng
# confirm it is working and in path
espeak-ng --version
eSpeak NG text-to-speech: 1.52.0  Data at: /usr/share/espeak-ng-data

# now run the main.py example like so and note the "real" time (wall-clock time)
time python main.py
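
To turn that wall-clock time into a real-time factor (the ~90x figure in the title comes from this kind of back-of-the-envelope math), divide the duration of the generated audio by the measured time. A minimal sketch, with the wall-clock seconds filled in by hand from the `time` output:

import soundfile as sf

# duration of the generated audio in seconds
info = sf.info("output.wav")
audio_seconds = info.frames / info.samplerate

# wall-clock seconds from `time python main.py` (replace with your own measurement)
wall_seconds = 5.0

print(f"real-time factor: {audio_seconds / wall_seconds:.1f}x")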

Here are the contents of the main.py example file, including naive chunking of the input text on `.` punctuation. A better chunking implementation is needed to avoid the "Truncated to 510 tokens" error; a sketch of one follows the listing.

from models import build_model
import torch
import soundfile as sf
from kokoro import generate

SAMPLE_RATE = 24000
OUTPUT_FILE = "output.wav"

TEXT = """
Input a long text here. As long as it has an occasional period.
Then it won't overflow and truncate.
You can do better chunking than this with a little effort.
But this is enough to see how fast it can go!

Are there parallel batching options if you have enough VRAM? Or max tokens options?
I haven't measured latency of time to first generation or tried keeping the model loaded.
"""

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Runnin on device: {device}")
MODEL = build_model("kokoro-v0_19.pth", device)
VOICE_NAME = "af_sky"
VOICEPACK = torch.load(f"voices/{VOICE_NAME}.pt", weights_only=True).to(device)
print(f"Loaded voice: {VOICE_NAME}")

audio = []
for chunk in TEXT.split("."):
    print(chunk)
    if len(chunk) < 2:
        # skip empty/whitespace-only chunks; wrapping generate() in try/except
        # for non-verbalizable text would be more robust than this hack
        continue
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)

sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)
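
One way to do smarter chunking is to split on sentence-ending punctuation and then greedily pack sentences into chunks under a character budget, so the phonemized input stays below the 510-token limit. This is an untested sketch, not something from the repo: chunk_text is a hypothetical helper and the 400-character budget is a guess.

import re

def chunk_text(text, max_chars=400):
    # split on sentence-ending punctuation, keeping the punctuation attached
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # start a new chunk if adding this sentence would blow the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# drop-in replacement for the TEXT.split(".") loop above
audio = []
for chunk in chunk_text(TEXT):
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)
sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)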


@ubergarm Fantastic stuff.

That reminds me, my Reddit account https://www.reddit.com/user/rzvzn/ has been shadowbanned from r/LocalLLaMA for at least a month now. I have messaged the moderators, but it's been crickets so far. If I need more karma in order to post, then how am I supposed to obtain karma without being able to post?

Naturally, you'd assume that the shadowban is a result of doing really sus things or self-promoting, but 1) I do not think I've been doing such things, and 2) all my posts and comments evaporate immediately, regardless of their content.

To the moderators of r/LocalLLaMA: If you could turn off friendly fire and take me out of the sunken place, I'd really appreciate it.

I have also seen posts & comments by others get shadowbanned by the mere mention of rzvzn, which is also my Discord handle. One guy even had a post, which was gaining a good number of upvotes and views, get instantly removed by moderation because he edited it to tag me. Is this moderation run by Reddit, LocalLLaMA, or both? I'm obviously biased, but I think whoever is running moderation (including auto-mod) needs to seriously reevaluate what's going on over there.
