hexgrad posted an update 2 days ago
self.brag(): Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago.
Setting aside the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top-3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena:
  - MeloTTS has 52M params
  - F5 TTS has 330M params
  - XTTSv2 has 467M params

The original Arena's threshold is 700 votes, but I am sure Kokoro will hold its position. The voice quality actually sounds close to ElevenLabs.

But StyleTTS is usually not very emotional, so it will fail where Edge TTS does: on phrases where the voice has to sound sad or angry. Parler Expresso, for example, was overly jolly.

·

The voice quality actually sounds close to ElevenLabs.

I might've mentioned this elsewhere, but if you plug Kokoro outputs for named ElevenLabs voices into https://elevenlabs.io/ai-speech-classifier, you should get very reliable positives (98% confidence that the audio was generated by ElevenLabs).

By ear, I think Kokoro is indeed close to ElevenLabs, especially on certain voices. For Nicole, they are indistinguishable to me. Michael is pretty close; Adam is still somewhat weak.

But StyleTTS usually is not very emotional.

I agree. Kokoro also has 2 specific issues in this area: (1) little to no emotional audio seen during training, and (2) even if there was, the stock voices are average style vectors over 10-100 samples, creating an average/neutral style anyway.
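To illustrate point (2), here is a minimal Python sketch, assuming each reference sample yields one style vector and that a stock voice is simply the element-wise mean of those vectors. The 128-dimensional size, the Gaussian deviations around a neutral origin, and every function name are hypothetical choices for illustration, not Kokoro's actual voicepack format or code. It shows two effects: random per-sample expressiveness largely cancels out in the mean, and a single emotional sample is diluted by the number of samples it is averaged with.

```python
import random

DIM = 128  # hypothetical style-vector size, chosen for illustration only


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def norm(v):
    """Euclidean length of a vector (distance from the neutral origin)."""
    return sum(x * x for x in v) ** 0.5


def sample_style(rng, spread=1.0):
    """Hypothetical per-utterance style: the neutral point (origin) plus an
    independent random deviation in every dimension."""
    return [rng.gauss(0.0, spread) for _ in range(DIM)]


def voice_from_samples(n, seed=0):
    """Build an 'averaged' voice from n hypothetical reference samples."""
    rng = random.Random(seed)
    samples = [sample_style(rng) for _ in range(n)]
    return samples, mean_vector(samples)


def emotion_left_after_averaging(emotional, neutral_count):
    """How much of one emotional sample's offset (on a single hypothetical
    'emotion' axis) survives when averaged with perfectly neutral samples."""
    angry = [0.0] * DIM
    angry[0] = emotional
    neutral = [[0.0] * DIM for _ in range(neutral_count)]
    return mean_vector([angry] + neutral)[0]


samples, voice = voice_from_samples(50)
avg_single = sum(norm(s) for s in samples) / len(samples)
print(f"typical single-sample deviation: {avg_single:.2f}")
print(f"deviation of the averaged voice: {norm(voice):.2f}")
print(f"emotion left (1 angry + 9 neutral): {emotion_left_after_averaging(1.0, 9):.2f}")
```

With independent deviations, the mean's distance from neutral shrinks roughly by a factor of the square root of the sample count, which matches the intuition that a 10-100 sample average lands on a fairly neutral style.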

Kokoro TTS is quite possibly the best TTS model I've used thus far. I hope the weights are released soon!

·

Bro, can you suggest some lightweight models for natural TTS?