Respair/styletts2_Japanese · Fantastic project!

Apr 15

Fantastic project! It's the first StyleTTS2 model I've seen in another language, and it turned out very well. Could you let me know how many hours of data you trained this model on? Did you train from scratch or fine-tune? I am looking to train it for my language, but I'm unsure whether to start from the initial training stage or to fine-tune. I tried fine-tuning with 20 hours of data, but without success. Could you give me a basic example of training or, if possible, share the notebook you used for training, even if it's in Japanese? Thank you, and congratulations on your project!

Respair

Owner Apr 16

Hey, happy you liked it.
Roughly 20 hours. I trained it from scratch, both the PL-BERT (which is used as the text encoder) and the TTS itself.

You shouldn't fine-tune on a new language if your base model is English or anything unrelated to your task.

I don't mind sharing my scripts, but I don't think that's gonna help you: the pre-processing steps for Japanese are very different from those for other languages. If you speak one of the Indo-European languages (Spanish, French, etc.), then I highly recommend following the original repo; it's actually far easier to train on those than on something like Japanese.

Good luck!

traderpedroso

Apr 16

Thank you very much for your feedback. After your comment, I did some research, and indeed, fine-tuning for my language, which is Portuguese, is much simpler. I even found an already-trained PL-BERT. Currently, I have a customer service pipeline running with Whisper, Asterisk, and xTTSv2, plus a Mistral LLM pre-trained for my task. xTTSv2 hallucinates a lot and is quite slow at inference. When I came across StyleTTS, I was impressed to see your model running in Japanese, which is undoubtedly one of the most complex languages for TTS, and it greatly encouraged me.