Umamusume DeBERTA-VITS2 TTS
📅 2023.10.24 📅
- Updated the current generator to the 270K-step checkpoint
👌 Currently, ONLY Japanese is supported. 👌
💪 Based on Bert-VITS2, this work closely follows Akito/umamusume_bert_vits2, which provides the Japanese text preprocessor. ❤
Instructions for Use
✋ Please do NOT enter very long text in a single row; the model treats each row as one sentence for inference. Where it does not break the flow of meaning, split your input into multiple rows so that each row is inferenced separately, which also reduces inference time (see the sketch below). Please remove completely empty rows, as they produce strange sounds at the corresponding positions in the generated audio. ✋
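A minimal sketch of this kind of input preparation, splitting the text into non-empty rows before synthesis. This is illustrative only and not the Space's actual code.

```python
# Illustrative only: split multi-line input into rows and drop empty rows,
# so each row can be synthesized as one sentence.
text = """ウマ娘の皆さん、こんにちは。
今日もトレーニングを頑張りましょう。"""

rows = [line.strip() for line in text.splitlines() if line.strip()]
for row in rows:
    print(row)  # pass each non-empty row to the TTS separately
```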
👏 If generation fails with an error, please first check whether your input contains rare or unusual kanji; if so, replace them with Hiragana or Katakana (see the sketch below). 👏
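As an illustration only (not part of this Space's code), a small replacement table can pre-map troublesome kanji to kana before synthesis; the mapping below is a hypothetical example.

```python
# Hypothetical example: map hard-to-read kanji to kana readings before inference.
RARE_KANJI_TO_KANA = {
    "薔薇": "バラ",      # rose
    "憂鬱": "ゆううつ",  # melancholy
}

def replace_rare_kanji(text: str) -> str:
    for kanji, kana in RARE_KANJI_TO_KANA.items():
        text = text.replace(kanji, kana)
    return text

print(replace_rare_kanji("薔薇の花は綺麗ですね。"))  # -> "バラの花は綺麗ですね。"
```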
🎈 Please make good use of the magic of punctuation marks. 🎈
📚 Wondering what a character's Chinese name is? Please refer to the Umamusume Bilibili Wiki. 📚
Training Details - For those who may be interested
🎈 This work switches from cl-tohoku/bert-base-japanese-v3 to ku-nlp/deberta-v2-base-japanese, expecting potentially better performance (and just for fun). 🥰
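For reference, a minimal sketch of loading the swapped-in encoder with Hugging Face transformers to extract hidden states. The Space's actual feature-extraction code may differ, and the upstream model card recommends pre-segmenting text (e.g. with Juman++), which this sketch skips.

```python
# Minimal sketch: load ku-nlp/deberta-v2-base-japanese and extract hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ku-nlp/deberta-v2-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

with torch.no_grad():
    inputs = tokenizer("こんにちは、ウマ娘です。", return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
hidden_states = outputs.hidden_states  # tuple of per-layer features
```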
❤ Thanks to the SUSTech Center for Computational Science and Engineering. ❤ This model was trained on A100 (40 GB) × 2 with a total batch size of 32.
💪 This model has been trained for 3 cycles, 270K steps (= 180 epochs). 💪
📕 This work uses a linear LR schedule with warmup (7.5% of total steps) and max_lr=1e-4 (see the sketch below). 📕
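A minimal sketch of such a schedule using transformers' get_linear_schedule_with_warmup; the real training script may construct its scheduler differently, and the placeholder parameter and optimizer here are assumptions.

```python
# Minimal sketch: linear LR schedule with 7.5% warmup up to max_lr = 1e-4.
import torch
from transformers import get_linear_schedule_with_warmup

total_steps = 270_000
warmup_steps = int(0.075 * total_steps)  # 7.5% of total steps

params = [torch.nn.Parameter(torch.zeros(1))]   # placeholder parameters
optimizer = torch.optim.AdamW(params, lr=1e-4)  # peak LR reached after warmup
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

for _ in range(warmup_steps):
    scheduler.step()                    # LR ramps linearly during warmup...
print(optimizer.param_groups[0]["lr"])  # ...reaching ~1e-4 here, then decaying linearly to 0
```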
✂ This work clips gradient values to 10 (see the sketch below). ✂
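A minimal, self-contained PyTorch sketch of value-based gradient clipping at 10; the actual VITS-style training loop may use its own clipping helper, and the tiny model here is only a stand-in.

```python
# Minimal sketch: clip each gradient value to [-10, 10] before the optimizer step.
import torch

model = torch.nn.Linear(4, 1)                               # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=10)
optimizer.step()
```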
⚠ Fine-tuning the model on single-speaker datasets separately will generally reach better results than training on one huge dataset comprising many speakers: sharing the same model leads to unexpected mixing of the speakers' voice characteristics. ⚠