TTS: Giving Digital Humans Realistic Voice Interaction

Edge-TTS

Edge-TTS is a Python library that uses Microsoft Edge's online text-to-speech service to convert text to speech (TTS).

The library offers a simple API for converting text to speech and supports many languages and voices. Install it with pip:

pip install -U edge-tts

For more detailed usage, see https://github.com/rany2/edge-tts
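As a rough illustration of that API, the sketch below synthesizes one sentence to a file. It assumes edge-tts is installed and that network access is available. The helper name `synthesize`, the voice `zh-CN-XiaoxiaoNeural`, and the output path are example choices, not values taken from this project.

```python
import asyncio


async def synthesize(text, voice, output_file):
    """Convert text to speech with edge-tts and save the audio to output_file."""
    # Imported lazily so this module still loads when edge-tts is not installed.
    import edge_tts

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(output_file)


# Example call (needs network access); the arguments are illustrative only:
# asyncio.run(synthesize("你好,世界", "zh-CN-XiaoxiaoNeural", "hello.mp3"))
```

Running `edge-tts --list-voices` prints the voices that can be passed as `voice`.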

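Based on the edge-tts source code, I wrote an EdgeTTS class that is more convenient to use and that also saves a subtitle file for a better user experience.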

import asyncio

from edge_tts import Communicate, SubMaker

# NOTE: list_voices_fn is assumed to be a helper defined elsewhere in the project;
# it returns the available voices as dicts with a 'ShortName' key.
# SubMaker.create_sub / generate_subs belong to the older edge-tts subtitle API;
# newer releases may differ.


class EdgeTTS:
    def __init__(self, list_voices=False, proxy=None) -> None:
        voices = list_voices_fn(proxy=proxy)
        self.SUPPORTED_VOICE = [item['ShortName'] for item in voices]
        self.SUPPORTED_VOICE.sort(reverse=True)
        if list_voices:
            print(", ".join(self.SUPPORTED_VOICE))

    def preprocess(self, rate, volume, pitch):
        # edge-tts expects signed strings such as '+10%', '-20%' and '+5Hz'.
        rate = f'+{rate}%' if rate >= 0 else f'{rate}%'
        pitch = f'+{pitch}Hz' if pitch >= 0 else f'{pitch}Hz'
        # volume is expected on a 0-100 scale and becomes a reduction from full volume.
        volume = f'-{100 - volume}%'
        return rate, volume, pitch

    def predict(self, TEXT, VOICE, RATE, VOLUME, PITCH, OUTPUT_FILE='result.wav', OUTPUT_SUBS='result.vtt', words_in_cue=8):
        async def amain() -> None:
            """Synthesize TEXT, writing audio to OUTPUT_FILE and subtitles to OUTPUT_SUBS."""
            rate, volume, pitch = self.preprocess(rate=RATE, volume=VOLUME, pitch=PITCH)
            communicate = Communicate(TEXT, VOICE, rate=rate, volume=volume, pitch=pitch)
            subs: SubMaker = SubMaker()
            # Write audio and collect word boundaries in one pass over the stream,
            # instead of streaming a second time through communicate.save().
            with open(OUTPUT_FILE, "wb") as audio_file:
                async for chunk in communicate.stream():
                    if chunk["type"] == "audio":
                        audio_file.write(chunk["data"])
                    elif chunk["type"] == "WordBoundary":
                        subs.create_sub((chunk["offset"], chunk["duration"]), chunk["text"])
            with open(OUTPUT_SUBS, "w", encoding="utf-8") as sub_file:
                sub_file.write(subs.generate_subs(words_in_cue))

        asyncio.run(amain())
        with open(OUTPUT_SUBS, 'r', encoding='utf-8') as file:
            vtt_lines = file.readlines()

        # Remove the spaces inside each line of subtitle text (suited to Chinese text);
        # timestamp lines containing "-->" are left unchanged.
        vtt_lines_without_spaces = [line.replace(" ", "") if "-->" not in line else line for line in vtt_lines]
        with open(OUTPUT_SUBS, 'w', encoding='utf-8') as output_file:
            output_file.writelines(vtt_lines_without_spaces)
        return OUTPUT_FILE, OUTPUT_SUBS
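
To make the two conversions in the class above concrete, here is a minimal standalone sketch that uses only the standard library. The function names `format_prosody` and `strip_cue_spaces` are illustrative and belong neither to this project nor to edge-tts. The logic mirrors `preprocess` and the subtitle clean-up step, assuming `volume` is given on a 0-100 scale.

```python
def format_prosody(rate, volume, pitch):
    """Turn numeric settings into the signed strings built by EdgeTTS.preprocess."""
    rate_str = f"+{rate}%" if rate >= 0 else f"{rate}%"
    pitch_str = f"+{pitch}Hz" if pitch >= 0 else f"{pitch}Hz"
    # A volume of 80 becomes a 20% reduction from full volume.
    volume_str = f"-{100 - volume}%"
    return rate_str, volume_str, pitch_str


def strip_cue_spaces(vtt_lines):
    """Remove spaces from cue text while leaving timestamp lines untouched."""
    return [line if "-->" in line else line.replace(" ", "") for line in vtt_lines]


print(format_prosody(10, 80, -5))  # ('+10%', '-20%', '-5Hz')
sample = ["00:00:00.100 --> 00:00:01.500\n", "你 好 世 界\n"]
print(strip_cue_spaces(sample))
```

Because the clean-up removes every space in a cue, it suits Chinese subtitles. English cues would lose the spaces between words.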

A simple WebUI is also provided in the src folder:

python app.py

PaddleTTS

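In practice, you may need to work offline. Because Edge TTS needs an internet connection to generate speech, we chose PaddleSpeech, which is also open source, as an alternative text-to-speech (TTS) backend. The output quality may differ, but PaddleSpeech runs offline. For more information, see the PaddleSpeech GitHub page.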

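Vocoder Notes

PaddleSpeech ships with three vocoders: PWGan, WaveRnn, and HifiGan. They differ considerably in audio quality and generation speed, so choose according to your needs. We recommend PWGan or HifiGan, because WaveRnn generates audio very slowly.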

| Vocoder | Audio quality | Generation speed |
| --- | --- | --- |
| PWGan | Medium | Medium |
| WaveRnn | | Very slow (be patient) |
| HifiGan | | |

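TTS Datasets

PaddleSpeech examples are organized mainly by dataset. The TTS datasets we mainly use are:

  • CSMSC (Mandarin, single speaker)
  • AISHELL3 (Mandarin, multiple speakers)
  • LJSpeech (English, single speaker)
  • VCTK (English, multiple speakers)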

PaddleSpeech TTS Model Mapping

The TTS example identifiers in PaddleSpeech correspond to the following models:

  • tts0 - Tacotron2
  • tts1 - TransformerTTS
  • tts2 - SpeedySpeech
  • tts3 - FastSpeech2
  • voc0 - WaveFlow
  • voc1 - Parallel WaveGAN
  • voc2 - MelGAN
  • voc3 - MultiBand MelGAN
  • voc4 - Style MelGAN
  • voc5 - HiFiGAN
  • vc0 - Tacotron2 Voice Clone with GE2E
  • vc1 - FastSpeech2 Voice Clone with GE2E
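
The mapping above is easy to express as a lookup table. The sketch below is illustrative only. The helper `describe_example` is hypothetical, and it assumes example identifiers of the form `<dataset>/<prefix>`, such as `csmsc/tts3`.

```python
# Prefix-to-model table copied from the list above.
MODEL_BY_PREFIX = {
    "tts0": "Tacotron2",
    "tts1": "TransformerTTS",
    "tts2": "SpeedySpeech",
    "tts3": "FastSpeech2",
    "voc0": "WaveFlow",
    "voc1": "Parallel WaveGAN",
    "voc2": "MelGAN",
    "voc3": "MultiBand MelGAN",
    "voc4": "Style MelGAN",
    "voc5": "HiFiGAN",
    "vc0": "Tacotron2 Voice Clone with GE2E",
    "vc1": "FastSpeech2 Voice Clone with GE2E",
}


def describe_example(path):
    """Split a '<dataset>/<prefix>' string and look up the model name."""
    dataset, prefix = path.strip("/").split("/")
    return dataset, MODEL_BY_PREFIX[prefix]


print(describe_example("csmsc/tts3"))  # ('csmsc', 'FastSpeech2')
```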

Pretrained Models

The following pretrained models provided by PaddleSpeech can be used from the command line and the Python API:

Acoustic models

| Model | Language |
| --- | --- |
| speedyspeech_csmsc | zh |
| fastspeech2_csmsc | zh |
| fastspeech2_ljspeech | en |
| fastspeech2_aishell3 | zh |
| fastspeech2_vctk | en |
| fastspeech2_cnndecoder_csmsc | zh |
| fastspeech2_mix | mix |
| tacotron2_csmsc | zh |
| tacotron2_ljspeech | en |
| fastspeech2_male | zh |
| fastspeech2_male | en |
| fastspeech2_male | mix |
| fastspeech2_canton | canton |

Vocoders

| Model | Language |
| --- | --- |
| pwgan_csmsc | zh |
| pwgan_ljspeech | en |
| pwgan_aishell3 | zh |
| pwgan_vctk | en |
| mb_melgan_csmsc | zh |
| style_melgan_csmsc | zh |
| hifigan_csmsc | zh |
| hifigan_ljspeech | en |
| hifigan_aishell3 | zh |
| hifigan_vctk | en |
| wavernn_csmsc | zh |
| pwgan_male | zh |
| hifigan_male | zh |

Based on PaddleSpeech, I wrote a PaddleTTS class that makes these models easier to use and run.

import subprocess

from paddlespeech.cli.tts.infer import TTSExecutor


class PaddleTTS:
    def __init__(self) -> None:
        pass

    def predict(self, text, am, voc, spk_id=174, lang='zh', male=False, save_path='output.wav'):
        self.tts = TTSExecutor()

        use_onnx = True
        voc = voc.lower()
        am = am.lower()

        if male:
            assert voc in ["pwgan", "hifigan"], "male voc must be 'pwgan' or 'hifigan'"
            wav_file = self.tts(
                text=text,
                output=save_path,
                am='fastspeech2_male',
                voc=voc + '_male',
                lang=lang,
                use_onnx=use_onnx
            )
            return wav_file

        assert am in ['tacotron2', 'fastspeech2'], "am must be 'tacotron2' or 'fastspeech2'"

        # Mixed Chinese-English synthesis
        if lang == 'mix':
            # Only fastspeech2 has a mix model
            am = 'fastspeech2_mix'
            voc += '_csmsc'
        # English synthesis
        elif lang == 'en':
            am += '_ljspeech'
            voc += '_ljspeech'
        # Chinese synthesis
        elif lang == 'zh':
            assert voc in ['wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'], "voc must be 'wavernn' or 'pwgan' or 'hifigan' or 'style_melgan' or 'mb_melgan'"
            am += '_csmsc'
            voc += '_csmsc'
        # Cantonese synthesis
        elif lang == 'canton':
            am = 'fastspeech2_canton'
            voc = 'pwgan_aishell3'
            spk_id = 10
        print("am:", am, "voc:", voc, "lang:", lang, "male:", male, "spk_id:", spk_id)
        try:
            # Pass the arguments as a list so that quotes in `text` cannot break the command.
            # check=True raises when the CLI exits with an error, which triggers the fallback
            # (os.system only returns an exit status, so the fallback could never run).
            cmd = ['paddlespeech', 'tts', '--am', am, '--voc', voc, '--input', text,
                   '--output', save_path, '--lang', lang, '--spk_id', str(spk_id),
                   '--use_onnx', str(use_onnx)]
            subprocess.run(cmd, check=True)
            wav_file = save_path
        except (OSError, subprocess.CalledProcessError):
            # Fall back to synthesis through the Python API
            wav_file = self.tts(
                text=text,
                output=save_path,
                am=am,
                voc=voc,
                lang=lang,
                spk_id=spk_id,
                use_onnx=use_onnx
            )
        return wav_file
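
The model selection inside `predict` can be read as a small pure function. The sketch below restates that branching without calling PaddleSpeech, so the chosen model names can be checked offline. `resolve_models` is a hypothetical helper, not part of this project or of PaddleSpeech. Unlike the class above, it raises on an unsupported `lang` instead of passing the names through unchanged.

```python
ZH_VOCODERS = ("wavernn", "pwgan", "hifigan", "style_melgan", "mb_melgan")


def resolve_models(am, voc, lang="zh", male=False, spk_id=174):
    """Return the (am, voc, spk_id) names that the branching in predict selects."""
    am, voc = am.lower(), voc.lower()
    if male:
        if voc not in ("pwgan", "hifigan"):
            raise ValueError("male voc must be 'pwgan' or 'hifigan'")
        # The male branch always uses the fastspeech2_male acoustic model.
        return "fastspeech2_male", f"{voc}_male", spk_id
    if am not in ("tacotron2", "fastspeech2"):
        raise ValueError("am must be 'tacotron2' or 'fastspeech2'")
    if lang == "mix":
        # Only fastspeech2 has a mixed Chinese-English model.
        return "fastspeech2_mix", f"{voc}_csmsc", spk_id
    if lang == "en":
        return f"{am}_ljspeech", f"{voc}_ljspeech", spk_id
    if lang == "zh":
        if voc not in ZH_VOCODERS:
            raise ValueError(f"voc must be one of {ZH_VOCODERS}")
        return f"{am}_csmsc", f"{voc}_csmsc", spk_id
    if lang == "canton":
        # Cantonese uses fixed models and speaker id 10.
        return "fastspeech2_canton", "pwgan_aishell3", 10
    raise ValueError(f"unsupported lang: {lang}")


print(resolve_models("FastSpeech2", "HifiGan"))
```

For example, `resolve_models("tacotron2", "pwgan", lang="en")` selects the two LJSpeech models listed in the tables above.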