On macOS, the voice output I generated with f5tts sounds terrible.

#12
by srkngl - opened

I didn't apply any additional voice customization and tested it using basic_ref_en from the examples. The output is incredibly bad. Is this normal, or is there some other issue?
You can access the resulting audio file via the following link: https://voca.ro/1jnxTUxSAF7i

f5-tts_infer-cli --model "F5-TTS" --ref_audio "/path/to//F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav" --ref_text "The content, subtitle or transcription of reference audio." --gen_text "Some text you want TTS model generate for you."

Hi, you need to provide a ref_text whose content matches the ref_audio, or leave it blank ("") if you want the ASR model to do the transcription.
The ref_text you provided, "The content, subtitle or transcription of reference audio.", is different from the content of the reference audio.
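
A quick way to sanity-check this before running inference is to compare the ref_text against what the clip actually says. The snippet below is only a minimal sketch and not part of F5-TTS: `words` and `ref_text_similarity` are hypothetical helper names, and the "spoken" sentence is the transcription of basic_ref_en.wav quoted later in this thread.

```python
import difflib
import re


def words(text: str) -> list[str]:
    """Lowercase the text and keep only its words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ref_text_similarity(ref_text: str, transcript: str) -> float:
    """Return a 0..1 word-level similarity between ref_text and a transcript."""
    return difflib.SequenceMatcher(None, words(ref_text), words(transcript)).ratio()


# Placeholder text from the CLI example vs. what the reference clip says.
placeholder = "The content, subtitle or transcription of reference audio."
spoken = "Some call me nature, others call me mother nature."

print(ref_text_similarity(placeholder, spoken))  # no words in common
print(ref_text_similarity(spoken, spoken))       # identical text
```

A score near zero, as with the placeholder text here, suggests the ref_text does not describe the reference audio at all.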

Thanks for the reply, but nothing changed.
My latest attempt is as follows:

f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "/path/to/F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav" \
--ref_text "Some call me nature, others call me mother nature" \
--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring.."

What is the output if you simply run f5-tts_infer-cli?

f5-tts_infer-cli

Download Vocos from huggingface charactr/vocos-mel-24khz
Using F5-TTS...

vocab : /path/to/F5-TTS/src/f5_tts/infer/examples/vocab.txt
token : custom
model : /path/to/.cache/huggingface/hub/models--SWivid--F5-TTS/snapshots/4dcc16f297f2ff98a17b3726b16f5de5a5e45672/F5TTS_Base/model_1200000.safetensors

Converting audio...
Using custom reference text...
ref_text Some call me nature, others call me mother nature.
Voice: main
Ref_audio: /var/folders/71/g62n33f17hg8r2_tq0x0_ghh0000gn/T/tmp3odgw64i.wav
Ref_text: Some call me nature, others call me mother nature.
No voice tag found, using main.
Voice: main
gen_text 0 I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring.
Generating audio in 1 batches...
0%| | 0/1 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/71/g62n33f17hg8r2_tq0x0_ghh0000gn/T/jieba.cache
Loading model cost 0.242 seconds.
Prefix dict has been built successfully.
/path/to/miniconda3/envs/f5-tts/lib/python3.10/site-packages/vocos/spectral_ops.py:46: UserWarning: The operator 'aten::unfold_backward' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
return torch.istft(spec, self.n_fft, self.hop_length, self.win_length, self.window, center=True)
100%|████████████████████████████████████████| 1/1 [01:03<00:00, 63.89s/it]
tests/infer_cli_out.wav

Maybe you could check whether you have modified the .toml file.
I also noticed that you used a path like: --ref_audio "/path/to//F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav"
and I am not sure the model will work with that.
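
If the concern is the doubled slash in that path, it is unlikely to matter by itself: on POSIX systems such as macOS and Linux, repeated slashes in the middle of a path are treated as a single slash. A minimal sketch using only Python's standard library shows that both spellings normalize to the same path (the `/path/to` prefix is the placeholder from the thread, not a real directory):

```python
import posixpath

# The path as typed in the first command, with a doubled slash after "to".
doubled = "/path/to//F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav"
# The same path with a single slash.
single = "/path/to/F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav"

# normpath collapses redundant separators inside the path.
print(posixpath.normpath(doubled) == single)
```

This only compares the two strings; it does not check that the file exists.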

I did not make any changes to the toml files.
I installed it according to the instructions on git. Then I tried the examples on git, but the output was strangely problematic.

The directory and file are accessible and appear to be correct (if the directory is invalid, it throws an error). If you would suggest a different use case, I can try it.

Yes, if you have a Linux or Windows device, it would be easier for us to help with potential problems. Sorry, we don't have a Mac at hand to test on.
You could also try the online demos, e.g.
https://huggingface.co/spaces/mrfakename/E2-F5-TTS (the synced demo currently has some issues; if it does not work, try the latter two)
https://huggingface.co/spaces/abidlabs/E2-F5-TTS
https://modelscope.cn/studios/modelscope/E2-F5-TTS

Unfortunately, I only have a Mac at hand.

Thanks for the demos, I'm looking into them.
Also, if you need it, I can help you with your tests on a Mac.

Thanks :>
It might be hard to test remotely lol; try the online demos, we hope you like them.
We will surely test on Mac as we keep progressing.

I reinstalled without conda and managed to get a proper sound file.
I think NVIDIA CUDA (cu118) is not supported on Mac M2.

Anyone with a similar problem can install it as follows:

pip install torch torchvision torchaudio
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

# Optional (if you are getting a PyTorch MPS error)
export PYTORCH_ENABLE_MPS_FALLBACK=1
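
To check which backend PyTorch will actually use after such an install, something like the following can help. It is only a minimal sketch: `pick_device` is a hypothetical helper, not part of F5-TTS, and setting the fallback flag before torch is imported is an assumption about when PyTorch reads it.

```python
import os

# Assumption: the fallback flag should be in place before torch is imported.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed, so there is nothing to probe
    if torch.cuda.is_available():
        return "cuda"
    # Older torch versions have no MPS backend module, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print("device:", pick_device())
```

On an Apple Silicon Mac with a standard PyTorch install, this would be expected to report mps rather than cuda.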

One more thing: I voiced a 50-second text with a 10-second reference audio file on the English model. The cloning it did from the 10-second reference is surprisingly good. I did not expect such a successful result from such a simple setup. Congratulations.

srkngl changed discussion status to closed
