SWivid/F5-TTS · On macOS, the voice output I generated with f5tts sounds terrible.

3 days ago

On macOS, the voice output I generated with f5tts sounds terrible.
I didn't apply any additional voice customization and tested with basic_ref_en from the examples. The output sounds very bad. Is this normal, or is something else wrong?
You can listen to the resulting audio file at the following link: https://voca.ro/1jnxTUxSAF7i


f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "/path/to//F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."

SWivid

Owner 3 days ago

Hi, you need to provide a ref_text whose content matches the ref_audio, or leave it blank ("") if you want the ASR model to do the transcription.
The ref_text you provided, "The content, subtitle or transcription of reference audio.", is different from what is spoken in the reference audio.
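
As a rough illustration of the point above, here is a minimal Python sketch that assembles the same command with only the flags already shown in this thread. The helper name build_infer_command is hypothetical, and the sketch only prints the command rather than running it. Leaving ref_text at its default of "" corresponds to letting the ASR model transcribe the reference audio.

```python
import shlex


def build_infer_command(ref_audio, gen_text, ref_text=""):
    """Assemble an f5-tts_infer-cli argument list (hypothetical helper).

    Only flags that appear in this thread are used. An empty ref_text
    means the reference audio is transcribed automatically.
    """
    return [
        "f5-tts_infer-cli",
        "--model", "F5-TTS",
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


cmd = build_infer_command(
    "/path/to/F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav",
    "Some text you want TTS model generate for you.",
)

# Print a copy-pastable shell command; nothing is executed here.
print(shlex.join(cmd))
```

If you do know the transcript, pass it as the third argument so that it matches the reference audio word for word.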

srkngl

3 days ago

edited 3 days ago

Thanks for the reply, but nothing changed.
My latest attempt is as follows:

f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "/path/to/F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav" \
--ref_text "Some call me nature, others call me mother nature" \
--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."

SWivid

Owner 3 days ago

What is the output if you simply run f5-tts_infer-cli?

srkngl

3 days ago

edited 3 days ago


f5-tts_infer-cli

Download Vocos from huggingface charactr/vocos-mel-24khz
Using F5-TTS...
vocab :  /path/to/F5-TTS/src/f5_tts/infer/examples/vocab.txt
token :  custom
model :  /path/to/.cache/huggingface/hub/models--SWivid--F5-TTS/snapshots/4dcc16f297f2ff98a17b3726b16f5de5a5e45672/F5TTS_Base/model_1200000.safetensors
Converting audio...
Using custom reference text...
ref_text   Some call me nature, others call me mother nature.
Voice: main
Ref_audio: /var/folders/71/g62n33f17hg8r2_tq0x0_ghh0000gn/T/tmp3odgw64i.wav
Ref_text: Some call me nature, others call me mother nature.
No voice tag found, using main.
Voice: main
gen_text 0 I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring.
Generating audio in 1 batches...
  0%|          | 0/1 [00:00<?, ?it/s]
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/71/g62n33f17hg8r2_tq0x0_ghh0000gn/T/jieba.cache
Loading model cost 0.242 seconds.
Prefix dict has been built successfully.
/path/to/miniconda3/envs/f5-tts/lib/python3.10/site-packages/vocos/spectral_ops.py:46: UserWarning: The operator 'aten::unfold_backward' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
  return torch.istft(spec, self.n_fft, self.hop_length, self.win_length, self.window, center=True)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:03<00:00, 63.89s/it]
tests/infer_cli_out.wav

SWivid

Owner 3 days ago

Maybe you could check whether you have modified the .toml file.
I also noticed that you used a path like: --ref_audio "/path/to//F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav"
and I am not sure the model will work with it.

srkngl

3 days ago

I did not make any changes to the toml files.
I installed it following the instructions in the git repository. Then I tried the examples from the repository, but the output was strangely problematic.

The directory and file are accessible and appear to be correct (if the directory is invalid, it throws an error). If there is a different use case you would like me to try, I can do that.
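
On the doubled slash in the first command: on POSIX systems such as macOS, repeated slashes in the middle of a path are treated as a single slash, so both spellings should resolve to the same file. A small sketch using only the Python standard library shows the normalization (the path is the placeholder from this thread, not a real location):

```python
import posixpath

# Path as typed in the first command, with the doubled slash.
doubled = "/path/to//F5-TTS/src/f5_tts/infer/examples/basic/basic_ref_en.wav"

# normpath collapses the repeated separator in the middle of the path.
normalized = posixpath.normpath(doubled)

print(normalized)
```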

SWivid

Owner 3 days ago

edited 3 days ago

Yes, if you have a Linux or Windows device, that would make it easier for us to help with potential problems. Sorry that we don't have a Mac at hand to test.
You could also try the online demos, e.g.
https://huggingface.co/spaces/mrfakename/E2-F5-TTS (the synced demo currently has some issues; if it does not work, try the latter two)
https://huggingface.co/spaces/abidlabs/E2-F5-TTS
https://modelscope.cn/studios/modelscope/E2-F5-TTS

srkngl

3 days ago

Unfortunately, I only have a Mac at hand.

Thanks for the demos, I'm looking into them.
Also, if you need it, I can help you with your tests on a Mac.

SWivid

Owner 3 days ago

thanks :>
Remote testing might be hard lol. Try the online demos, we hope you like them.
We will surely test on Mac as we keep progressing.

srkngl

2 days ago

edited 2 days ago

I reinstalled without conda and managed to get a proper sound file.
I think NVIDIA CUDA (cu118) is not supported on Mac M2.

For anyone with a similar problem, you can install it as follows:
pip install torch torchvision torchaudio
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

# optional (if you are getting a PyTorch MPS error)
export PYTORCH_ENABLE_MPS_FALLBACK=1
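
To check which backend the reinstalled PyTorch actually sees, something like the following sketch may help. The function name pick_device is hypothetical, and setting the fallback variable before importing torch is my assumption about when it needs to be in place. On an Apple Silicon Mac I would expect "mps" or "cpu", never "cuda".

```python
import os

# Assumption: the variable should be set before torch is imported so that
# operators missing on MPS can fall back to the CPU.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device():
    """Return "cuda", "mps", or "cpu"; None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print("Detected device:", pick_device())
```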

To be specific: I voiced a 50-second text using a 10-second reference audio file with the English model. The cloning it did from the 10-second reference is surprisingly good. I did not expect such a good result from such a simple setup. Congratulations.

Output I produced with cli:

srkngl changed discussion status to closed 2 days ago