---
language:
- id
license: mit
base_model: microsoft/speecht5_tts
tags:
- text-to-speech
datasets:
- mozilla-foundation/common_voice_16_1
model-index:
- name: speecht5_finetuned_commonvoice_id
  results: []
---
# speecht5_finetuned_commonvoice_id

This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the mozilla-foundation/common_voice_16_1 dataset. It achieves the following results on the evaluation set:
- Loss: 0.4675
## How to use/inference

Follow the example below and adapt it to your needs. The script imports a local `utils` helper for speaker embeddings; a sketch of that helper follows the script.
```python
# ft_t5_id_inference.py
import sounddevice as sd
import torch
import torchaudio
from datasets import Audio, load_dataset
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)

from utils import create_speaker_embedding

# load the dataset and the fine-tuned model
dataset = load_dataset(
    "mozilla-foundation/common_voice_16_1", "id", split="test")
model = SpeechT5ForTextToSpeech.from_pretrained(
    "Bagus/speecht5_finetuned_commonvoice_id")

# process the text with the original checkpoint's processor
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


def prepare_dataset(example):
    audio = example["audio"]
    example = processor(
        text=example["sentence"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )
    # strip off the batch dimension
    example["labels"] = example["labels"][0]
    # use SpeechBrain to obtain an x-vector speaker embedding
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])
    return example


# prepare the speaker embeddings from one utterance of the dataset
example = prepare_dataset(dataset[30])
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

# prepare the text to be converted to speech
text = "Saya suka baju yang berwarna merah tua."  # "I like dark-red clothes."
inputs = processor(text=text, return_tensors="pt")

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# play the generated waveform (SpeechT5 outputs 16 kHz audio)
sampling_rate = 16000
sd.play(speech, samplerate=sampling_rate, blocking=True)

# save the audio; torchaudio expects a 2D (channels, samples) tensor
torchaudio.save("output_t5_ft_cv16_id.wav", speech.unsqueeze(0), 16000)
```
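The `create_speaker_embedding` helper comes from a local `utils` module that is not shipped with this card. Below is a minimal sketch of such a helper, assuming the SpeechBrain x-vector encoder (`speechbrain/spkrec-xvect-voxceleb`) commonly used when fine-tuning SpeechT5; verify it matches the encoder that produced the speaker embeddings at training time:

```python
# utils.py -- a minimal sketch, not the original helper from this repository
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.classifiers in SpeechBrain >= 1.0

# assumed x-vector speaker encoder; its 512-dim output matches the
# speaker-embedding size SpeechT5 expects
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": "cpu"},
)


def create_speaker_embedding(waveform):
    """Return a normalized 512-dim x-vector for a 16 kHz mono waveform."""
    with torch.no_grad():
        embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        embeddings = torch.nn.functional.normalize(embeddings, dim=2)
    return embeddings.squeeze().cpu().numpy()
```

`sd.play` accepts the CPU tensor returned by `generate_speech` directly; if your sounddevice version complains about the input type, call `speech.numpy()` first.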
## Training hyperparameters

The following hyperparameters were used during training (a sketch of how they map onto `Seq2SeqTrainingArguments` follows the list):
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 4000
- mixed_precision_training: Native AMP
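As a rough illustration, these values correspond to the following `transformers.Seq2SeqTrainingArguments`; `output_dir` is an assumption, and the Adam betas/epsilon and linear scheduler are the library defaults, shown here only for completeness:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_commonvoice_id",  # assumed output path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # 4 x 8 = total train batch size of 32
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
    fp16=True,  # "Native AMP" mixed-precision training
)
```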
## Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.5394        | 4.28  | 1000 | 0.4908          |
| 0.5062        | 8.56  | 2000 | 0.4730          |
| 0.5074        | 12.83 | 3000 | 0.4700          |
| 0.5023        | 17.11 | 4000 | 0.4675          |
## Framework versions
- Transformers 4.35.2
- PyTorch 2.1.1+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0