How to use it with Transformers?
I am using it with this code:
from transformers import (
    HubertModel,
    Wav2Vec2FeatureExtractor,
)
import torch
from datasets import load_dataset

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_demo", "clean", split="validation"
)
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = Wav2Vec2FeatureExtractor.from_pretrained("utter-project/mHuBERT-147")
model = HubertModel.from_pretrained("utter-project/mHuBERT-147")

# audio file is decoded on the fly
inputs = processor(
    dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs)
Would it work as intended? Other HuBERT models are used like this.
Hi! Thanks for the interest. Please note this checkpoint is not an ASR model: this is a speech representation model.
In order to use it as you want, you need to first fine-tune it using ASR data. You can check Hugging Face tutorials for this, such as this one: https://huggingface.co/blog/fine-tune-xlsr-wav2vec2 (just replace the wav2vec2 classes with HuBERT).
Best,
Hi, thank you for your quick response!
I know that it is a speech representation model. What I am doing here is extracting audio features to use in a downstream task: I would like to extract the features and build a text-to-speech system on top of them. I think that during training the mHuBERT-147 parameters are frozen. In this case, am I taking the right steps to extract the features of an audio?
Best
Hi again,
I see! Sorry for the confusion.
Feature extraction should work as for any other HuBERT model.
mHuBERT-147 is a base architecture, so your output features will be of dimension 768.
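For reference, extraction can look roughly like the sketch below (audio_array stands in for your 16 kHz waveform):
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

processor = Wav2Vec2FeatureExtractor.from_pretrained("utter-project/mHuBERT-147")
model = HubertModel.from_pretrained("utter-project/mHuBERT-147")
model.eval()  # inference mode; parameters stay frozen as long as you never update them

inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

features = outputs.last_hidden_state    # shape: (batch, frames, 768)
layer_features = outputs.hidden_states  # per-layer outputs, if you prefer an intermediate layer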
If you face any issues, let me know. :)
@mzboito
It'd be really helpful if you could provide a Colab for the ASR training. I tried to follow the tutorial (https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) but I got eval_wer=1.0 after hours of training. I was able to successfully train on my dataset with wav2vec2-xls-r-300m. Thanks!
tokenizer = Wav2Vec2CTCTokenizer(
    "tokenizer/vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
model_id = "utter-project/mHuBERT-147"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = HubertForCTC.from_pretrained(
    model_id,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

def prepare_dataset(batch):
    audio = batch["audio"]
    # batched output is "un-batched"
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    with processor.as_target_processor():
        batch["labels"] = processor(batch["transcription"]).input_ids
    return batch
# ...
training_args = TrainingArguments(
    output_dir="training",
    report_to="tensorboard",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    eval_strategy="steps",
    num_train_epochs=50,
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=100,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=5,
    push_to_hub=False,
    load_best_model_at_end=True,
)
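The rest follows the tutorial; roughly, the remaining pieces (applying prepare_dataset, a WER metric, and the Trainer) look like this, with train_dataset/eval_dataset standing in for my prepared splits:
import numpy as np
import evaluate
from transformers import Trainer

# map the raw splits through prepare_dataset (column names depend on the dataset)
train_dataset = train_dataset.map(prepare_dataset, remove_columns=train_dataset.column_names)
eval_dataset = eval_dataset.map(prepare_dataset, remove_columns=eval_dataset.column_names)

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # -100 marks ignored label positions; map them back to the pad token before decoding
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)
trainer.train()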
Hi. Which dataset/language are you fine-tuning on?
I recommend you tune your hyper-parameters for mHuBERT, especially because XLS-R is much larger.
If you are using the same datasets we use in the paper, you can check our hyper-parameters there (https://arxiv.org/abs/2406.06371).
In my experience, the optimal training hyper-parameters tend to vary quite a lot across different SSL backbones.
Adjusting things like the learning rate and warm-up ratio, and increasing dropout (if the dataset is small), will help the model converge better.
@mzboito I used the OpenSLR 42 dataset (https://www.openslr.org/42/) or (https://huggingface.co/datasets/openslr/openslr). I honestly don't know what to do next. The train loss and eval loss don't seem to go down.
Hyper-parameters are very important. They can be the difference between a good ASR model and 100% WER / no convergence.
Just because an ASR fine-tuning recipe with given hyper-parameters works for one SSL backbone does not mean it will automatically work on a different model. In my previous message I mentioned important hyper-parameters you could explore for better convergence.
I unfortunately do not have the time to provide you with a recipe for this particular dataset, but I did train few-shot ASR models on Khmer using FLEURS-102.
The hyper-parameters are very probably not optimal for Khmer (I optimized them using other languages), but they worked. They were:
evaluation_strategy: "steps"
num_train_epochs: 100
fp16: False
gradient_checkpointing: True
eval_steps: 500
save_steps: 500
logging_steps: 500
learning_rate: 1e-5
adam_beta1: 0.9
adam_beta2: 0.98
adam_epsilon: 1e-08
warmup_ratio: 0.1
save_total_limit: 2
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
push_to_hub: False
load_best_model_at_end: True
metric_for_best_model: "loss"
greater_is_better: False
Moreover, for those experiments, the mHuBERT model's final_dropout was set to 0.3.
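In Transformers terms, that setup maps roughly onto something like the following sketch (the processor wiring is the same as in the earlier snippets, and the output path is a placeholder):
from transformers import HubertForCTC, TrainingArguments

model = HubertForCTC.from_pretrained(
    "utter-project/mHuBERT-147",
    final_dropout=0.3,  # raised from the default, as mentioned above
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

training_args = TrainingArguments(
    output_dir="mhubert-147-asr",        # placeholder output path
    eval_strategy="steps",               # `evaluation_strategy` on older transformers versions
    num_train_epochs=100,
    fp16=False,
    gradient_checkpointing=True,
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    warmup_ratio=0.1,
    save_total_limit=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    push_to_hub=False,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)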
This is my Colab for fine-tuning mHuBERT-147 on OpenSLR 42:
https://colab.research.google.com/drive/1m9Wdbnv8S7G4UzIu1QDYho6bCyPFeJWP?usp=sharing
After 10 hours of training, I managed to get to 44.02 WER and 0.57 eval loss. I think the previous issue was because I used fp16 instead of fp32.
Hi, for comparison, what is the result with XLS-R?
Still training. Currently at 27% WER. XLS-R got around ~16% WER.
Hi, increasing final_dropout from 0.1 to 0.3 could improve the results a little bit. mHuBERT is 3x smaller than XLS-R.
Hi @mzboito , I am also having trouble getting mHuBERT to converge on my target language (Arabic). I am using a custom dataset, but I have already trained it successfully with wav2vec2-XLS-R in around 8 hours. In mHuBERT's case, however, it has not converged even after 20 hours (using a Colab T4).
Here are the arguments I used:
model = HubertForCTC.from_pretrained(
    saved_model,
    attention_dropout=0.3,
    hidden_dropout=0.3,
    feat_proj_dropout=0.3,
    mask_time_prob=0.05,
    layerdrop=0.3,
    final_dropout=0.3,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
training_args = TrainingArguments(
    output_dir=saved_model,
    group_by_length=False,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=3,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    num_train_epochs=20,
    logging_dir="hubert",
    fp16=False,
    save_steps=400,
    eval_steps=100,
    logging_steps=100,
    learning_rate=3e-4,
    # learning_rate=1e-5,
    adam_beta2=0.98,
    warmup_ratio=0.1,
    save_total_limit=3,
)
Please suggest any solutions if possible. Thanks again for open-sourcing this small model!
Hi,
I would suggest sorting your data by length to minimize padding.
You can also increase the warm-up ratio and/or the number of epochs. What do the training curves look like?
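For example, with datasets this can be as simple as the following (assuming the input_length column added in prepare_dataset above):
# sort by precomputed audio length so batches contain similarly sized examples
train_dataset = train_dataset.sort("input_length")

# alternatively, let the Trainer bucket similar lengths together:
# TrainingArguments(..., group_by_length=True, length_column_name="input_length")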
I wrote this blog post, which includes an ASR component. The difference is that I add a couple of simple MLP layers before the vocabulary projection, which I find helps the model converge faster and reach a better result. Maybe it can help:
https://huggingface.co/blog/mzboito/naver-demo-french-slu
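As a rough illustration of that idea (not the exact code from the post), one way to place a small MLP before the CTC vocabulary projection is to swap out lm_head on HubertForCTC:
import torch.nn as nn
from transformers import HubertForCTC

model = HubertForCTC.from_pretrained(
    "utter-project/mHuBERT-147",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

hidden = model.config.hidden_size  # 768 for the base architecture
vocab = model.config.vocab_size

# HubertForCTC applies self.lm_head to the encoder output, so replacing that
# single Linear with MLP -> projection leaves the CTC loss computation untouched.
# The new layers are randomly initialized and learned during fine-tuning.
model.lm_head = nn.Sequential(
    nn.Linear(hidden, hidden),
    nn.GELU(),
    nn.Linear(hidden, hidden),
    nn.GELU(),
    nn.Linear(hidden, vocab),
)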