How to use it with Transformers?
I am using it with this code:
from transformers import (
    HubertModel,
    Wav2Vec2FeatureExtractor,
)
import torch
from datasets import load_dataset

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_demo", "clean", split="validation"
)
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = Wav2Vec2FeatureExtractor.from_pretrained("utter-project/mHuBERT-147")
model = HubertModel.from_pretrained("utter-project/mHuBERT-147")

# audio file is decoded on the fly
inputs = processor(
    dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs)
Would it work as intended? Other HuBERT models are used like this.
Hi! Thanks for the interest. Please note this checkpoint is not an ASR model: this is a speech representation model.
In order to use it as you want, you need to first fine-tune it using ASR data. You can check Hugging Face tutorials for this, such as this one: https://huggingface.co/blog/fine-tune-xlsr-wav2vec2 (just replace the wav2vec2 classes with HuBERT).
Best,
Hi, thank you for your quick response!
I know that it is a speech representation model. What I am doing here is extracting audio features to use in a downstream task: I would like to extract the features and build a text-to-speech system on top of them. I think that during training the mHuBERT-147 parameters are frozen. In this case, am I taking the right steps to extract the features of an audio?
Best
Hi again,
I see! Sorry for the confusion.
Feature extraction should work as for any other HuBERT model.
mHuBERT-147 is a base architecture, so your output features will be of dimension 768.
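For reference, extraction can look roughly like the sketch below (audio_array stands in for your 16 kHz waveform):
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

processor = Wav2Vec2FeatureExtractor.from_pretrained("utter-project/mHuBERT-147")
model = HubertModel.from_pretrained("utter-project/mHuBERT-147")
model.eval()  # inference mode; parameters stay frozen as long as you never update them

inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

features = outputs.last_hidden_state    # shape: (batch, frames, 768)
layer_features = outputs.hidden_states  # per-layer outputs, if you prefer an intermediate layer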
If you face any issues, let me know. :)
@mzboito
It'd be really helpful if you could provide a Colab for the ASR training. I tried to follow the tutorial (https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) but I got eval_wer=1.0 after hours of training. I was able to successfully train on my dataset with wav2vec2-xls-r-300m. Thanks!
tokenizer = Wav2Vec2CTCTokenizer(
    "tokenizer/vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
model_id = "utter-project/mHuBERT-147"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = HubertForCTC.from_pretrained(
    model_id,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

def prepare_dataset(batch):
    audio = batch["audio"]
    # batched output is "un-batched"
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    with processor.as_target_processor():
        batch["labels"] = processor(batch["transcription"]).input_ids
    return batch
# ...
training_args = TrainingArguments(
    output_dir="training",
    report_to="tensorboard",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    eval_strategy="steps",
    num_train_epochs=50,
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=100,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=5,
    push_to_hub=False,
    load_best_model_at_end=True,
)
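The rest follows the tutorial; roughly, the remaining pieces (applying prepare_dataset, a WER metric, and the Trainer) look like this, with train_dataset/eval_dataset standing in for my prepared splits:
import numpy as np
import evaluate
from transformers import Trainer

# map the raw splits through prepare_dataset (column names depend on the dataset)
train_dataset = train_dataset.map(prepare_dataset, remove_columns=train_dataset.column_names)
eval_dataset = eval_dataset.map(prepare_dataset, remove_columns=eval_dataset.column_names)

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # -100 marks ignored label positions; map them back to the pad token before decoding
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)
trainer.train()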
Hi. Which dataset/language are you fine-tuning on?
I recommend you tune your hyper-parameters for mHuBERT, especially because XLS-R is much larger.
If you are using the same datasets we use in the paper, you can check our hyper-parameters there (https://arxiv.org/abs/2406.06371).
In my experience, the optimal training hyper-parameters tend to vary quite a lot across different SSL backbones.
Adjusting things like the learning rate and warm-up ratio, and increasing dropout (if the dataset is small), will help the model converge better.
@mzboito I used the OpenSLR 42 dataset (https://www.openslr.org/42/) or (https://huggingface.co/datasets/openslr/openslr). I honestly don't know what to do next. The train loss and eval loss don't seem to go down.
Hyper-parameters are very important. They can be the difference between a good ASR model and 100% WER / no convergence.
Just because an ASR fine-tuning recipe with given hyper-parameters works for one SSL backbone does not mean it will automatically work on a different model. In my previous message I mentioned important hyper-parameters you could explore for better convergence.
I unfortunately do not have the time to provide you with a recipe for this particular dataset, but I did train few-shot ASR models on Khmer using FLEURS-102.
The hyper-parameters are very probably not optimal for Khmer (I optimized them using other languages), but they worked. They were:
evaluation_strategy: "steps"
num_train_epochs: 100
fp16: False
gradient_checkpointing: True
eval_steps: 500
save_steps: 500
logging_steps: 500
learning_rate: 1e-5
adam_beta1: 0.9
adam_beta2: 0.98
adam_epsilon: 1e-08
warmup_ratio: 0.1
save_total_limit: 2
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
push_to_hub: False
load_best_model_at_end: True
metric_for_best_model: "loss"
greater_is_better: False
Moreover, for those experiments, the mHuBERT model's final_dropout was set to 0.3.
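In Transformers terms, that setup maps roughly onto something like the following sketch (the processor wiring is the same as in the earlier snippets, and the output path is a placeholder):
from transformers import HubertForCTC, TrainingArguments

model = HubertForCTC.from_pretrained(
    "utter-project/mHuBERT-147",
    final_dropout=0.3,  # raised from the default, as mentioned above
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

training_args = TrainingArguments(
    output_dir="mhubert-147-asr",        # placeholder output path
    eval_strategy="steps",               # `evaluation_strategy` on older transformers versions
    num_train_epochs=100,
    fp16=False,
    gradient_checkpointing=True,
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    warmup_ratio=0.1,
    save_total_limit=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    push_to_hub=False,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)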
This is my Colab for fine-tuning mHuBERT-147 on OpenSLR 42:
https://colab.research.google.com/drive/1m9Wdbnv8S7G4UzIu1QDYho6bCyPFeJWP?usp=sharing
After 10 hours of training, I managed to get to 44.02 WER and 0.57 eval loss. I think the previous issue was because I used fp16 instead of fp32.
Hi, for comparison, what is the result with XLS-R?
Still training. Currently at 27% WER. XLS-R got around ~16% WER.
Hi, increasing final_dropout from 0.1 to 0.3 could improve the results a little bit. mHuBERT is 3x smaller than XLS-R.
Hi @mzboito , I am also having trouble getting mHuBERT to converge on my target language (Arabic). I am using a custom dataset, but I have already trained it successfully with wav2vec2-XLS-R in around 8 hours. In mHuBERT's case, however, it has not converged even after 20 hours (using a Colab T4).
Here are the arguments I used:
model = HubertForCTC.from_pretrained(
    saved_model,
    attention_dropout=0.3,
    hidden_dropout=0.3,
    feat_proj_dropout=0.3,
    mask_time_prob=0.05,
    layerdrop=0.3,
    final_dropout=0.3,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
training_args = TrainingArguments(
    output_dir=saved_model,
    group_by_length=False,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=3,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    num_train_epochs=20,
    logging_dir="hubert",
    fp16=False,
    save_steps=400,
    eval_steps=100,
    logging_steps=100,
    learning_rate=3e-4,
    # learning_rate=1e-5,
    adam_beta2=0.98,
    warmup_ratio=0.1,
    save_total_limit=3,
)
Please suggest any solutions if possible. Thanks again for open-sourcing this small model!
Hi,
I would suggest sorting your data by length to minimize padding.
You can also increase the warm-up ratio and/or the number of epochs. What do the training curves look like?
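For example, with datasets this can be as simple as the following (assuming the input_length column added in prepare_dataset above):
# sort by precomputed audio length so batches contain similarly sized examples
train_dataset = train_dataset.sort("input_length")

# alternatively, let the Trainer bucket similar lengths together:
# TrainingArguments(..., group_by_length=True, length_column_name="input_length")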
I wrote this blog post, which includes an ASR component. The difference is that I add a couple of simple MLP layers before the vocabulary projection, which I find helps the model converge faster and reach a better result. Maybe it can help:
https://huggingface.co/blog/mzboito/naver-demo-french-slu
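As a rough illustration of that idea (not the exact code from the post), one way to place a small MLP before the CTC vocabulary projection is to swap out lm_head on HubertForCTC:
import torch.nn as nn
from transformers import HubertForCTC

model = HubertForCTC.from_pretrained(
    "utter-project/mHuBERT-147",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

hidden = model.config.hidden_size  # 768 for the base architecture
vocab = model.config.vocab_size

# HubertForCTC applies self.lm_head to the encoder output, so replacing that
# single Linear with MLP -> projection leaves the CTC loss computation untouched.
# The new layers are randomly initialized and learned during fine-tuning.
model.lm_head = nn.Sequential(
    nn.Linear(hidden, hidden),
    nn.GELU(),
    nn.Linear(hidden, hidden),
    nn.GELU(),
    nn.Linear(hidden, vocab),
)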