|
---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- f1
model-index:
- name: weights
  results: []
datasets:
- librispeech_asr
---
|
|
|
|
|
|
# wav2vec2-large-xlsr-53-gender-recognition-librispeech |
|
|
|
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on Librispeech-clean-100 for gender recognition. |
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.0061 |
|
- F1: 0.9993 |
|
|
|
### Run inference
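If you only need predictions for a few audio files, the standard `transformers` audio-classification pipeline should also work (a minimal sketch; the file path is a placeholder):

```python
from transformers import pipeline

# Audio-classification pipeline built on top of this checkpoint
classifier = pipeline(
    "audio-classification",
    model="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
)

print(classifier("path/to/audio.wav"))  # e.g. [{"label": "male", "score": ...}, ...]
```

For batched inference over a whole dataset with custom audio loading and truncation, the collator-based example below can be used instead.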
|
|
|
```python
import numpy as np
import torch
import torchaudio
from torch.utils.data import DataLoader
from typing import Dict, List, Optional, Union

from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Wav2Vec2Processor,
)


class DataColletor:
    def __init__(
        self,
        processor: Wav2Vec2Processor,
        sampling_rate: int = 16000,
        padding: Union[bool, str] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        label2id: Dict = None,
        max_audio_len: int = 5
    ):
        self.processor = processor
        self.sampling_rate = sampling_rate

        self.padding = padding
        self.max_length = max_length
        self.pad_to_multiple_of = pad_to_multiple_of

        self.label2id = label2id

        # Maximum audio length in seconds; longer clips are truncated
        self.max_audio_len = max_audio_len

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Each feature holds the path to an audio file under "input_values";
        # load, preprocess, and pad the whole batch here
        input_features = []

        for feature in features:
            speech_array, sampling_rate = torchaudio.load(feature["input_values"])

            # Downmix to mono
            speech_array = torch.mean(speech_array, dim=0, keepdim=True)

            # Resample to the target sampling rate if necessary
            if sampling_rate != self.sampling_rate:
                transform = torchaudio.transforms.Resample(sampling_rate, self.sampling_rate)
                speech_array = transform(speech_array)
                sampling_rate = self.sampling_rate

            # Truncate clips longer than max_audio_len seconds
            effective_size_len = sampling_rate * self.max_audio_len
            if speech_array.shape[-1] > effective_size_len:
                speech_array = speech_array[:, :effective_size_len]

            speech_array = speech_array.squeeze().numpy()
            input_tensor = self.processor(speech_array, sampling_rate=sampling_rate).input_values
            input_tensor = np.squeeze(input_tensor)

            input_features.append({"input_values": input_tensor})

        # Pad all examples in the batch to the same length
        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        return batch


label2id = {
    "female": 0,
    "male": 1
}

id2label = {
    0: "female",
    1: "male"
}

num_labels = 2

feature_extractor = AutoFeatureExtractor.from_pretrained("alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech")
model = AutoModelForAudioClassification.from_pretrained(
    pretrained_model_name_or_path="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

data_collator = DataColletor(
    feature_extractor,
    sampling_rate=16000,
    padding=True,
    label2id=label2id
)

# test_dataset should yield dicts with the audio file path under "input_values"
test_dataloader = DataLoader(
    dataset=test_dataset,
    batch_size=16,
    collate_fn=data_collator,
    shuffle=False,
    num_workers=10
)

preds = predict(test_dataloader=test_dataloader, model=model)
```
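The example above references a `predict` helper and a `test_dataset` that are not defined in the card. A minimal sketch of both, assuming the test set is simply a list of audio file paths (both names are hypothetical), could look like this:

```python
import torch
from torch.utils.data import Dataset


class AudioPathDataset(Dataset):
    # Hypothetical helper: wraps a list of audio file paths so that each item
    # exposes its path under "input_values", as the collator above expects
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return {"input_values": self.paths[idx]}


def predict(test_dataloader, model, device=None):
    # Hypothetical helper: runs the model over the dataloader and returns the
    # predicted class index for every example
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    preds = []
    with torch.no_grad():
        for batch in test_dataloader:
            input_values = batch["input_values"].to(device)
            attention_mask = batch["attention_mask"].to(device) if "attention_mask" in batch else None
            logits = model(input_values, attention_mask=attention_mask).logits
            preds.extend(torch.argmax(logits, dim=-1).cpu().tolist())
    return preds


test_dataset = AudioPathDataset(["path/to/audio_1.wav", "path/to/audio_2.wav"])
```

The returned indices can be mapped back to readable labels with `model.config.id2label`.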
|
|
|
|
|
## Training and evaluation data |
|
|
|
The model was trained on the Librispeech-clean-100 subset, with the data split into 70% for training, 10% for validation, and 20% for testing.
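The card does not document exactly how the split was produced, so the following is only a sketch of one way to build it with the `datasets` library (the gender labels themselves come from the LibriSpeech speaker metadata, which this sketch does not cover):

```python
from datasets import load_dataset

# LibriSpeech clean-100 training subset; each example carries a "speaker_id"
# that can be joined against the LibriSpeech SPEAKERS.TXT metadata to obtain
# the gender label used for training
dataset = load_dataset("librispeech_asr", "clean", split="train.100")

# 70% train / 10% validation / 20% test, seeded for reproducibility
split = dataset.train_test_split(test_size=0.30, seed=42)
holdout = split["test"].train_test_split(test_size=2 / 3, seed=42)

train_ds = split["train"]
val_ds = holdout["train"]
test_ds = holdout["test"]
```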
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 3e-05 |
|
- train_batch_size: 4 |
|
- eval_batch_size: 4 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 4 |
|
- total_train_batch_size: 16 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_ratio: 0.1 |
|
- num_epochs: 1 |
|
- mixed_precision_training: Native AMP |
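These settings correspond roughly to the following `transformers` `TrainingArguments` (a sketch, assuming the standard `Trainer` workflow; the output directory is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="weights",            # placeholder output directory
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # effective train batch size of 16
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    fp16=True,                       # native AMP mixed precision
)
```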
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | F1 | |
|
|:-------------:|:-----:|:----:|:---------------:|:------:| |
|
| 0.002 | 1.0 | 1248 | 0.0061 | 0.9993 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.28.0 |
|
- PyTorch 2.0.0+cu118
|
- Tokenizers 0.13.3 |