SER_wav2vec2-large-xlsr-53_240304_fine-tuned_2

This model is a fine-tuned version of hughlan1214/SER_wav2vec2-large-xlsr-53_240304_fine-tuned1.1 on a Speech Emotion Recognition (en) dataset.

The dataset combines four widely used English speech emotion datasets: Crema, Ravdess, Savee, and Tess. Together they contain over 12,000 .wav audio files, and each of the four datasets has 6 to 8 emotion labels.

The model achieves the following results on the evaluation set:

  • Loss: 1.0601
  • Accuracy: 0.6731
  • Precision: 0.6761
  • Recall: 0.6794
  • F1: 0.6738
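The card does not say how precision, recall, and F1 are averaged over the 7 classes. Because the reported recall differs from the accuracy (a support-weighted recall would equal accuracy), macro averaging is a plausible reading, but that is an assumption. The sketch below shows that computation in plain Python; `macro_metrics` is a hypothetical helper, not the author's evaluation code.

```python
def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall and F1.

    Assumption: macro averaging (unweighted mean over classes).
    The model card does not state which averaging was used.
    """
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with two classes (made-up labels, not evaluation data).
toy = macro_metrics(
    y_true=["angry", "angry", "sad", "sad"],
    y_pred=["angry", "sad", "sad", "sad"],
    labels=["angry", "sad"],
)
```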

Model description

The model was obtained by feature extraction with facebook/wav2vec2-large-xlsr-53, followed by several rounds of fine-tuning. It predicts 7 emotions from speech. The goal is to lay a foundation for later combining it with visual micro-expression analysis and LLM-based contextual semantics to infer user emotions in real time.

Although the model was trained on purely English datasets, post-release testing showed that it also performs well in predicting emotions in Chinese and French, demonstrating the powerful cross-linguistic capability of the facebook/wav2vec2-large-xlsr-53 pre-trained model.

emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
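A minimal inference sketch, assuming the checkpoint loads with the Transformers `audio-classification` pipeline and that audio decoding (ffmpeg) is available locally. The repository name in `MODEL_ID` is taken from the hosting page and may need adjusting; `best_prediction` and `classify_file` are hypothetical helper names. The pipeline reads label names from the model config, so the sketch does not rely on the order of the list above.

```python
# Assumption: repository id as shown on the hosting page; adjust if it differs.
MODEL_ID = (
    "hughlan1214/"
    "Speech_Emotion_Recognition_wav2vec2-large-xlsr-53_240304_SER_fine-tuned2.0"
)

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def best_prediction(predictions):
    """Return the highest-scoring {"label", "score"} entry from pipeline output."""
    return max(predictions, key=lambda item: item["score"])


def classify_file(path, model_id=MODEL_ID):
    """Classify one audio file and return the top emotion with its score.

    The import is local so that this module can be loaded without
    Transformers installed; calling the function downloads the model.
    """
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    predictions = classifier(path, top_k=len(EMOTIONS))
    return best_prediction(predictions)


# Hypothetical usage (requires network access and a local .wav file):
# result = classify_file("sample.wav")
# print(result["label"], result["score"])
```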

Intended uses & limitations

More information needed

Training and evaluation data

The entire dataset was split 70/30 into training and evaluation sets.
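One way such a split could be produced is sketched below. The card does not say whether the split was stratified by emotion or which seed it used; this sketch assumes a plain shuffled split and reuses 42 (the training seed listed under the hyperparameters) purely for illustration. `split_70_30` is a hypothetical helper.

```python
import random


def split_70_30(items, train_fraction=0.7, seed=42):
    """Shuffle items reproducibly and split them into (train, eval) lists.

    Assumption: a simple random split, not stratified by label.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy example with placeholder file names (not the real dataset).
files = [f"clip_{i:03d}.wav" for i in range(100)]
train_files, eval_files = split_70_30(files)
```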

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 10
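To make the scheduler settings concrete, the sketch below computes the learning rate at a given step for a linear warmup followed by a half-cosine decay, which is the usual shape of a cosine schedule with warmup. The total of 10480 steps comes from the training results table; the exact warmup rounding used by the training library may differ by a step, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math

BASE_LR = 5e-05        # learning_rate
WARMUP_RATIO = 0.1     # lr_scheduler_warmup_ratio
TOTAL_STEPS = 10480    # 10 epochs x 1048 steps (from the results table)
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)  # about 1048 steps


def lr_at(step):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to BASE_LR, then cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Learning rate at the end of each epoch.
for epoch in range(1, 11):
    print(epoch, f"{lr_at(epoch * 1048):.2e}")
```

Under these assumptions the warmup covers roughly the first epoch, and the rate decays to zero by the final step.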

Training results

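| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Precision | Recall | F1     |
|---------------|-------|-------|-----------------|----------|-----------|--------|--------|
| 0.8904        | 1.0   | 1048  | 1.1923          | 0.5773   | 0.6162    | 0.5563 | 0.5494 |
| 1.1394        | 2.0   | 2096  | 1.0143          | 0.6071   | 0.6481    | 0.6189 | 0.6057 |
| 0.9373        | 3.0   | 3144  | 1.0585          | 0.6126   | 0.6296    | 0.6254 | 0.6119 |
| 0.7405        | 4.0   | 4192  | 0.9580          | 0.6514   | 0.6732    | 0.6562 | 0.6576 |
| 1.1638        | 5.0   | 5240  | 0.9940          | 0.6486   | 0.6485    | 0.6627 | 0.6435 |
| 0.6741        | 6.0   | 6288  | 1.0307          | 0.6628   | 0.6710    | 0.6711 | 0.6646 |
| 0.6040        | 7.0   | 7336  | 1.0248          | 0.6667   | 0.6678    | 0.6751 | 0.6682 |
| 0.6835        | 8.0   | 8384  | 1.0396          | 0.6722   | 0.6803    | 0.6790 | 0.6743 |
| 0.5421        | 9.0   | 9432  | 1.0493          | 0.6714   | 0.6765    | 0.6785 | 0.6736 |
| 0.5728        | 10.0  | 10480 | 1.0601          | 0.6731   | 0.6761    | 0.6794 | 0.6738 |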
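The per-epoch numbers can be compared programmatically. The short sketch below copies the validation loss, accuracy, and F1 columns from the results above and reports which epoch is best by each criterion; it is only an illustration of reading the table, not part of the training procedure.

```python
# (epoch, validation_loss, accuracy, f1), copied from the training results.
RESULTS = [
    (1, 1.1923, 0.5773, 0.5494),
    (2, 1.0143, 0.6071, 0.6057),
    (3, 1.0585, 0.6126, 0.6119),
    (4, 0.9580, 0.6514, 0.6576),
    (5, 0.9940, 0.6486, 0.6435),
    (6, 1.0307, 0.6628, 0.6646),
    (7, 1.0248, 0.6667, 0.6682),
    (8, 1.0396, 0.6722, 0.6743),
    (9, 1.0493, 0.6714, 0.6736),
    (10, 1.0601, 0.6731, 0.6738),
]

best_loss_epoch = min(RESULTS, key=lambda row: row[1])[0]
best_accuracy_epoch = max(RESULTS, key=lambda row: row[2])[0]
best_f1_epoch = max(RESULTS, key=lambda row: row[3])[0]

print("lowest validation loss: epoch", best_loss_epoch)
print("highest accuracy: epoch", best_accuracy_epoch)
print("highest F1: epoch", best_f1_epoch)
```

Validation loss is lowest at epoch 4, while accuracy and F1 keep improving slightly through epochs 8 to 10, so the preferred checkpoint depends on which metric matters for the application.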

Framework versions

  • Transformers 4.38.1
  • PyTorch 2.2.1
  • Datasets 2.17.1
  • Tokenizers 0.15.2

Model size: 316M params (Safetensors, tensor type F32)