Model Card for Emotion Classification from Voice
This model classifies emotions in speech audio using a fine-tuned Wav2Vec2Model from Facebook (facebook/wav2vec2-base).
It predicts one of seven emotion labels: Angry, Disgust, Fear, Happy, Neutral, Sad, or Surprise.
Model Details
- Developed by: Lingesh
- Model type: Fine-tuned Wav2Vec2Model
- Language(s): English (en), Tamil (ta), French (fr), Malayalam (ml)
- Finetuned from model: facebook/wav2vec2-base
Uses
Direct Use
This model can be used directly for emotion detection in speech audio files, with potential applications in call centers, virtual assistants, and mental health monitoring.
Out-of-Scope Use
The model is not intended for general speech recognition or other NLP tasks outside emotion classification.
Datasets Used
The model has been trained on a combination of the following datasets:
- CREMA-D: 7,442 clips of actors speaking with various emotions
- Torrento: Emotional speech in Spanish, captured from various environments
- RAVDESS: 24 professional actors, 7 emotions
- Emo-DB: 535 utterances, covering 7 emotions
Combining these datasets is intended to help the model generalize across multiple languages and accents.
Bias, Risks, and Limitations
- Bias: The model might underperform on speech data with accents or languages not present in the training data.
- Limitations: The model is trained specifically for emotion detection and might not generalize well for other speech tasks.
How to Get Started with the Model
import numpy as np
import torch
import torchaudio
from torchaudio.transforms import Resample
from transformers import Wav2Vec2Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Base encoder; the fine-tuned weights are loaded on top of it in predict()
wav2vec2_model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-base", output_hidden_states=True
).to(device)


class FineTunedWav2Vec2Model(torch.nn.Module):
    """Wav2Vec2 encoder followed by a linear emotion classifier."""

    def __init__(self, wav2vec2_model, output_size):
        super().__init__()
        self.wav2vec2 = wav2vec2_model
        self.fc = torch.nn.Linear(self.wav2vec2.config.hidden_size, output_size)

    def forward(self, x):
        # The model runs in double precision
        self.wav2vec2 = self.wav2vec2.double()
        self.fc = self.fc.double()
        outputs = self.wav2vec2(x.double())
        # Last hidden layer, shape (batch, frames, hidden_size)
        out = outputs.hidden_states[-1]
        # Classify from the features of the first frame
        out = self.fc(out[:, 0, :])
        return out


def preprocess_audio(audio):
    """Convert a (sample_rate, waveform) pair to a mono 16 kHz float tensor."""
    sample_rate, waveform = audio
    if isinstance(waveform, np.ndarray):
        waveform = torch.from_numpy(waveform)

    # Normalize integer PCM to [-1, 1] first; mean() below needs a float tensor
    if not torch.is_floating_point(waveform):
        waveform = waveform.float() / torch.iinfo(waveform.dtype).max
    waveform = waveform.float()

    # Downmix to mono (assumes a channels-first layout: (channels, samples))
    if waveform.dim() == 2:
        waveform = waveform.mean(dim=0)

    # Resample to 16kHz
    if sample_rate != 16000:
        resampler = Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)
    return waveform


def predict(audio):
    model_path = "model.pth"  # Path to your fine-tuned model
    model = FineTunedWav2Vec2Model(wav2vec2_model, 7).to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    waveform = preprocess_audio(audio)
    waveform = waveform.unsqueeze(0).to(device)  # Add a batch dimension

    with torch.no_grad():
        output = model(waveform)

    predicted_label = torch.argmax(output, dim=1).item()
    emotion_labels = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
    return emotion_labels[predicted_label]


# Example usage. "speech.wav" is a placeholder path; replace with your actual audio data
waveform, sample_rate = torchaudio.load("speech.wav")  # waveform: (channels, samples)
audio_data = (sample_rate, waveform)
emotion = predict(audio_data)
print(f"Predicted Emotion: {emotion}")
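The predict function returns only the top label. When a confidence value is also useful, the seven raw scores can be converted to probabilities with a softmax. The following is a minimal, dependency-free sketch: scores_to_probabilities is a hypothetical helper that is not part of the original code, and the example scores are made up. With PyTorch, torch.softmax(output, dim=1) applies the same operation to the model output.

```python
import math

EMOTION_LABELS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]


def scores_to_probabilities(scores):
    """Softmax over raw class scores, returned as a {label: probability} dict."""
    if len(scores) != len(EMOTION_LABELS):
        raise ValueError("expected one score per emotion label")
    # Subtract the maximum score for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(EMOTION_LABELS, exps)}


# Made-up scores, for illustration only
probs = scores_to_probabilities([0.2, -1.0, 0.1, 2.5, 0.3, -0.4, 0.0])
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```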
Training Procedure
- Preprocessing: Resampled all audio to 16kHz.
- Training: Fine-tuned facebook/wav2vec2-base with emotion labels.
- Hyperparameters: Batch size: 16, Learning rate: 5e-5, Epochs: 50
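The hyperparameters above could be wired into a training loop roughly as follows. This is a sketch under assumptions, not the author's training script: the AdamW optimizer, the cross-entropy loss, and the tiny stand-in model and random data are all hypothetical, chosen so that the example runs without downloading anything. In practice the model would be the fine-tuned Wav2Vec2 classifier and the loader would yield 16kHz waveforms with emotion labels.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 16       # from the model card
LEARNING_RATE = 5e-5  # from the model card
EPOCHS = 50           # from the model card


def train(model, loader, epochs=EPOCHS, lr=LEARNING_RATE):
    """Generic supervised loop; the optimizer and loss are assumptions."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    history = []
    for _ in range(epochs):
        running = 0.0
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        # Mean batch loss for this epoch
        history.append(running / len(loader))
    return history


# Stand-in data and model so the sketch runs on its own (hypothetical)
torch.manual_seed(0)
features = torch.randn(64, 32)
labels = torch.randint(0, 7, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=BATCH_SIZE, shuffle=True)
toy_model = torch.nn.Linear(32, 7)
losses = train(toy_model, loader, epochs=3)  # 3 epochs keeps the demo fast
print(losses)
```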
Evaluation
Testing Data
Evaluation was performed on a held-out test set from the CREMA-D and RAVDESS datasets.
Metrics
- Accuracy: 85%
- F1-score: 82% (weighted average across all classes)
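For reference, accuracy and a weighted-average F1-score can be computed from true and predicted labels as sketched below. This is a plain-Python illustration with made-up labels, not the evaluation script behind the reported numbers; accuracy and weighted_f1 are hypothetical helper names. scikit-learn's f1_score(y_true, y_pred, average="weighted") computes the same quantity.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is nonzero because count > 0
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * count
    return total / len(y_true)


# Made-up labels, for illustration only
y_true = ["Happy", "Happy", "Sad", "Angry"]
y_pred = ["Happy", "Sad", "Sad", "Angry"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```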