---
language:
- en
- ta
- fr
- ml
- hi
pipeline_tag: voice-activity-detection
base_model: facebook/wav2vec2-base
---

# Model Card for Emotion Classification from Voice

This model performs emotion classification from voice data using a fine-tuned `Wav2Vec2Model` from Facebook. It predicts one of seven emotion labels: Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise.

## Model Details

- **Developed by:** Lingesh
- **Model type:** Fine-tuned Wav2Vec2Model
- **Language(s):** English (en), Tamil (ta), French (fr), Malayalam (ml)
- **Finetuned from model:** [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)

### Model Sources

- **Repository:** https://github.com/githubLINGESH/SpeechEmo_Recogintion/

## Uses

### Direct Use

This model can be used directly for emotion detection in speech audio files, with applications in call centers, virtual assistants, and mental health monitoring.

### Out-of-Scope Use

The model is not intended for general speech recognition or for NLP tasks other than emotion classification.

## Datasets Used

The model was trained on a combination of the following datasets:

- **CREMA-D:** 7,442 clips of actors speaking with various emotions
- **Torrento:** Emotional speech in Spanish, captured from various environments
- **RAVDESS:** 24 professional actors, 7 emotions
- **Emo-DB:** 535 utterances, covering 7 emotions

The combination of these datasets allows the model to generalize across multiple languages and accents.

## Bias, Risks, and Limitations

- **Bias:** The model might underperform on speech with accents or languages not present in the training data.
- **Limitations:** The model is trained specifically for emotion detection and might not generalize well to other speech tasks.

## How to Get Started with the Model

```python
import numpy as np
import torch
from torchaudio.transforms import Resample
from transformers import Wav2Vec2Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav2vec2_model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-base", output_hidden_states=True
).to(device)


class FineTunedWav2Vec2Model(torch.nn.Module):
    """Wav2Vec2 encoder followed by a linear emotion classifier."""

    def __init__(self, wav2vec2_model, output_size):
        super().__init__()
        self.wav2vec2 = wav2vec2_model
        self.fc = torch.nn.Linear(self.wav2vec2.config.hidden_size, output_size)
        # The forward pass runs in float64; cast once here rather than on every call.
        self.double()

    def forward(self, x):
        outputs = self.wav2vec2(x.double())
        out = outputs.hidden_states[-1]  # last layer: (batch, frames, hidden)
        return self.fc(out[:, 0, :])  # classify from the first time frame


def preprocess_audio(audio):
    """Convert a (sample_rate, waveform) pair to a mono float32 tensor at 16 kHz."""
    sample_rate, waveform = audio
    if isinstance(waveform, np.ndarray):
        waveform = torch.from_numpy(waveform)

    # Convert to float32 first: integer PCM is scaled to [-1, 1], and
    # torch.mean is not defined for integer tensors.
    if waveform.is_floating_point():
        waveform = waveform.float()
    else:
        waveform = waveform.float() / torch.iinfo(waveform.dtype).max

    # Downmix to mono. Assumption: the channel axis is the shorter one, which
    # covers both (channels, samples) and (samples, channels) layouts.
    if waveform.dim() == 2:
        channel_dim = 0 if waveform.shape[0] <= waveform.shape[1] else 1
        waveform = waveform.mean(dim=channel_dim)

    # Resample to 16 kHz, the rate wav2vec2-base expects
    if sample_rate != 16000:
        resampler = Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    return waveform


def predict(audio):
    model_path = "model.pth"  # Path to your fine-tuned model
    model = FineTunedWav2Vec2Model(wav2vec2_model, 7).to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    waveform = preprocess_audio(audio)
    waveform = waveform.unsqueeze(0).to(device)  # add a batch dimension

    with torch.no_grad():
        output = model(waveform)

    predicted_label = torch.argmax(output, dim=1).item()
    emotion_labels = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
    return emotion_labels[predicted_label]


# Example usage. One second of silence is used as a placeholder;
# replace it with your actual (sample_rate, waveform) audio data.
sample_rate = 16000
waveform = np.zeros(sample_rate, dtype=np.float32)
audio_data = (sample_rate, waveform)

emotion = predict(audio_data)
print(f"Predicted Emotion: {emotion}")
```

## Training Procedure

- Preprocessing: Resampled all audio to 16 kHz.
- Training: Fine-tuned facebook/wav2vec2-base with emotion labels.
- Hyperparameters: Batch size: 16, Learning rate: 5e-5, Epochs: 50

## Evaluation

### Testing Data

Evaluation was performed on a held-out test set from the CREMA-D and RAVDESS datasets.

## Metrics

- Accuracy: 85%
- F1-score: 82% (weighted average across all classes)
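
For readers unfamiliar with the weighted F1-score, the sketch below shows how accuracy and a support-weighted F1 can be computed from lists of true and predicted labels. It is an illustration only: the helper names `accuracy` and `weighted_f1` and the toy labels are hypothetical, and the card does not state which tool produced the reported figures.

```python
from collections import Counter

# Label set taken from the model card.
EMOTION_LABELS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred, labels=EMOTION_LABELS):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 is taken as 0 for absent classes
        score += f1 * support[label] / total
    return score


# Toy data (not from the model's test set).
y_true = ["Angry", "Angry", "Happy", "Sad"]
y_pred = ["Angry", "Happy", "Happy", "Sad"]

print(f"Accuracy: {accuracy(y_true, y_pred):.2f}")        # 0.75
print(f"Weighted F1: {weighted_f1(y_true, y_pred):.2f}")  # 0.75
```

Because each class contributes in proportion to its number of true samples, the weighted F1 can differ from accuracy when classes are imbalanced, which may explain why the two reported figures are not identical.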