classla/wav2vec2-large-slavic-voxpopuli-v2_hr_SER

This model for Croatian speech emotion recognition (SER) is based on facebook/wav2vec2-large-slavic-voxpopuli-v2 and was fine-tuned on the CrES 2.1 dataset (Croatian Emotional Speech corpus).

If you use this model, please cite the following paper describing the dataset:

@inproceedings{Dropuljić_Chmura_Kolak_Petrinović_2011,
  title={Emotional speech corpus of Croatian language},
  ISSN={1845-5921},
  booktitle={2011 7th International Symposium on Image and Signal Processing and Analysis (ISPA)},
  author={Dropuljić, Branimir and Chmura, Miłosz Thomasz and Kolak, Antonio and Petrinović, Davor},
  year={2011},
  month={Sep},
  pages={95–100}
}

Metrics

Evaluation is performed on the dev and test portions of the CrES 2.1 dataset. The dataset was split anew, stratified by emotion and without speaker leakage (i.e. no speaker appears in more than one split).
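
The no-leakage condition can be verified mechanically. The following is a minimal sketch, not the procedure used to build the splits: the function name `find_speaker_leakage`, the speaker IDs, and the emotion labels are all hypothetical and serve only to illustrate the check.

```python
from collections import defaultdict


def find_speaker_leakage(splits):
    """Return the speakers that occur in more than one split.

    `splits` maps a split name to a list of (speaker_id, emotion) pairs.
    The result maps each leaking speaker to the sorted split names.
    """
    seen = defaultdict(set)
    for split_name, utterances in splits.items():
        for speaker_id, _emotion in utterances:
            seen[speaker_id].add(split_name)
    return {spk: sorted(names) for spk, names in seen.items() if len(names) > 1}


# Hypothetical toy data, not taken from CrES 2.1.
toy_splits = {
    "train": [("spk1", "anger"), ("spk1", "joy"), ("spk2", "sadness")],
    "dev": [("spk3", "anger"), ("spk3", "joy")],
    "test": [("spk4", "sadness"), ("spk4", "anger")],
}

# An empty dict means that no speaker is shared between splits.
leaks = find_speaker_leakage(toy_splits)
print(leaks)
```

A non-empty result lists each offending speaker together with the splits it appears in, which makes it easy to see where a split has to be repaired.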

accuracy	macro F1	split
0.6796	0.6461	test
0.7277	0.7232	dev
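
For readers unfamiliar with the two metrics, the sketch below shows how accuracy and macro F1 are defined. It is a plain-Python illustration on hypothetical labels, not the evaluation script behind the numbers above; the function names and the toy labels `"a"` and `"b"` are assumptions made for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels, not model output.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Because macro F1 averages over classes without weighting by class frequency, it penalizes poor performance on rare emotions more strongly than accuracy does.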

Confusion matrix on test:

Training hyperparameters

The following arguments were used for fine-tuning:

arg	value
`per_device_train_batch_size`	2
`per_device_eval_batch_size`	2
`gradient_accumulation_steps`	2
`num_train_epochs`	20
`learning_rate`	1e-4
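
The argument names above match those of `TrainingArguments` in the Hugging Face `transformers` library, so they can be collected in a dictionary and passed on as keyword arguments. The sketch below only records the values and derives the effective training batch size; the helper `effective_train_batch_size` and the `num_devices` parameter are assumptions for illustration, since the number of devices used for fine-tuning is not stated.

```python
# Values copied from the table above.
finetuning_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 20,
    "learning_rate": 1e-4,
}


def effective_train_batch_size(args, num_devices=1):
    """Samples contributing to one optimizer step.

    `num_devices` is an assumption; the device count is not given above.
    """
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * num_devices
    )


# With a single device, each optimizer step sees 2 * 2 = 4 samples.
print(effective_train_batch_size(finetuning_args))
```

Gradient accumulation is a common way to reach a larger effective batch size when a large model fits only a few audio samples per device.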