Model Card for LatAm Accent Determination
A Wav2Vec2 model that classifies Spanish-language audio by the speaker's accent: Puerto Rican, Colombian, Venezuelan, Peruvian, or Chilean
Table of Contents
- Model Card for LatAm Accent Determination
- Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Model Examination
- Technical Specifications
- Citation
- Model Card Authors
- Model Card Contact
- How to Get Started with the Model
Model Details
Model Description
A Wav2Vec2 model that classifies Spanish-language audio by the speaker's accent: Puerto Rican, Colombian, Venezuelan, Peruvian, or Chilean
- Developed by: Henry Savich
- Shared by: Henry Savich
- Model type: Audio classification model
- Language(s) (NLP): es
- License: openrail
- Parent Model: Wav2Vec2 Base
- Resources for more information:
Uses
Direct Use
Classify an audio clip as Puerto Rican, Peruvian, Venezuelan, Colombian, or Chilean Spanish
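A minimal sketch of the input and output conventions this implies. The 16 kHz sample rate is what Wav2Vec2 Base expects; the fixed 5-second clip length matches the training data described below. The label list and function names are illustrative, not the model's actual `id2label` mapping.

```python
# Assumptions: 16 kHz audio (Wav2Vec2 Base's expected rate), fixed
# 5-second clips as in training; the label order here is hypothetical.
SAMPLE_RATE = 16_000
CLIP_SECONDS = 5
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS  # 80_000 samples

LABELS = ["puerto_rican", "colombian", "venezuelan", "peruvian", "chilean"]

def prepare_clip(waveform):
    """Zero-pad or truncate a 1-D waveform (list of floats) to 5 seconds."""
    if len(waveform) >= CLIP_SAMPLES:
        return waveform[:CLIP_SAMPLES]
    return waveform + [0.0] * (CLIP_SAMPLES - len(waveform))

def predict_label(logits):
    """Map the classifier's per-class logits to an accent label (argmax)."""
    return LABELS[max(range(len(logits)), key=lambda i: logits[i])]
```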
Out-of-Scope Use
The model was trained on speakers reciting pre-chosen sentences, so it captures no knowledge of lexical differences between dialects.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
Training Details
Training Data
OpenSLR 71, 72, 73, 74, 75, 76 (crowdsourced Chilean, Colombian, Peruvian, Puerto Rican, and Venezuelan Spanish speech corpora, plus Basque)
Training Procedure
Preprocessing
Data was train-test split by speaker, to prevent the model from achieving high test accuracy simply by recognizing individual voices.
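The speaker-level split can be sketched as follows. This is an illustration of the idea, not the card's actual preprocessing code; the speaker-id format is hypothetical.

```python
import random

def split_by_speaker(clips, test_fraction=0.2, seed=0):
    """Split (speaker_id, clip) pairs so no speaker appears in both sets.

    `clips` is a list of (speaker_id, clip) tuples. Splitting on the
    speaker id, rather than on clips directly, guarantees the test set
    contains only voices the model has never heard.
    """
    speakers = sorted({spk for spk, _ in clips})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [c for c in clips if c[0] not in test_speakers]
    test = [c for c in clips if c[0] in test_speakers]
    return train, test
```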
Speeds, Sizes, Times
Trained on ~3000 5-second audio clips. Training is lightweight, taking under 1 hour on Google Colaboratory premium GPUs.
Evaluation
Testing Data, Factors & Metrics
Testing Data
OpenSLR 71, 72, 73, 74, 75, 76 (https://huggingface.co/datasets/openslr)
Factors
Audio quality: the training and testing data are higher quality than can be expected from found audio, so real-world accuracy may be lower.
Metrics
Accuracy
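Accuracy here is the standard fraction of test clips whose predicted accent matches the true label; a trivial sketch:

```python
def accuracy(predicted, actual):
    """Fraction of clips whose predicted accent matches the true label."""
    assert len(predicted) == len(actual) and actual, "label lists must match"
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```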
Results
~85% accuracy, depending on the random train-test split
Model Examination
Even splitting on speakers, the model achieves high accuracy on the test set. This is interesting because it suggests that accent classification, at least at this granularity, is an easier task than voice identification, which could just as easily have satisfied the training objective.
The confusion matrix shows that Basque is the most easily distinguished class, which is to be expected, as it is the only language in the data that is not Spanish. Puerto Rican was the hardest accent to identify in the test set, but this likely has more to do with Puerto Rico contributing the least data than with the accent itself.
With a dataset of the same size but more speakers (and therefore less opportunity to fit individual voices), near-perfect accuracy seems plausible.
Technical Specifications
Model Architecture and Objective
Wav2Vec2
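For sequence classification, a head on top of the Wav2Vec2 encoder typically mean-pools the frame-level features over time and applies a linear layer to produce one logit per accent class. A toy pure-Python sketch of that head (dimensions are illustrative, not the model's real hidden size):

```python
def mean_pool(features):
    """features: T x H list of frame vectors -> length-H pooled vector."""
    T, H = len(features), len(features[0])
    return [sum(frame[j] for frame in features) / T for j in range(H)]

def classification_head(pooled, weights, bias):
    """weights: C x H, bias: length C -> length-C logits (one per class)."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]
```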
Compute Infrastructure
Google Colaboratory Pro+
Hardware
Google Colaboratory Pro+ premium GPUs
Software
PyTorch via Hugging Face Transformers
Model Card Authors
Henry Savich
Model Card Contact