---
license: apache-2.0
base_model: facebook/hubert-base-ls960
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: hubert-finetuned-animals
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# hubert-finetuned-animals

`hubert-finetuned-animals` is a fine-tuned version of `facebook/hubert-base-ls960` for animal sound classification. It was trained on the animal categories of the ESC-50 dataset to identify animal sounds in audio clips.
It achieves the following results on the evaluation set:
- Loss: 0.5596
- Accuracy: 0.95

## Model description

HuBERT, originally pre-trained on large amounts of unlabelled audio, has been fine-tuned here for the downstream task of animal sound classification. Fine-tuning lets the model specialize in recognizing distinct animal sounds, such as those of dogs, cats, and birds, which can be useful in applications such as bioacoustic monitoring, educational tools, and interactive wildlife conservation projects.

## Intended uses & limitations

This model is intended for the classification of specific animal sounds within audio clips. It can be used in software applications related to wildlife research, educational content related to animals, or for entertainment purposes where animal sound recognition is needed.

### Limitations

Although the model reaches high accuracy on its evaluation set, it was trained on a limited set of categories from the ESC-50 dataset and does not cover all possible animal sounds. Performance can vary significantly with audio quality, background noise, and animal sound variations not represented in the training data.

## Training and evaluation data

The model was fine-tuned on a subset of the ESC-50 dataset, which is a publicly available collection designed for environmental sound classification tasks. This subset specifically includes only the categories relevant to animal sounds. Each category in the dataset contains 40 examples, providing a diverse set of samples for model training and evaluation.
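
As a rough sketch of how such a subset could be selected, the snippet below filters ESC-50-style metadata rows down to the animal classes. It assumes the usual ESC-50 metadata columns (`filename`, `fold`, `target`, `category`) and that the animal classes occupy target IDs 0 to 9. The helper name `select_animal_rows` and the sample rows are hypothetical and are not taken from the training code.

```python
import csv
import io

# Assumption: in ESC-50 the ten animal classes occupy target IDs 0-9.
ANIMAL_TARGETS = set(range(10))

def select_animal_rows(csv_text):
    """Return the metadata rows whose target ID is an animal class."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["target"]) in ANIMAL_TARGETS]

# Tiny illustrative metadata sample (made-up filenames, not real ESC-50 rows).
sample_csv = """filename,fold,target,category
clip_a.wav,1,0,dog
clip_b.wav,1,5,cat
clip_c.wav,2,10,rain
clip_d.wav,3,9,crow
"""

animal_rows = select_animal_rows(sample_csv)
print([row["category"] for row in animal_rows])
```

With the real metadata file, the same function could be applied to the file's contents read as text.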

## Training procedure

The model was fine-tuned using the following procedure:

1. Preprocessing: Audio files were converted into spectrograms.
2. Data Split: The data was split into 70% training, 20% test, and 10% validation sets.
3. Fine-tuning: The model was fine-tuned for 10 epochs on the training set.
4. Evaluation: The model's performance was evaluated on the validation set after each epoch to monitor improvement and detect overfitting.
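
The data split in step 2 can be illustrated with a minimal per-class (stratified) split. This is only a sketch: whether the original split was stratified is an assumption, and the function name `stratified_split`, the seed, and the synthetic label list are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=42):
    """Split example indices into train/test/validation lists, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test, val = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_test = round(len(indices) * fractions[1])
        train += indices[:n_train]
        test += indices[n_train:n_train + n_test]
        val += indices[n_train + n_test:]  # the remainder goes to validation
    return train, test, val

# Hypothetical label list: 10 classes with 40 clips each.
labels = [class_id for class_id in range(10) for _ in range(40)]
train_idx, test_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), len(val_idx))
```

Splitting within each class keeps the class balance identical across the three sets, which matters when each class has only 40 clips.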

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
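
To make the `linear` scheduler with `lr_scheduler_warmup_ratio: 0.1` concrete, here is a small sketch of the learning rate as a function of the optimizer step. It assumes 450 total steps (10 epochs of 45 steps, as in the results table) and the common linear-warmup, linear-decay shape. The function `linear_warmup_lr` is a hypothetical helper written for illustration, not the Trainer's own code.

```python
import math

def linear_warmup_lr(step, total_steps=450, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 45, 225, 450):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the peak rate of 5e-05 is reached at the end of the first epoch and then decays for the remaining nine.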

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 2.1934        | 1.0   | 45   | 2.1765          | 0.3      |
| 2.0239        | 2.0   | 90   | 1.8169          | 0.45     |
| 1.7745        | 3.0   | 135  | 1.4817          | 0.65     |
| 1.3787        | 4.0   | 180  | 1.2497          | 0.75     |
| 1.2168        | 5.0   | 225  | 1.0048          | 0.85     |
| 1.0359        | 6.0   | 270  | 0.9969          | 0.775    |
| 0.7983        | 7.0   | 315  | 0.7467          | 0.9      |
| 0.7466        | 8.0   | 360  | 0.7698          | 0.85     |
| 0.6284        | 9.0   | 405  | 0.6097          | 0.9      |
| 0.8365        | 10.0  | 450  | 0.5596          | 0.95     |


### Framework versions

- Transformers 4.33.2
- Pytorch 2.0.1+cu118
- Datasets 2.14.5
- Tokenizers 0.13.3

### GitHub Repository

[Animal Sound Classification](https://github.com/rawbeen248/audio_classification_finetuning)


### To try it locally

```python
import librosa
import torch
from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

# Load the fine-tuned model and feature extractor
model_name = "ardneebwar/wav2vec2-animal-sounds-finetuned-hubert-finetuned-animals"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = HubertForSequenceClassification.from_pretrained(model_name)

# Prepare the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # Set the model to evaluation mode

# Function to predict the class of an audio file
def predict_audio_class(audio_file, feature_extractor, model, device):
    # Load and preprocess the audio file
    speech, sr = librosa.load(audio_file, sr=16000)
    input_values = feature_extractor(speech, return_tensors="pt", sampling_rate=16000).input_values
    input_values = input_values.to(device)

    # Predict
    with torch.no_grad():
        logits = model(input_values).logits

    # Get the predicted class ID
    predicted_id = torch.argmax(logits, dim=-1)
    # Convert the predicted ID to the class name
    predicted_class = model.config.id2label[predicted_id.item()]
    
    return predicted_class

# Replace 'path_to_audio_file.wav' with the actual path to your audio file
audio_file_path = "path_to_audio_file.wav"
predicted_class = predict_audio_class(audio_file_path, feature_extractor, model, device)
print(f"Predicted class: {predicted_class}")

```
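
If a ranked list of candidates is more useful than a single label, the logits can be turned into probabilities with a softmax. The sketch below does this on plain Python lists so it runs without a model; with the snippet above, `logits[0].tolist()` and `model.config.id2label` could be passed in instead. The function name `top_k_predictions`, the label map, and the logit values are hypothetical.

```python
import math

def top_k_predictions(logits, id2label, k=3):
    """Convert raw logits to softmax probabilities and return the k best labels."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical label map and logits, for illustration only.
example_id2label = {0: "dog", 1: "cat", 2: "crow"}
example_logits = [2.0, 0.5, -1.0]
for label, prob in top_k_predictions(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```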