---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
tags:
- generated_from_trainer
datasets:
- common_voice_13_0
metrics:
- wer
model-index:
- name: wav2vec2-large-xlsr-mvc-swahili
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_13_0
      type: common_voice_13_0
      config: sw
      split: test
      args: sw
    metrics:
    - name: Wer
      type: wer
      value: 0.2
language:
- sw
---


# wav2vec2-large-xlsr-mvc-swahili

This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) for Swahili automatic speech recognition, trained on the Common Voice 13.0 Swahili (`sw`) data. It achieves a WER of 0.20 on the Common Voice 13.0 Swahili test split.
<!--Following inspiration from [alamsher/wav2vec2-large-xlsr-53-common-voice-sw](https://huggingface.co/alamsher/wav2vec2-large-xlsr-53-common-voice-sw)-->

# How to use the model

There was an issue with the vocabulary: it appears that some special characters were included and were not accounted for during training, so transcriptions may occasionally contain stray tokens (a way to inspect the vocabulary is sketched after the example below). To run inference, you could try:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

repo_name = "eddiegulay/wav2vec2-large-xlsr-mvc-swahili"
processor = AutoProcessor.from_pretrained(repo_name)
model = AutoModelForCTC.from_pretrained(repo_name)

# Move the model to GPU if one is available, otherwise stay on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)


def transcribe(audio_path):
    # Load the audio file and resample it to 16 kHz, the rate the model expects
    audio_input, sample_rate = torchaudio.load(audio_path)
    target_sample_rate = 16000
    audio_input = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)(audio_input)

    # Preprocess the audio data
    input_dict = processor(audio_input[0], return_tensors="pt", padding=True, sampling_rate=target_sample_rate)

    # Perform inference and decode the predicted token IDs
    logits = model(input_dict.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    transcription = processor.decode(pred_ids)

    return transcription


transcript = transcribe("your_audio.mp3")
print(transcript)
```
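
If the vocabulary issue mentioned above shows up in your transcriptions, a quick way to check is to print the CTC tokenizer's vocabulary and look for unexpected special characters. A minimal sketch, reusing the `processor` loaded above (nothing here is specific to this model beyond that):

```python
# List the tokens in the CTC vocabulary, ordered by token ID,
# to spot any special characters that may have slipped in during training.
vocab = processor.tokenizer.get_vocab()
for token, token_id in sorted(vocab.items(), key=lambda item: item[1]):
    print(token_id, repr(token))
```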
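
If you want to measure a word error rate on your own recordings (the 0.20 reported above is for the Common Voice 13.0 Swahili test split), one option is the Hugging Face `evaluate` library. This is only a sketch; the audio path and reference string are placeholders to replace with your own data:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder data: swap in your own audio files and ground-truth transcripts.
references = ["habari ya asubuhi"]
predictions = [transcribe("your_audio.mp3")]

print("WER:", wer_metric.compute(references=references, predictions=predictions))
```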