---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
language:
- en
metrics:
- wer
library_name: transformers
model-index:
- name: SpeechLLM
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 12.3
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 18.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 25.01
---

# SpeechLLM

## Usage

```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```

## Checkpoint Result

| Dataset                | Word Error Rate (%) | Gender Accuracy (%) |
|:----------------------:|:-------------------:|:-------------------:|
| librispeech-test-clean | 12.30               | 87.78               |
| librispeech-test-other | 18.90               | 89.08               |
| CommonVoice test       | 25.01               | 87.53               |
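
For readers unfamiliar with the metric reported above, word error rate is the word-level edit distance between a reference transcript and the model's transcript, divided by the number of reference words. The sketch below is only an illustration of that definition: the function name `word_error_rate` is hypothetical, and the lowercase-and-split normalization is an assumption, since this card does not state how the evaluation text was normalized or which tool produced the numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized by lowercasing and whitespace splitting
    only. The normalization used for the reported results is not stated.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution and one deletion over six reference words.
    print(f"WER: {100 * word_error_rate(ref, hyp):.2f}%")
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the table. Because the value is normalized by the reference length, it can exceed 100% when the hypothesis contains many extra words.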