File size: 4,356 Bytes
8d28ba8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
---
license: agpl-3.0
language:
- en
- zh
- ja
- ko
base_model: lovemefan/SenseVoice-onnx
tags:
- rknn
---

# SenseVoiceSmall-RKNN2

### (English README see below)

SenseVoice是具有音频理解能力的音频基础模型, 包括语音识别(ASR)、语种识别(LID)、语音情感识别(SER)和声学事件分类(AEC)或声学事件检测(AED)。

当前SenseVoice-small支持中、粤、英、日、韩语的多语言语音识别,情感识别和事件检测能力,具有极低的推理延迟。 

- 推理速度(RKNN2):RK3588上单核NPU推理速度约20倍 (每秒识别20秒的音频), 比官方rknn-model-zoo中提供的whisper约快6倍.
- 内存占用(RKNN2):约1.1GB

## 使用方法

1. 克隆项目到本地

2. 安装依赖

```bash
pip install kaldi_native_fbank onnxruntime sentencepiece soundfile pyyaml numpy<2
```

你还需要手动安装rknn-toolkit2-lite2. 

3. 运行

```bash
python ./sensevoice_rknn.py --audio_file output.wav
```

如果使用自己的音频文件测试发现识别不正常,你可能需要提前将它转换为16kHz, 16bit, 单声道的wav格式。

```bash
ffmpeg -i input.mp3 -f wav -acodec pcm_s16le -ac 1 -ar 16000 output.wav
```

## RKNN模型转换

你需要提前安装rknn-toolkit2 v2.1.0或更高版本. 

1. 下载或转换onnx模型

可以从 https://huggingface.co/lovemefan/SenseVoice-onnx 下载到onnx模型.  
应该也可以根据 https://github.com/FunAudioLLM/SenseVoice 中的文档从Pytorch模型转换得到onnx模型.

模型文件应该命名为'sense-voice-encoder.onnx', 放在转换脚本所在目录.

2. 转换为rknn模型
```bash
python convert_rknn.py 
```

## 已知问题

- RKNN2使用fp16推理时可能会出现溢出,导致结果为inf,可以尝试修改输入数据的缩放比例来解决.  
  在`sensevoice_rknn.py`中将`SPEECH_SCALE`设置为更小的值.

## 参考
- [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
- [lovemefan/SenseVoice-python](https://github.com/lovemefan/SenseVoice-python)

# English README

# SenseVoiceSmall-RKNN2

SenseVoice is an audio foundation model with audio understanding capabilities, including Automatic Speech Recognition (ASR), Language Identification (LID), Speech Emotion Recognition (SER), and Acoustic Event Classification (AEC) or Acoustic Event Detection (AED).

Currently, SenseVoice-small supports multilingual speech recognition, emotion recognition, and event detection for Chinese, Cantonese, English, Japanese, and Korean, with extremely low inference latency.

- Inference speed (RKNN2): About 20x real-time on a single NPU core of RK3588 (processing 20 seconds of audio per second), approximately 6 times faster than the official whisper model provided in the rknn-model-zoo.
- Memory usage (RKNN2): About 1.1GB

## Usage

1. Clone the project to your local machine

2. Install dependencies

```bash
pip install kaldi_native_fbank onnxruntime sentencepiece soundfile pyyaml numpy<2
```

You also need to manually install rknn-toolkit2-lite2.

3. Run

```bash
python ./sensevoice_rknn.py --audio_file output.wav
```

If you find that recognition is not working correctly when testing with your own audio files, you may need to convert them to 16kHz, 16-bit, mono WAV format in advance.

```bash
ffmpeg -i input.mp3 -f wav -acodec pcm_s16le -ac 1 -ar 16000 output.wav
```

## RKNN Model Conversion

You need to install rknn-toolkit2 v2.1.0 or higher in advance.

1. Download or convert the ONNX model

You can download the ONNX model from https://huggingface.co/lovemefan/SenseVoice-onnx.
It should also be possible to convert from a PyTorch model to an ONNX model according to the documentation at https://github.com/FunAudioLLM/SenseVoice.

The model file should be named 'sense-voice-encoder.onnx' and placed in the same directory as the conversion script.

2. Convert to RKNN model
```bash
python convert_rknn.py 
```

## Known Issues

- When using fp16 inference with RKNN2, overflow may occur, resulting in inf values. You can try modifying the scaling ratio of the input data to resolve this.  
  Set `SPEECH_SCALE` to a smaller value in `sensevoice_rknn.py`.

## References
- [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
- [lovemefan/SenseVoice-python](https://github.com/lovemefan/SenseVoice-python)