Qwen2-Audio
Qwen2-Audio is a SOTA small-scale multimodal model that handles audio and text inputs, allowing voice interactions without a separate ASR module. It supports English, Chinese, and major European languages, and provides robust audio analysis for local use cases such as:
- Speaker identification and response
- Speech translation and transcription
- Mixed audio and noise detection
- Music and sound analysis
We're bringing Qwen2-Audio to edge devices with Nexa SDK, offering various quantization options. It supports two interaction modes:
- Voice Chat: Users can freely engage in voice interactions with Qwen2-Audio without text input.
- Audio Analysis: Users can provide both audio and text instructions for analysis during the interaction.
How to Run Locally On-Device
The steps below show how to run Qwen2-Audio locally on your device.
Step 1: Install Nexa-SDK (local on-device inference framework)
Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed as a Python package or with an executable installer.
Step 2: Run the following command in your terminal to launch the local Streamlit UI:
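Before moving on to Step 2, it can help to confirm that the install actually put the `nexa` command on your PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name `nexa_cli_available` is hypothetical and not part of Nexa-SDK.

```python
import shutil


def nexa_cli_available(command: str = "nexa") -> bool:
    """Return True if an executable named `command` is found on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if nexa_cli_available():
        print("nexa CLI found on PATH")
    else:
        print("nexa CLI not found; install Nexa-SDK first")
```

This only checks that the executable can be located; it says nothing about which version is installed.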
nexa run qwen2audio -st
Or, to use it directly in the terminal:
nexa run qwen2audio
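If you prefer to start the model from a script, the two commands above can be wrapped with `subprocess`. This is a sketch, not an official Nexa-SDK API: the function names `build_nexa_command` and `launch` are hypothetical, and the only CLI arguments used are the ones shown above (`nexa run qwen2audio` and the `-st` flag).

```python
import subprocess


def build_nexa_command(model: str = "qwen2audio", streamlit: bool = False) -> list[str]:
    """Build the argument list for the commands shown above."""
    cmd = ["nexa", "run", model]
    if streamlit:
        # -st launches the local Streamlit UI instead of the terminal session
        cmd.append("-st")
    return cmd


def launch(streamlit: bool = False) -> int:
    """Run the CLI in the foreground and return its exit code.

    Raises FileNotFoundError if the `nexa` executable is not on PATH.
    """
    return subprocess.run(build_nexa_command(streamlit=streamlit)).returncode


if __name__ == "__main__":
    launch(streamlit=True)
```

The session stays interactive because `subprocess.run` inherits the parent's stdin and stdout by default.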
Usage Instructions
For terminal:
- Drag and drop your audio file into the terminal (or enter file path on Linux)
- Add text prompt to guide analysis or leave empty for direct voice input
System Requirements
💻 RAM Requirements:
- Default q4_K_M version requires 4.2GB of RAM
- Check the RAM requirements table for different quantization versions
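As a rough pre-flight check, you can compare your machine's total memory against the 4.2GB figure quoted for the default q4_K_M version. The sketch below makes two assumptions: "GB" is read as 10^9 bytes, and total (not free) physical memory is used as the yardstick. `os.sysconf` is only available on Unix-like systems, so the helper returns `None` elsewhere. Both function names are hypothetical.

```python
import os

REQUIRED_GB = 4.2  # figure quoted above for the default q4_K_M version


def total_ram_bytes():
    """Return total physical memory in bytes, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows; the names may be unsupported elsewhere
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size


def meets_ram_requirement(total_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """Assumption: 1 GB is treated as 10**9 bytes."""
    return total_bytes >= required_gb * 10**9


if __name__ == "__main__":
    total = total_ram_bytes()
    if total is None:
        print("Could not determine total RAM on this platform")
    else:
        print("Enough total RAM:", meets_ram_requirement(total))
```

Total memory is an upper bound; other running applications reduce what is actually available to the model.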
🎵 Audio Format:
- Optimal: 16kHz .wav format
- Other formats and sample rates are supported with automatic conversion
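To see whether a file already matches the optimal format, you can read its header with Python's standard `wave` module. This is a minimal sketch: the helper name `is_optimal_wav` is hypothetical, and it checks only the sample rate of an uncompressed WAV file, not channel count or bit depth.

```python
import wave

TARGET_RATE_HZ = 16_000  # the "optimal" sample rate listed above


def is_optimal_wav(path: str) -> bool:
    """Return True if `path` is a readable WAV file sampled at 16 kHz."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == TARGET_RATE_HZ
    except (wave.Error, EOFError):
        # Not a WAV file the standard library can parse (or an empty file)
        return False
```

A `False` result does not mean the file is unusable, since other formats and sample rates are converted automatically.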
Use Cases
Voice Chat
- Answer daily questions
- Offer suggestions
- Speaker identification and response
- Speech translation
- Detecting background noise and responding accordingly
Audio Analysis
- Information Extraction
- Audio summary
- Speech Transcription and Expansion
- Mixed audio and noise detection
- Music and sound analysis
Performance Benchmark
Results show that Qwen2-Audio significantly outperforms both previous SOTA models and Qwen-Audio across all tasks.