NexaAIDev/Qwen2-Audio-7B-GGUF

	# Qwen2-Audio

	<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/ThcKJj7LcWCZPwN1So05f.png" alt="Example" style="width:700px;"/>

	Qwen2-Audio is a SOTA small-scale multimodal model that handles audio and text inputs, allowing voice interactions without a separate ASR module. It supports English, Chinese, and major European languages, and provides robust audio analysis for local use cases such as:
	- Speaker identification and response
	- Speech translation and transcription
	- Mixed audio and noise detection
	- Music and sound analysis

	## We're bringing Qwen2-Audio to edge devices with Nexa SDK, offering various quantization options.
	- Voice Chat: Users can freely engage in voice interactions with Qwen2-Audio without text input.
	- Audio Analysis: Users can provide both audio and text instructions for analysis during the interaction.
	### Demo

	<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/02XDwJe3bhZHYptor-b2_.mp4"></video>

	## How to Run Locally On-Device

	In the following, we demonstrate how to run Qwen2-Audio locally on your device.

	Step 1: Install Nexa-SDK (local on-device inference framework)

	[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

	> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via Python package or executable installer.

	Step 2: Run the following command in your terminal to launch the local Streamlit UI

	```bash
	nexa run qwen2audio -st
	```

	Or, to use it directly in the terminal:

	```bash
	nexa run qwen2audio
	```
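
If you prefer to start the model from a script, the two commands above can be wrapped as in the following minimal sketch. The command and the `-st` flag are taken from this README; the helper names `nexa_command` and `launch` are hypothetical and are not part of Nexa-SDK.

```python
import shutil
import subprocess


def nexa_command(streamlit_ui=False):
    """Build the `nexa run qwen2audio` command from this README as an argument list."""
    command = ["nexa", "run", "qwen2audio"]
    if streamlit_ui:
        command.append("-st")  # flag shown above for the local Streamlit UI
    return command


def launch(streamlit_ui=False):
    """Run the command in the foreground; fail early if `nexa` is not on PATH."""
    if shutil.which("nexa") is None:
        raise FileNotFoundError(
            "`nexa` was not found on PATH; install Nexa-SDK first (Step 1)."
        )
    # The session is interactive, so output is not captured.
    return subprocess.run(nexa_command(streamlit_ui), check=False).returncode
```

Calling `launch(streamlit_ui=True)` corresponds to the first command above, and `launch()` to the terminal variant.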

	### Usage Instructions

	For terminal:
	1. Drag and drop your audio file into the terminal (or enter the file path on Linux)
	2. Add a text prompt to guide the analysis, or leave it empty for direct voice input

	### System Requirements

	💻 RAM Requirements:
	- Default q4_K_M version requires 4.2 GB of RAM
	- Check the RAM requirements table for different quantization versions
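
A rough pre-flight check against the 4.2 GB figure could look like the sketch below. It makes two assumptions that this README does not state: the figure is read as decimal gigabytes, and total physical RAM is used as a coarse proxy (memory already in use by other programs is not accounted for). The function names are hypothetical, and `os.sysconf` is only available on Unix-like systems, so the helper returns `None` elsewhere.

```python
import os

REQUIRED_RAM_GB = 4.2  # q4_K_M figure quoted above


def total_ram_bytes():
    """Return total physical RAM in bytes, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, or a platform without these sysconf names
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count


def meets_ram_requirement(ram_bytes, required_gb=REQUIRED_RAM_GB):
    """Compare a RAM size in bytes against a requirement in decimal GB (assumption)."""
    return ram_bytes >= required_gb * 1e9


if __name__ == "__main__":
    ram = total_ram_bytes()
    if ram is None:
        print("Could not determine total RAM on this platform.")
    else:
        print(f"Total RAM: {ram / 1e9:.1f} GB, enough: {meets_ram_requirement(ram)}")
```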

	🎵 Audio Format:
	- Optimal: 16 kHz `.wav` format
	- Other formats and sample rates are supported with automatic conversion
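
To see whether a file is already in the optimal format, you can read the sample rate from its WAV header. The sketch below uses only Python's standard `wave` module; the function names are hypothetical, and `write_silence` exists only to produce a demo input. Files that fail the check are not rejected by the model; per the note above, they go through automatic conversion.

```python
import wave

OPTIMAL_SAMPLE_RATE_HZ = 16_000  # from the "Optimal: 16 kHz .wav" note above


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_optimal_wav(path):
    """True if the file is a readable PCM WAV at the optimal sample rate."""
    try:
        return wav_sample_rate(path) == OPTIMAL_SAMPLE_RATE_HZ
    except (wave.Error, EOFError, OSError):
        # Missing, unreadable, or not a PCM WAV file.
        return False


def write_silence(path, sample_rate_hz, seconds=1):
    """Write a mono 16-bit silent WAV file, used here only as a demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(b"\x00\x00" * sample_rate_hz * seconds)
```

For example, after `write_silence("demo.wav", 16_000)`, `is_optimal_wav("demo.wav")` returns `True`, while a file written at 44,100 Hz returns `False`.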

	## Use Cases

	### Voice Chat
	- Answer daily questions
	- Offer suggestions
	- Speaker identification and response
	- Speech translation
	- Detecting background noise and responding accordingly

	### Audio Analysis
	- Information Extraction
	- Audio summary
	- Speech Transcription and Expansion
	- Mixed audio and noise detection
	- Music and sound analysis

	## Performance Benchmark

	<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/lax8bLpR5uK2_Za0G6G3j.png" alt="Example" style="width:700px;"/>

	Results demonstrate that Qwen2-Audio significantly outperforms both previous SOTA models and Qwen-Audio across all tasks.

	<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/2vACK_gD_MAuZ7Hn4Yfiv.png" alt="Example" style="width:700px;"/>


	## Follow Nexa AI to run more models on-device
	[Website](https://nexa.ai/)