# Salt - Speech and Language Transformer
Vikhr Salt is an advanced speech and language transformer model designed for seamless handling of Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. Built upon a pre-trained large language model, Vikhr Salt extends its vocabulary to include new audio tokens, enabling it to process multimodal data effectively while leveraging the rich prior knowledge embedded in the model.
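As a rough illustration of the vocabulary-extension idea described above, the following minimal sketch adds discrete audio tokens to a pre-trained causal LM using standard `transformers` calls. The base checkpoint, codebook size, and token names are placeholders for illustration, not values taken from Salt.

```python
# Illustrative sketch (not Salt's actual code): extend a pre-trained causal
# LM so that discrete audio codes live in the same vocabulary as text.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder base LLM
NUM_AUDIO_CODES = 1024                    # assumed audio codebook size

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# One new token per discrete audio code, plus modality boundary markers.
audio_tokens = [f"<audio_{i}>" for i in range(NUM_AUDIO_CODES)]
tokenizer.add_tokens(audio_tokens + ["<start_of_audio>", "<end_of_audio>"])

# Grow the embedding matrix and LM head to cover the new ids; the original
# text embeddings keep their pre-trained weights (the model's prior knowledge).
model.resize_token_embeddings(len(tokenizer))
```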
MMLU benchmark results:

| Group             | Version | Filter | n-shot | Metric |  Value | Stderr   |
|-------------------|--------:|--------|--------|--------|-------:|---------:|
| mmlu              |       2 | none   |        | acc ↑  | 0.2691 | ± 0.0037 |
| - humanities      |       2 | none   |        | acc ↑  | 0.2442 | ± 0.0063 |
| - other           |       2 | none   |        | acc ↑  | 0.2478 | ± 0.0076 |
| - social sciences |       2 | none   |        | acc ↑  | 0.3094 | ± 0.0083 |
| - stem            |       2 | none   |        | acc ↑  | 0.2880 | ± 0.0080 |
## Key Features

- Unified Multimodal Approach: Combines text and audio processing in a single framework with one LM loss, ensuring coherent and efficient learning.
- Dual Tokenization System: Supports both Encodec and SpeechTokenizer tokens, enabling flexibility in training and inference.
- Optimized Training Pipeline: Achieves stable training with mixed-precision settings, using tf32 for improved numerical stability.
- Comprehensive Metrics: Evaluated with standard audio-quality metrics such as PESQ, STOI, and SI-SDR, and with SIMO for zero-shot TTS.

Minimal, hedged sketches of the audio tokenization, the single-loss training step, and the evaluation metrics follow this list.
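To make the dual-tokenization point concrete, here is a minimal sketch of turning a waveform into discrete codes with EnCodec through `transformers`; SpeechTokenizer codes would be handled analogously. The checkpoint and the dummy one-second waveform are assumptions for illustration only.

```python
# Sketch: discretize audio into EnCodec codes, which can then be mapped onto
# the <audio_i> tokens in the LM vocabulary. SpeechTokenizer output would be
# treated the same way.
import torch
from transformers import AutoProcessor, EncodecModel

codec = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

waveform = torch.randn(24_000)  # one second of dummy audio at 24 kHz
inputs = processor(
    raw_audio=waveform.numpy(), sampling_rate=24_000, return_tensors="pt"
)

with torch.no_grad():
    encoded = codec.encode(inputs["input_values"], inputs["padding_mask"])

# encoded.audio_codes holds integer code ids (one sequence per codebook),
# ready to be flattened into a token stream for the language model.
print(encoded.audio_codes.shape)
```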
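The single-loss training setup can likewise be sketched schematically: text ids and the added audio-token ids are interleaved into one sequence and trained with ordinary next-token cross-entropy, with tf32 matmuls enabled. This is a generic causal-LM step under those assumptions, not the project's actual trainer.

```python
# Schematic training step: one causal-LM cross-entropy over an interleaved
# text + audio token sequence, with tf32 enabled for CUDA matmuls.
import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

def lm_step(model, input_ids, optimizer):
    # input_ids mixes ordinary text ids and the added <audio_i> ids in a
    # single sequence; passing labels=input_ids makes the model compute the
    # standard shifted next-token cross-entropy over all positions.
    outputs = model(input_ids=input_ids, labels=input_ids)
    loss = outputs.loss  # one LM loss covering both modalities
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```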
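For the listed audio-quality metrics, `torchmetrics` ships reference implementations of PESQ, STOI, and SI-SDR; the snippet below shows the evaluation pattern with dummy 16 kHz tensors (the sample rate and tensors are placeholders, not Salt's evaluation setup).

```python
# Sketch: scoring generated audio against a reference with PESQ, STOI,
# and SI-SDR from torchmetrics (dummy tensors stand in for real audio).
import torch
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ScaleInvariantSignalDistortionRatio,
    ShortTimeObjectiveIntelligibility,
)

SAMPLE_RATE = 16_000
pesq = PerceptualEvaluationSpeechQuality(SAMPLE_RATE, "wb")  # wide-band PESQ
stoi = ShortTimeObjectiveIntelligibility(SAMPLE_RATE)
si_sdr = ScaleInvariantSignalDistortionRatio()

reference = torch.randn(SAMPLE_RATE)                        # ground-truth waveform
generated = reference + 0.05 * torch.randn_like(reference)  # stand-in for model output

print("PESQ  :", pesq(generated, reference).item())
print("STOI  :", stoi(generated, reference).item())
print("SI-SDR:", si_sdr(generated, reference).item())
```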
## Model Highlights

- Multimodal Compatibility: Handles both semantic and acoustic token sequences, supporting diverse TTS and ASR scenarios.
- Training Efficiency: Trained in 150 A100 GPU-hours, balancing performance and computational cost.
- Competitive Benchmarks: Demonstrates strong performance on MMLU and other standard benchmarks (see the table above).
## Applications

- Text-to-Speech synthesis with customizable styles and tones.
- Automatic Speech Recognition for accurate transcription.
- Multimodal research and development in speech and language understanding.
## Example Use Cases

- Generate expressive, natural-sounding speech from text (a hypothetical inference sketch follows this list).
- Transcribe audio recordings into text with high accuracy.
- Explore emergent multimodal capabilities in large-scale models.
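As a purely hypothetical illustration of the TTS use case, the sketch below prompts the model with text and lets it generate audio tokens. The checkpoint name, the `<start_of_audio>` convention, and the final decoding step back to a waveform are placeholders; the actual interface is defined by the released code, not by this snippet.

```python
# Hypothetical TTS sketch: checkpoint name, prompting convention, and the
# handling of generated audio tokens are placeholders, not a documented API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Vikhrmodels/salt"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Hello, world! <start_of_audio>"  # assumed prompting convention
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=512, do_sample=True)

# Ids generated after the prompt are interpreted as audio tokens; mapping
# them back to Encodec / SpeechTokenizer codes and decoding to a waveform
# is model-specific and therefore not shown here.
audio_token_ids = generated[0, inputs["input_ids"].shape[1]:]
print(tokenizer.convert_ids_to_tokens(audio_token_ids[:10]))
```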