OuteTTS
Collection
4 items
β’
Updated
β’
11
OuteTTS-0.2-500M is our improved successor to the v0.1 release. The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself. Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.
Special thanks to Hugging Face for providing GPU grant that supported the training of this model.
pip install outetts
import outetts
# Configure the model
model_config = outetts.HFModelConfig_v1(
model_path="OuteAI/OuteTTS-0.2-500M",
language="en", # Supported languages in v0.2: en, zh, ja, ko
)
# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
# Optional: Create a speaker profile (use a 10-15 second audio clip)
# speaker = interface.create_speaker(
# audio_path="path/to/audio/file",
# transcript="Transcription of the audio file."
# )
# Optional: Save and load speaker profiles
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")
# Optional: Load speaker from default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")
output = interface.generate(
text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
# Lower temperature values may result in a more stable tone,
# while higher values can introduce varied and expressive speech
temperature=0.1,
repetition_penalty=1.1,
max_length=4096,
# Optional: Use a speaker profile for consistent voice characteristics
# Without a speaker profile, the model will generate a voice with random characteristics
speaker=speaker,
)
# Save the synthesized speech to a file
output.save("output.wav")
# Optional: Play the synthesized speech
# output.play()
# Configure the GGUF model
model_config = outetts.GGUFModelConfig_v1(
model_path="local/path/to/model.gguf",
language="en", # Supported languages in v0.2: en, zh, ja, ko
n_gpu_layers=0,
)
# Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
import outetts
import torch
model_config = outetts.HFModelConfig_v1(
model_path="OuteAI/OuteTTS-0.2-500M",
language="en", # Supported languages in v0.2: en, zh, ja, ko
dtype=torch.bfloat16,
additional_model_config={
'attn_implementation': "flash_attention_2"
}
)
To achieve the best results when creating a speaker profile, consider the following recommendations:
Audio Clip Duration:
Audio Quality:
Accurate Transcription:
Speaker Familiarity:
Parameter Adjustments:
temperature
in the generate
function to refine the expressive quality and consistency of the synthesized voice.