---
license: cc-by-4.0
datasets:
- cdminix/libritts-aligned
language:
- en
tags:
- speech recognition, speech synthesis, text-to-speech
---

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://vocex-demo.streamlit.app)

This model requires the Vocex library, which is available using ```pip install vocex```

Vocex extracts several measures (as well as d-vectors) from audio.

![summary](https://raw.githubusercontent.com/MiniXC/vocex/main/demo/summary.png)

You can read more here: https://github.com/minixc/vocex

## Usage

```python
from vocex import Vocex
import torchaudio # or any other audio loading library

model = Vocex.from_pretrained('cdminix/vocex') # an fp16 model is loaded by default
model = Vocex.from_pretrained('cdminix/vocex', fp16=False) # to load an fp32 model
model = Vocex.from_pretrained('some/path/model.ckpt') # to load a local checkpoint

audio = ... # a numpy or torch array is required with shape [batch_size, length_in_samples] or just [length_in_samples]
sample_rate = ... # we need to specify a sample rate if the audio is not sampled at 22050

outputs = model(audio, sample_rate)
pitch, energy, snr, srmr = (
    outputs["measures"]["pitch"],
    outputs["measures"]["energy"],
    outputs["measures"]["snr"],
    outputs["measures"]["srmr"],
)
d_vector = outputs["d_vector"] # a torch tensor with shape [batch_size, 256]

# you can also get activations and attention weights at all layers of the model
outputs = model(audio, sample_rate, return_activations=True, return_attention=True)
activations = outputs["activations"] # a list of torch tensors with shape [batch_size, layers, ...]
attention = outputs["attention"] # a list of torch tensors with shape [batch_size, layers, ...]

# there are also speaker avatars, which are a 2D representation of the speaker's voice
outputs = model(audio, sample_rate, return_avatar=True)
avatar = outputs["avatars"] # a torch tensor with shape [batch_size, 256, 256]
```
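
The usage above expects audio shaped `[batch_size, length_in_samples]` or `[length_in_samples]`. Loaders such as `torchaudio.load` return `[channels, samples]` by default, so a stereo file passed straight through would probably be read as a batch of two clips. The following is a minimal sketch of one way to prepare the input with NumPy. The helper names `to_mono` and `pad_batch` are hypothetical and not part of Vocex. Zero-padding clips to a common length is an assumption, since this card does not say how Vocex treats padded samples.

```python
import numpy as np


def to_mono(waveform):
    """Downmix a [channels, samples] array to a 1-D [samples] array.

    Hypothetical helper, not part of the Vocex API.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 1:
        return waveform  # already [length_in_samples]
    if waveform.ndim != 2:
        raise ValueError(f"expected a 1-D or 2-D array, got {waveform.ndim} dimensions")
    return waveform.mean(axis=0)  # average across channels


def pad_batch(clips):
    """Zero-pad 1-D clips on the right and stack them to [batch_size, length_in_samples].

    Assumption: zero padding is acceptable as model input; this card does not confirm it.
    """
    max_len = max(len(clip) for clip in clips)
    batch = np.zeros((len(clips), max_len), dtype=np.float32)
    for i, clip in enumerate(clips):
        batch[i, : len(clip)] = clip
    return batch


# Example with synthetic data: one stereo clip and one shorter mono clip.
stereo_clip = np.random.uniform(-1.0, 1.0, size=(2, 22050))
mono_clip = np.random.uniform(-1.0, 1.0, size=11025)
batch = pad_batch([to_mono(stereo_clip), to_mono(mono_clip)])
print(batch.shape)  # (2, 22050)

# The batch could then be passed to the model as in the usage above:
# outputs = model(batch, 22050)
```

If clips of very different lengths are batched together, processing them one at a time avoids relying on the padding assumption.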