|
--- |
|
license: cc-by-4.0 |
|
datasets: |
|
- cdminix/libritts-aligned |
|
language: |
|
- en |
|
tags: |
|
- speech recognition
- speech synthesis
- text-to-speech
|
--- |
|
|
|
[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://vocex-demo.streamlit.app) |
|
|
|
|
|
This model requires the Vocex library, which can be installed with:

```bash
pip install vocex
```
|
|
|
|
|
|
|
Vocex extracts several frame-level measures (pitch, energy, SNR, and SRMR) as well as d-vectors from audio.
|
![summary](https://raw.githubusercontent.com/MiniXC/vocex/main/demo/summary.png) |
|
You can read more at https://github.com/minixc/vocex
|
|
|
## Usage |
|
```python |
|
from vocex import Vocex |
|
import torchaudio # or any other audio loading library |
|
|
|
model = Vocex.from_pretrained('cdminix/vocex') # an fp16 model is loaded by default |
|
model = Vocex.from_pretrained('cdminix/vocex', fp16=False) # to load an fp32 model
|
model = Vocex.from_pretrained('some/path/model.ckpt') # to load local checkpoint |
|
|
|
audio = ... # a NumPy or torch array with shape [batch_size, length_in_samples] or [length_in_samples]
|
sample_rate = ... # required if the audio is not sampled at 22050 Hz
|
|
|
outputs = model(audio, sample_rate) |
|
pitch, energy, snr, srmr = ( |
|
outputs["measures"]["pitch"], |
|
outputs["measures"]["energy"], |
|
outputs["measures"]["snr"], |
|
outputs["measures"]["srmr"], |
|
) |
|
d_vector = outputs["d_vector"] # a torch tensor with shape [batch_size, 256] |
|
|
|
# you can also get activations and attention weights at all layers of the model |
|
outputs = model(audio, sample_rate, return_activations=True, return_attention=True) |
|
activations = outputs["activations"] # a list of torch tensors with shape [batch_size, layers, ...] |
|
attention = outputs["attention"] # a list of torch tensors with shape [batch_size, layers, ...] |
|
|
|
# there are also speaker avatars, which are a 2D representation of the speaker's voice |
|
outputs = model(audio, sample_rate, return_avatar=True) |
|
avatar = outputs["avatars"] # a torch tensor with shape [batch_size, 256, 256] |
|
``` |
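Since d-vectors are fixed-size speaker embeddings, a common downstream use is comparing two utterances with cosine similarity for speaker verification. Below is a minimal sketch of that comparison; it uses random placeholder vectors (and NumPy instead of torch) in place of real Vocex outputs, so the vector names and the similarity threshold interpretation are illustrative assumptions, not part of the Vocex API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder stand-ins for d-vectors extracted from two utterances,
# e.g. outputs["d_vector"][0] for each utterance (shape [256]).
rng = np.random.default_rng(0)
d_vec_a = rng.normal(size=256)
d_vec_b = rng.normal(size=256)

score = cosine_similarity(d_vec_a, d_vec_b)
# Scores near 1.0 suggest the same speaker; scores near 0 suggest different speakers.
```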