# lina-speech (beta)
Exploring "linear attention" for text-to-speech.
It predicts audio codec tokens à la [MusicGen](https://arxiv.org/abs/2306.05284): the residual vector quantizer levels are interleaved with a delay pattern (sketched below), so a single model suffices instead of one model per codebook.
Featuring [RWKV](https://github.com/BlinkDL/RWKV-LM), [Mamba](https://github.com/state-spaces/mamba), and [Gated Linear Attention](https://github.com/sustcsonglin/flash-linear-attention).
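A minimal, illustrative sketch of the delay pattern (the function name and padding convention are assumptions, not the repo's actual code): codebook level `k` is shifted right by `k` steps, so one autoregressive model can emit all residual levels in a single stream.

```python
# MusicGen-style delay pattern: shift codebook level k right by k steps.
# Illustrative only; see the MusicGen paper for the exact scheme.
import torch

def delay_codes(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """codes: (K, T) codec tokens -> (K, T + K - 1) delayed grid."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

codes = torch.arange(8).reshape(4, 2)   # 4 codebooks, 2 frames
print(delay_codes(codes, pad_id=-1))
```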
Compared to other LM-based TTS models:
- Can be easily pretrained and finetuned on midrange GPUs.
- Tiny memory footprint.
- Trained on long contexts (up to 2000 tokens, i.e. ~27 s).
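(As a rough sanity check on that figure: 2000 tokens over ~27 s is 2000 / 27 ≈ 74 tokens per second, which lines up with the ~75 Hz frame rate of a 24 kHz EnCodec-style codec; the exact codec and the one-token-per-frame reading are assumptions.)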
### Models
| Model | #Params | Dataset | Checkpoint | Steps | Note |
| :---: | :---: |:---: |:---: |:---: |:---: |
| GLA | 60M, 130M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| Mamba| 60M | Librilight-medium |[Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9)| 300k | GPU inference only |
| RWKV v6 | 60M | LibriTTS |[Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 150k | GPU inference only |
### Installation
Depending on which linear-complexity LM you choose, follow the respective instructions first (a quick import check is sketched after this list):
- For Mamba check the [official repo](https://github.com/state-spaces/mamba).
- For GLA/RWKV inference check [flash-linear-attention](https://github.com/sustcsonglin/flash-linear-attention).
- For RWKV training, check [RWKV-LM](https://github.com/BlinkDL/RWKV-LM).
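After installing, the optional backends can be probed by import name. A small sketch, assuming `mamba_ssm` and `fla` are the import names used by the upstream packages at the time of writing:

```python
# Probe which linear-attention backends are importable in the current env.
# Import names follow the upstream repos; verify there if a probe fails.
import importlib.util

for name in ("mamba_ssm", "fla"):
    ok = importlib.util.find_spec(name) is not None
    print(f"{name}: {'available' if ok else 'missing'}")
```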
### Inference
Download configuration and weights above, then check `Inference.ipynb`.
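For orientation before opening the notebook, here is a hedged sketch of the post-processing step: undo the delay pattern, then decode the codes to a waveform. The `encodec` package, the 24 kHz model, the 4-codebook bandwidth, and the random stand-in codes are all assumptions for illustration; `Inference.ipynb` is the authoritative path.

```python
# Undo the MusicGen-style delay, then decode codes to audio.
# Everything here is illustrative: lina-speech's actual codec and code
# layout may differ; follow Inference.ipynb for the real pipeline.
import torch
from encodec import EncodecModel  # pip install encodec (assumed codec)

def undelay_codes(delayed: torch.Tensor) -> torch.Tensor:
    """(K, T + K - 1) delayed grid -> (K, T) time-aligned codes."""
    K, total = delayed.shape
    T = total - K + 1
    return torch.stack([delayed[k, k:k + T] for k in range(K)])

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(3.0)                 # 4 codebooks at 24 kHz
delayed = torch.randint(0, 1024, (4, 103))      # stand-in for model output
codes = undelay_codes(delayed).unsqueeze(0)     # (1, 4, 100)
with torch.no_grad():
    audio = codec.decode([(codes, None)])       # (1, 1, n_samples)
```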
### TODO
- [x] Fix RWKV6 inference and/or switch to the FLA implementation.
- [ ] Provide a Datamodule for training (_lhotse_ might also work well).
- [ ] Implement CFG (classifier-free guidance).
- [ ] Scale up.
### Acknowledgments
- The RWKV authors and the community around them, for carrying out high-level, truly open-source research.
- @SmerkyG for making it easy to test cutting-edge language models.
- @lucidrains for their huge codebase.
- @sustcsonglin, who made [GLA and FLA](https://github.com/sustcsonglin/flash-linear-attention).
- @harrisonvanderbyl for fixing RWKV inference.
### Cite
```bib
@software{lemerle2024linaspeech,
  title = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url = {https://github.com/theodorblackbird/lina-speech},
  month = apr,
  year = {2024}
}
```
### IRCAM
This work takes place at IRCAM and is part of the following project:
[ANR Exovoices](https://anr.fr/Projet-ANR-21-CE23-0040)
<img align="left" width="200" height="200" src="logo_ircam.jpeg">