wr committed on
Commit • 0233e7e
1 Parent(s): 604eca0
add manifest and pretrained vocoders
Browse files
- README.md +45 -0
- manifest/.DS_Store +0 -0
- manifest/arctic_bdl_parallel_wavegan.v1/.DS_Store +0 -0
- manifest/arctic_bdl_parallel_wavegan.v1/config.yml +104 -0
- manifest/arctic_bdl_parallel_wavegan.v1/pwg-arctic-bdl-400000steps.pkl +3 -0
- manifest/arctic_bdl_parallel_wavegan.v1/stats.npy +3 -0
- manifest/arctic_clb_parallel_wavegan.v1/.DS_Store +0 -0
- manifest/arctic_clb_parallel_wavegan.v1/config.yml +104 -0
- manifest/arctic_clb_parallel_wavegan.v1/pwg-arctic-clb-400000steps.pkl +3 -0
- manifest/arctic_clb_parallel_wavegan.v1/stats.npy +3 -0
- manifest/arctic_rms_parallel_wavegan.v1/.DS_Store +0 -0
- manifest/arctic_rms_parallel_wavegan.v1/config.yml +104 -0
- manifest/arctic_rms_parallel_wavegan.v1/pwg-arctic-rms-400000steps.pkl +3 -0
- manifest/arctic_rms_parallel_wavegan.v1/stats.npy +3 -0
- manifest/arctic_slt_parallel_wavegan.v1/.DS_Store +0 -0
- manifest/arctic_slt_parallel_wavegan.v1/config.yml +94 -0
- manifest/arctic_slt_parallel_wavegan.v1/pwg-arctic-slt-400000steps.pkl +3 -0
- manifest/arctic_slt_parallel_wavegan.v1/stats.npy +3 -0
- manifest/dict.txt +3 -0
- manifest/test.tsv +3 -0
- manifest/train.tsv +3 -0
- manifest/utils/cmu_arctic_manifest.py +90 -0
- manifest/utils/make_tsv.sh +10 -0
- manifest/utils/prep_cmu_arctic_spkemb.py +68 -0
- manifest/utils/spec2wav.sh +0 -0
- manifest/valid.tsv +3 -0
README.md
CHANGED
@@ -1,3 +1,48 @@
---
license: mit
+tags:
+- speech
+- text
+- cross-modal
+- unified model
+- self-supervised learning
+- SpeechT5
+- Voice Conversion
+datasets:
+- CMU ARCTIC
+- bdl
+- clb
+- rms
+- slt
---
+
+## SpeechT5 TTS Manifest
+
+| [**Github**](https://github.com/microsoft/SpeechT5) | [**Huggingface**](https://huggingface.co/mechanicalsea/speecht5-vc) |
+
+This manifest is an attempt to recreate the Voice Conversion recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was constructed from the four [CMU ARCTIC](http://www.festvox.org/cmu_arctic/) speakers, i.e., bdl, clb, rms, and slt. There are 932 utterances for training, 100 for validation, and 100 for evaluation.
+
+### Requirements
+
+- [SpeechBrain](https://github.com/speechbrain/speechbrain) for extracting speaker embeddings
+- [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) for the vocoder implementation
+
+### Tools
+
+- [manifest/utils](./manifest/utils/) is used to extract speaker embeddings, generate the manifests, and apply the vocoder.
+- [manifest/arctic*](./manifest/) provides the pre-trained vocoder for each speaker.
+
+### Reference
+
+If you find our work useful in your research, please cite the following paper:
+
+```bibtex
+@inproceedings{ao-etal-2022-speecht5,
+    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
+    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
+    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+    month = {May},
+    year = {2022},
+    pages = {5723--5738},
+}
+```
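A note on the manifest sizes quoted above (my own sanity check, not part of the commit): `manifest/utils/cmu_arctic_manifest.py` emits one row per utterance for every ordered (source, target) pair of the four speakers, excluding same-speaker pairs, so each utterance index contributes 12 rows.

```python
# Sanity check on the numbers in the README: 4 speakers, all ordered
# source-target pairs except same-speaker ones -> 12 pairs per utterance.
speakers = ["bdl", "clb", "rms", "slt"]
pairs = [(s, t) for s in speakers for t in speakers if s != t]

splits = {"train": 932, "valid": 100, "test": 100}
rows = {name: n_utts * len(pairs) for name, n_utts in splits.items()}

print(len(pairs))     # 12 ordered speaker pairs
print(rows["train"])  # 932 * 12 = 11184 data rows in train.tsv
```

Each `.tsv` additionally carries one header line (the waveform root) before these data rows.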
manifest/.DS_Store
ADDED
Binary file (8.2 kB). View file
manifest/arctic_bdl_parallel_wavegan.v1/.DS_Store
ADDED
Binary file (6.15 kB). View file
manifest/arctic_bdl_parallel_wavegan.v1/config.yml
ADDED
@@ -0,0 +1,104 @@
+allow_cache: true
+batch_max_steps: 15360
+batch_size: 10
+config: conf/parallel_wavegan.v1.yaml
+dev_dumpdir: dump/dev_bdl/norm
+dev_feats_scp: null
+dev_segments: null
+dev_wav_scp: null
+discriminator_grad_norm: 1
+discriminator_optimizer_params:
+    eps: 1.0e-06
+    lr: 5.0e-05
+    weight_decay: 0.0
+discriminator_params:
+    bias: true
+    conv_channels: 64
+    in_channels: 1
+    kernel_size: 3
+    layers: 10
+    nonlinear_activation: LeakyReLU
+    nonlinear_activation_params:
+        negative_slope: 0.2
+    out_channels: 1
+    use_weight_norm: true
+discriminator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+discriminator_train_start_steps: 100000
+distributed: false
+eval_interval_steps: 1000
+fft_size: 1024
+fmax: 7600
+fmin: 80
+format: npy
+generator_grad_norm: 10
+generator_optimizer_params:
+    eps: 1.0e-06
+    lr: 0.0001
+    weight_decay: 0.0
+generator_params:
+    aux_channels: 80
+    aux_context_window: 2
+    dropout: 0.0
+    gate_channels: 128
+    in_channels: 1
+    kernel_size: 3
+    layers: 30
+    out_channels: 1
+    residual_channels: 64
+    skip_channels: 64
+    stacks: 3
+    upsample_net: ConvInUpsampleNetwork
+    upsample_params:
+        upsample_scales:
+        - 4
+        - 4
+        - 4
+        - 4
+    use_weight_norm: true
+generator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+global_gain_scale: 1.0
+hop_size: 256
+lambda_adv: 4.0
+log_interval_steps: 100
+num_mels: 80
+num_save_intermediate_results: 4
+num_workers: 2
+outdir: exp/train_nodev_bdl_arctic_parallel_wavegan.v1
+pin_memory: true
+pretrain: ''
+rank: 0
+remove_short_samples: true
+resume: /mnt/default/v-junyiao/vc_vocoder2/train_nodev_bdl_arctic_parallel_wavegan.v1/checkpoint-135000steps.pkl
+sampling_rate: 16000
+save_interval_steps: 5000
+stft_loss_params:
+    fft_sizes:
+    - 1024
+    - 2048
+    - 512
+    hop_sizes:
+    - 120
+    - 240
+    - 50
+    win_lengths:
+    - 600
+    - 1200
+    - 240
+    window: hann_window
+train_dumpdir: dump/train_nodev_bdl/norm
+train_feats_scp: null
+train_max_steps: 400000
+train_segments: null
+train_wav_scp: null
+trim_frame_size: 2048
+trim_hop_size: 512
+trim_silence: false
+trim_threshold_in_db: 60
+verbose: 1
+version: 0.4.8
+win_length: null
+window: hann
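A quick consistency check on the vocoder config above (my own sketch, not part of the commit): the generator's `upsample_scales` must multiply out to `hop_size`, since the vocoder upsamples one mel frame to `hop_size` waveform samples.

```python
# Consistency check on config.yml: prod(upsample_scales) must equal hop_size.
from math import prod

config = {  # relevant fragment of the config above, transcribed by hand
    "hop_size": 256,
    "sampling_rate": 16000,
    "generator_params": {"upsample_params": {"upsample_scales": [4, 4, 4, 4]}},
}

scales = config["generator_params"]["upsample_params"]["upsample_scales"]
assert prod(scales) == config["hop_size"]  # 4*4*4*4 == 256

# Mel frame rate implied by the config: 16000 / 256 = 62.5 frames per second.
print(config["sampling_rate"] / config["hop_size"])  # 62.5
```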
manifest/arctic_bdl_parallel_wavegan.v1/pwg-arctic-bdl-400000steps.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f92557c6c61c2acc3a7f74533b291f03eae891963adee06d2e901922886c803c
+size 5918653
manifest/arctic_bdl_parallel_wavegan.v1/stats.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7c186bca19c4ed7bc4d93dd7aacd3db9d8ca6186fd5d5e8d64b7b19cde03637c
+size 768
manifest/arctic_clb_parallel_wavegan.v1/.DS_Store
ADDED
Binary file (6.15 kB). View file
manifest/arctic_clb_parallel_wavegan.v1/config.yml
ADDED
@@ -0,0 +1,104 @@
+allow_cache: true
+batch_max_steps: 15360
+batch_size: 10
+config: conf/parallel_wavegan.v1.yaml
+dev_dumpdir: dump/dev_clb/norm
+dev_feats_scp: null
+dev_segments: null
+dev_wav_scp: null
+discriminator_grad_norm: 1
+discriminator_optimizer_params:
+    eps: 1.0e-06
+    lr: 5.0e-05
+    weight_decay: 0.0
+discriminator_params:
+    bias: true
+    conv_channels: 64
+    in_channels: 1
+    kernel_size: 3
+    layers: 10
+    nonlinear_activation: LeakyReLU
+    nonlinear_activation_params:
+        negative_slope: 0.2
+    out_channels: 1
+    use_weight_norm: true
+discriminator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+discriminator_train_start_steps: 100000
+distributed: false
+eval_interval_steps: 1000
+fft_size: 1024
+fmax: 7600
+fmin: 80
+format: npy
+generator_grad_norm: 10
+generator_optimizer_params:
+    eps: 1.0e-06
+    lr: 0.0001
+    weight_decay: 0.0
+generator_params:
+    aux_channels: 80
+    aux_context_window: 2
+    dropout: 0.0
+    gate_channels: 128
+    in_channels: 1
+    kernel_size: 3
+    layers: 30
+    out_channels: 1
+    residual_channels: 64
+    skip_channels: 64
+    stacks: 3
+    upsample_net: ConvInUpsampleNetwork
+    upsample_params:
+        upsample_scales:
+        - 4
+        - 4
+        - 4
+        - 4
+    use_weight_norm: true
+generator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+global_gain_scale: 1.0
+hop_size: 256
+lambda_adv: 4.0
+log_interval_steps: 100
+num_mels: 80
+num_save_intermediate_results: 4
+num_workers: 2
+outdir: exp/train_nodev_clb_arctic_parallel_wavegan.v1
+pin_memory: true
+pretrain: ''
+rank: 0
+remove_short_samples: true
+resume: /mnt/default/v-junyiao/vc_vocoder2/train_nodev_clb_arctic_parallel_wavegan.v1/checkpoint-135000steps.pkl
+sampling_rate: 16000
+save_interval_steps: 5000
+stft_loss_params:
+    fft_sizes:
+    - 1024
+    - 2048
+    - 512
+    hop_sizes:
+    - 120
+    - 240
+    - 50
+    win_lengths:
+    - 600
+    - 1200
+    - 240
+    window: hann_window
+train_dumpdir: dump/train_nodev_clb/norm
+train_feats_scp: null
+train_max_steps: 400000
+train_segments: null
+train_wav_scp: null
+trim_frame_size: 2048
+trim_hop_size: 512
+trim_silence: false
+trim_threshold_in_db: 60
+verbose: 1
+version: 0.4.8
+win_length: null
+window: hann
manifest/arctic_clb_parallel_wavegan.v1/pwg-arctic-clb-400000steps.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e80e448926a2b5b38de076fa8cc9e38589712d95ed08705bc7f242910c15ec4e
+size 5918653
manifest/arctic_clb_parallel_wavegan.v1/stats.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:23ef7d65275668849dc7c5bb876d78b8e3657f5e1ca299b76eb3ca6ce9c2370e
+size 768
manifest/arctic_rms_parallel_wavegan.v1/.DS_Store
ADDED
Binary file (6.15 kB). View file
manifest/arctic_rms_parallel_wavegan.v1/config.yml
ADDED
@@ -0,0 +1,104 @@
+allow_cache: true
+batch_max_steps: 15360
+batch_size: 10
+config: conf/parallel_wavegan.v1.yaml
+dev_dumpdir: dump/dev_rms/norm
+dev_feats_scp: null
+dev_segments: null
+dev_wav_scp: null
+discriminator_grad_norm: 1
+discriminator_optimizer_params:
+    eps: 1.0e-06
+    lr: 5.0e-05
+    weight_decay: 0.0
+discriminator_params:
+    bias: true
+    conv_channels: 64
+    in_channels: 1
+    kernel_size: 3
+    layers: 10
+    nonlinear_activation: LeakyReLU
+    nonlinear_activation_params:
+        negative_slope: 0.2
+    out_channels: 1
+    use_weight_norm: true
+discriminator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+discriminator_train_start_steps: 100000
+distributed: false
+eval_interval_steps: 1000
+fft_size: 1024
+fmax: 7600
+fmin: 80
+format: npy
+generator_grad_norm: 10
+generator_optimizer_params:
+    eps: 1.0e-06
+    lr: 0.0001
+    weight_decay: 0.0
+generator_params:
+    aux_channels: 80
+    aux_context_window: 2
+    dropout: 0.0
+    gate_channels: 128
+    in_channels: 1
+    kernel_size: 3
+    layers: 30
+    out_channels: 1
+    residual_channels: 64
+    skip_channels: 64
+    stacks: 3
+    upsample_net: ConvInUpsampleNetwork
+    upsample_params:
+        upsample_scales:
+        - 4
+        - 4
+        - 4
+        - 4
+    use_weight_norm: true
+generator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+global_gain_scale: 1.0
+hop_size: 256
+lambda_adv: 4.0
+log_interval_steps: 100
+num_mels: 80
+num_save_intermediate_results: 4
+num_workers: 2
+outdir: exp/train_nodev_rms_arctic_parallel_wavegan.v1
+pin_memory: true
+pretrain: ''
+rank: 0
+remove_short_samples: true
+resume: /mnt/default/v-junyiao/vc_vocoder2/train_nodev_rms_arctic_parallel_wavegan.v1/checkpoint-110000steps.pkl
+sampling_rate: 16000
+save_interval_steps: 5000
+stft_loss_params:
+    fft_sizes:
+    - 1024
+    - 2048
+    - 512
+    hop_sizes:
+    - 120
+    - 240
+    - 50
+    win_lengths:
+    - 600
+    - 1200
+    - 240
+    window: hann_window
+train_dumpdir: dump/train_nodev_rms/norm
+train_feats_scp: null
+train_max_steps: 400000
+train_segments: null
+train_wav_scp: null
+trim_frame_size: 2048
+trim_hop_size: 512
+trim_silence: false
+trim_threshold_in_db: 60
+verbose: 1
+version: 0.4.8
+win_length: null
+window: hann
manifest/arctic_rms_parallel_wavegan.v1/pwg-arctic-rms-400000steps.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d70ed1c03eada2e8616731292a885e9bbb8406f5859afee5003704725f23d876
+size 5918653
manifest/arctic_rms_parallel_wavegan.v1/stats.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3332906cb47d19988579ddb6c513a7f5fd3bb4ba3b1704c1327e11726a47cac8
+size 768
manifest/arctic_slt_parallel_wavegan.v1/.DS_Store
ADDED
Binary file (6.15 kB). View file
manifest/arctic_slt_parallel_wavegan.v1/config.yml
ADDED
@@ -0,0 +1,94 @@
+batch_max_steps: 15360
+batch_size: 10
+config: conf/parallel_wavegan.v1.yaml
+dev_dumpdir: dump/dev/norm
+discriminator_grad_norm: 1
+discriminator_optimizer_params:
+    eps: 1.0e-06
+    lr: 5.0e-05
+    weight_decay: 0.0
+discriminator_params:
+    bias: true
+    conv_channels: 64
+    in_channels: 1
+    kernel_size: 3
+    layers: 10
+    nonlinear_activation: LeakyReLU
+    nonlinear_activation_params:
+        negative_slope: 0.2
+    out_channels: 1
+    use_weight_norm: true
+discriminator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+discriminator_train_start_steps: 100000
+eval_interval_steps: 1000
+fft_size: 1024
+fmax: 7600
+fmin: 80
+format: npy
+# hdf5
+generator_grad_norm: 10
+generator_optimizer_params:
+    eps: 1.0e-06
+    lr: 0.0001
+    weight_decay: 0.0
+generator_params:
+    aux_channels: 80
+    aux_context_window: 2
+    dropout: 0.0
+    gate_channels: 128
+    in_channels: 1
+    kernel_size: 3
+    layers: 30
+    out_channels: 1
+    residual_channels: 64
+    skip_channels: 64
+    stacks: 3
+    upsample_net: ConvInUpsampleNetwork
+    upsample_params:
+        upsample_scales:
+        - 4
+        - 4
+        - 4
+        - 4
+    use_weight_norm: true
+generator_scheduler_params:
+    gamma: 0.5
+    step_size: 200000
+global_gain_scale: 1.0
+hop_size: 256
+lambda_adv: 4.0
+log_interval_steps: 100
+num_mels: 80
+num_save_intermediate_results: 4
+num_workers: 8
+outdir: exp/train_nodev_arctic_slt_parallel_wavegan.v1
+pin_memory: true
+remove_short_samples: true
+resume: exp/train_nodev_arctic_slt_parallel_wavegan.v1/checkpoint-300000steps.pkl
+sampling_rate: 16000
+save_interval_steps: 5000
+stft_loss_params:
+    fft_sizes:
+    - 1024
+    - 2048
+    - 512
+    hop_sizes:
+    - 120
+    - 240
+    - 50
+    win_lengths:
+    - 600
+    - 1200
+    - 240
+    window: hann_window
+train_dumpdir: dump/train_nodev/norm
+train_max_steps: 400000
+trim_frame_size: 2048
+trim_hop_size: 512
+trim_silence: false
+trim_threshold_in_db: 60
+verbose: 0
+win_length: null
+window: hann
manifest/arctic_slt_parallel_wavegan.v1/pwg-arctic-slt-400000steps.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:477686935b56f0eed684de9a31fb0f35600e4ce84b81e488c2b850fd07e630db
+size 5918525
manifest/arctic_slt_parallel_wavegan.v1/stats.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8af46bfcde0d79c2d3936e25fbc7b59fb5043f064fb9fa53cd2323c8ea64abe1
+size 768
manifest/dict.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:036438c7cb5fc860b1d1066a3b111542515b1d4ac1f5a79a15a2322e8f79f402
+size 309
manifest/test.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9126dfb852be724b1d595ea69dc2adf96eaf2dd5ee2fe113a30229de3539491c
+size 170418
manifest/train.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:067e049d317083e49ae22c7f5582a28253c1b24ba7988cb95b362eb1938e3553
+size 1588164
manifest/utils/cmu_arctic_manifest.py
ADDED
@@ -0,0 +1,90 @@
+import argparse
+import os
+
+from torchaudio.datasets import CMUARCTIC
+from tqdm import tqdm
+
+
+SPLITS = {
+    "train": list(range(   0,  932)),
+    "valid": list(range( 932, 1032)),
+    "test":  list(range(1032, 1132)),
+}
+
+
+def get_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "root", metavar="DIR", help="root directory containing wav files to index"
+    )
+    parser.add_argument(
+        "--dest", default=".", type=str, metavar="DIR", help="output directory"
+    )
+    parser.add_argument(
+        "--source", default="bdl,clb,slt,rms", type=str, help="Source voices from slt, clb, bdl, rms."
+    )
+    parser.add_argument(
+        "--target", default="bdl,clb,slt,rms", type=str, help="Target voices from slt, clb, bdl, rms."
+    )
+    parser.add_argument(
+        "--splits", default="932,100,100", type=str, help="Sizes of the train, valid, and test splits, separated by commas."
+    )
+    parser.add_argument(
+        "--wav-root", default=None, type=str, metavar="DIR", help="saved waveform root directory for the tsv"
+    )
+    parser.add_argument(
+        "--spkemb-npy-dir", required=True, type=str, help="speaker embedding directory"
+    )
+    return parser
+
+
+def main(args):
+    dest_dir = args.dest
+    wav_root = args.wav_root
+    if not os.path.exists(dest_dir):
+        os.makedirs(dest_dir)
+
+    source = args.source.split(",")
+    target = args.target.split(",")
+    spks = sorted(list(set(source + target)))
+    datasets = {}
+
+    datasets["slt"] = CMUARCTIC(args.root, url="slt", folder_in_archive="ARCTIC", download=False)
+    for spk in spks:
+        if spk != "slt":
+            datasets[spk] = CMUARCTIC(args.root, url=spk, folder_in_archive="ARCTIC", download=False)
+            datasets[spk]._walker = list(datasets["slt"]._walker)  # some text sentences are missing
+    if "slt" not in spks:
+        del datasets["slt"]
+
+    num_splits = [int(n_split) for n_split in args.splits.split(",")]
+    assert sum(num_splits) == 1132, f"Missing utterances: {sum(num_splits)} != 1132"
+
+    tsv = {}
+    for split in SPLITS.keys():
+        tsv[split] = open(os.path.join(dest_dir, f"{split}.tsv"), "w")
+        print(wav_root, file=tsv[split])
+
+    for split, indices in SPLITS.items():
+        for i in tqdm(indices, desc=f"[{'-'.join(spks)}]tsv/wav/spk"):
+            for src_spk in source:
+                for tgt_spk in target:
+                    if src_spk == tgt_spk:
+                        continue
+                    # wav, sample_rate, utterance, utt_no
+                    src_i = datasets[src_spk][i]
+                    tgt_i = datasets[tgt_spk][i]
+                    assert src_i[1] == tgt_i[1], f"{src_i[1]}-{tgt_i[1]}"
+                    assert src_i[3] == tgt_i[3], f"{src_i[3]}-{tgt_i[3]}"
+                    src_wav = os.path.join(os.path.basename(datasets[src_spk]._path), datasets[src_spk]._folder_audio, f"arctic_{src_i[3]}.wav")
+                    src_nframes = src_i[0].shape[-1]
+                    tgt_wav = os.path.join(os.path.basename(datasets[tgt_spk]._path), datasets[tgt_spk]._folder_audio, f"arctic_{tgt_i[3]}.wav")
+                    tgt_nframes = tgt_i[0].shape[-1]
+                    tgt_spkemb = os.path.join(args.spkemb_npy_dir, f"{os.path.basename(datasets[tgt_spk]._path)}-{datasets[tgt_spk]._folder_audio}-arctic_{tgt_i[3]}.npy")
+                    print(f"{src_wav}\t{src_nframes}\t{tgt_wav}\t{tgt_nframes}\t{tgt_spkemb}", file=tsv[split])
+    for split in tsv.keys():
+        tsv[split].close()
+
+
+if __name__ == "__main__":
+    parser = get_parser()
+    args = parser.parse_args()
+    main(args)
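For reference, a sketch (mine, not part of the commit) of consuming a tsv row that this script writes: after the first line (the waveform root), every row carries five tab-separated fields in the order printed above. The concrete frame counts below are hypothetical example values.

```python
# Parse one hypothetical data row of a generated tsv (fields as written by
# cmu_arctic_manifest.py): src_wav, src_nframes, tgt_wav, tgt_nframes, tgt_spkemb.
line = ("cmu_us_bdl_arctic/wav/arctic_a0001.wav\t52320\t"
        "cmu_us_clb_arctic/wav/arctic_a0001.wav\t55680\t"
        "spkrec-xvect/cmu_us_clb_arctic-wav-arctic_a0001.npy")

src_wav, src_nframes, tgt_wav, tgt_nframes, tgt_spkemb = line.split("\t")
src_nframes, tgt_nframes = int(src_nframes), int(tgt_nframes)

print(src_wav)      # cmu_us_bdl_arctic/wav/arctic_a0001.wav
print(tgt_spkemb)   # spkrec-xvect/cmu_us_clb_arctic-wav-arctic_a0001.npy
```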
manifest/utils/make_tsv.sh
ADDED
@@ -0,0 +1,10 @@
+#!/bin/bash
+# bash utils/make_tsv.sh /root/data/cmu_arctic/ /root/data/cmu_arctic/cmu_arctic_finetuning_meta /opt/tiger/ARCTIC
+root=$1
+dest=$2
+wav_root=$3
+spkemb_split=$4
+if [ -z ${spkemb_split} ]; then
+    spkemb_split=spkrec-xvect
+fi
+python utils/cmu_arctic_manifest.py ${root} --dest ${dest} --wav-root ${wav_root} --spkemb-npy-dir ${spkemb_split}
manifest/utils/prep_cmu_arctic_spkemb.py
ADDED
@@ -0,0 +1,68 @@
+import os
+import glob
+import numpy
+import argparse
+import torchaudio
+from speechbrain.pretrained import EncoderClassifier
+import torch
+from tqdm import tqdm
+import torch.nn.functional as F
+
+spk_model = {
+    "speechbrain/spkrec-xvect-voxceleb": 512,
+    "speechbrain/spkrec-ecapa-voxceleb": 192,
+}
+
+def f2embed(wav_file, classifier, size_embed):
+    signal, fs = torchaudio.load(wav_file)
+    assert fs == 16000, fs
+    with torch.no_grad():
+        embeddings = classifier.encode_batch(signal)
+        embeddings = F.normalize(embeddings, dim=2)
+        embeddings = embeddings.squeeze().cpu().numpy()
+    assert embeddings.shape[0] == size_embed, embeddings.shape[0]
+    return embeddings
+
+def process(args):
+    wavlst = []
+    for split in args.splits.split(","):
+        wav_dir = os.path.join(args.arctic_root, split)
+        wavlst_split = glob.glob(os.path.join(wav_dir, "wav", "*.wav"))
+        print(f"{split} {len(wavlst_split)} utterances.")
+        wavlst.extend(wavlst_split)
+
+    spkemb_root = args.output_root
+    if not os.path.exists(spkemb_root):
+        print(f"Create speaker embedding directory: {spkemb_root}")
+        os.mkdir(spkemb_root)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    classifier = EncoderClassifier.from_hparams(source=args.speaker_embed, run_opts={"device": device}, savedir=os.path.join('/tmp', args.speaker_embed))
+    size_embed = spk_model[args.speaker_embed]
+    for utt_i in tqdm(wavlst, total=len(wavlst), desc="Extract"):
+        # TODO rename speaker embedding
+        utt_id = "-".join(utt_i.split("/")[-3:]).replace(".wav", "")
+        utt_emb = f2embed(utt_i, classifier, size_embed)
+        numpy.save(os.path.join(spkemb_root, f"{utt_id}.npy"), utt_emb)
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--arctic-root", "-i", required=True, type=str, help="CMU ARCTIC root directory.")
+    parser.add_argument("--output-root", "-o", required=True, type=str, help="Output directory.")
+    parser.add_argument("--speaker-embed", "-s", type=str, required=True, choices=["speechbrain/spkrec-xvect-voxceleb", "speechbrain/spkrec-ecapa-voxceleb"],
+                        help="Pretrained model for extracting speaker embeddings.")
+    parser.add_argument("--splits", type=str, help="Splits of the four speakers, separated by commas.",
+                        default="cmu_us_bdl_arctic,cmu_us_clb_arctic,cmu_us_rms_arctic,cmu_us_slt_arctic")
+    args = parser.parse_args()
+    print(f"Loading utterances from {args.arctic_root}/{args.splits}, "
+          + f"Save speaker embedding 'npy' to {args.output_root}, "
+          + f"Using speaker model {args.speaker_embed} with {spk_model[args.speaker_embed]} size.")
+    process(args)
+
+if __name__ == "__main__":
+    """
+    python utils/prep_cmu_arctic_spkemb.py \
+        -i /root/data/cmu_arctic/CMUARCTIC \
+        -o /root/data/cmu_arctic/CMUARCTIC/spkrec-xvect \
+        -s speechbrain/spkrec-xvect-voxceleb
+    """
+    main()
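The `utt_id` naming scheme used by this script, shown on a hypothetical path: the last three path components are joined with `-` and the `.wav` suffix is dropped, which is exactly the `{dataset}-{wav_folder}-arctic_{utt_no}.npy` name that `cmu_arctic_manifest.py` writes into the tsv.

```python
# Reproduce the utt_id construction from prep_cmu_arctic_spkemb.py on a
# hypothetical wav path.
path = "/root/data/cmu_arctic/CMUARCTIC/cmu_us_slt_arctic/wav/arctic_a0001.wav"
utt_id = "-".join(path.split("/")[-3:]).replace(".wav", "")
print(utt_id)  # cmu_us_slt_arctic-wav-arctic_a0001
```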
manifest/utils/spec2wav.sh
ADDED
File without changes
manifest/valid.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a0d3fc2569593894864f881f2027c46b9ea39fcb01f0e6cdbacc8213dfa8dd6f
+size 170418