# CLI

## 0. Install and global path settings

```bash
git clone https://github.com/litagin02/Style-Bert-VITS2.git
cd Style-Bert-VITS2
python -m venv venv
venv\Scripts\activate
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

The `venv\Scripts\activate` command above is for Windows; on Linux or macOS, use `source venv/bin/activate` instead.

Then download the necessary models and the default TTS model, and set the global paths.
```bash
python initialize.py [--skip_jvnv] [--dataset_root <path>] [--assets_root <path>]
```

Optional:
- `--skip_jvnv`: Skip downloading the default JVNV voice models (use this if you only want to train your own models).
- `--dataset_root`: Default: `Data`. Root directory of the training dataset. The training dataset of `{model_name}` should be placed in `{dataset_root}/{model_name}`.
- `--assets_root`: Default: `model_assets`. Root directory of the model assets (for inference). In training, the model assets will be saved to `{assets_root}/{model_name}`, and in inference, we load all the models from `{assets_root}`.
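
To make the path conventions above concrete, here is a minimal sketch that resolves where the data and assets for one model would live. The speaker name `my_speaker` is a hypothetical placeholder; the two root values are the documented defaults.

```bash
# Documented defaults for the two global roots.
DATASET_ROOT="Data"
ASSETS_ROOT="model_assets"
# Hypothetical model name, used only for illustration.
MODEL_NAME="my_speaker"

# {dataset_root}/{model_name}: where the training dataset is expected.
DATASET_DIR="${DATASET_ROOT}/${MODEL_NAME}"
# {assets_root}/{model_name}: where training saves the model assets.
ASSETS_DIR="${ASSETS_ROOT}/${MODEL_NAME}"

echo "training data: ${DATASET_DIR}"
echo "model assets:  ${ASSETS_DIR}"
```

With these values, the matching setup call would be `python initialize.py --dataset_root Data --assets_root model_assets`, run from the repository root.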


## 1. Dataset preparation

### 1.1. Slice wavs
```bash
python slice.py --model_name <model_name> [-i <input_dir>] [-m <min_sec>] [-M <max_sec>]
```

Required:
- `--model_name`: Name of the speaker (to be used as the name of the trained model).

Optional:
- `-i <input_dir>`: Path to the directory containing the audio files to slice (default: `inputs`).
- `-m <min_sec>`: Minimum duration of the sliced audio files in seconds (default: 2).
- `-M <max_sec>`: Maximum duration of the sliced audio files in seconds (default: 12).
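
As an illustration, the following sketch assembles a `slice.py` call with every option spelled out. The speaker name is a hypothetical placeholder and the other values are simply the documented defaults; the snippet only prints the command so it can be reviewed first.

```bash
# Hypothetical speaker name; the remaining values are the documented defaults.
MODEL_NAME="my_speaker"
INPUT_DIR="inputs"
MIN_SEC=2
MAX_SEC=12

# Build the command as positional parameters so each argument stays one word.
set -- python slice.py --model_name "$MODEL_NAME" -i "$INPUT_DIR" -m "$MIN_SEC" -M "$MAX_SEC"

# Print the command for review.
echo "$@"
```

To actually run the command, execute `"$@"` from the repository root instead of echoing it.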

### 1.2. Transcribe wavs

```bash
python transcribe.py --model_name <model_name>
```
Required:
- `--model_name`: Name of the speaker (to be used as the name of the trained model).

Optional:
- `--initial_prompt`: Initial prompt to use for the transcription (default value is specific to Japanese).
- `--device`: `cuda` or `cpu` (default: `cuda`).
- `--language`: `jp`, `en`, or `zh` (default: `jp`).
- `--model`: Whisper model (default: `large-v3`).
- `--compute_type`: Compute type (default: `bfloat16`).
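
The sketch below shows one way to pick the `--device` value before transcribing English audio. Checking for `nvidia-smi` on the `PATH` is a heuristic assumed here for illustration, not something the project does, and `my_speaker` is a hypothetical name.

```bash
# Hypothetical speaker name.
MODEL_NAME="my_speaker"

# Heuristic (an assumption, not part of the project): treat the presence of
# nvidia-smi on PATH as a sign that a CUDA device is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  DEVICE="cuda"
else
  DEVICE="cpu"
fi

# Build the command as positional parameters and print it for review.
set -- python transcribe.py --model_name "$MODEL_NAME" --language en --device "$DEVICE"
echo "$@"
```

The remaining options (`--initial_prompt`, `--model`, `--compute_type`) are left at their defaults in this sketch.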

## 2. Preprocess

```bash
python preprocess_all.py -m <model_name> [--use_jp_extra] [-b <batch_size>] [-e <epochs>] [-s <save_every_steps>] [--num_processes <num_processes>] [--normalize] [--trim] [--val_per_lang <val_per_lang>] [--log_interval <log_interval>] [--freeze_EN_bert] [--freeze_JP_bert] [--freeze_ZH_bert] [--freeze_style] [--freeze_decoder]
```

Required:
- `-m <model_name>`: Name of the speaker (to be used as the name of the trained model).

Optional:
- `--batch_size`, `-b`: Batch size (default: 2).
- `--epochs`, `-e`: Number of epochs (default: 100).
- `--save_every_steps`, `-s`: Number of steps between saves (default: 1000).
- `--num_processes`: Number of processes (default: half of the number of CPU cores).
- `--normalize`: Apply loudness normalization to the audio.
- `--trim`: Trim silence.
- `--freeze_EN_bert`: Freeze English BERT.
- `--freeze_JP_bert`: Freeze Japanese BERT.
- `--freeze_ZH_bert`: Freeze Chinese BERT.
- `--freeze_style`: Freeze style vector.
- `--freeze_decoder`: Freeze decoder.
- `--use_jp_extra`: Use JP-Extra model.
- `--val_per_lang`: Number of validation samples per language (default: 0).
- `--log_interval`: Log interval (default: 200).
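
As a sketch of how these options combine, the snippet below builds a `preprocess_all.py` call and appends `--use_jp_extra` only when a switch is set. The variable names, the speaker name, and the batch size of 4 are hypothetical choices for illustration; the flags themselves are the ones listed above.

```bash
MODEL_NAME="my_speaker"   # hypothetical speaker name
BATCH_SIZE=4              # hypothetical; the documented default is 2
EPOCHS=100                # documented default
SAVE_EVERY=1000           # documented default
USE_JP_EXTRA=1            # hypothetical switch: 1 adds --use_jp_extra

# Base command with loudness normalization and silence trimming enabled.
set -- python preprocess_all.py -m "$MODEL_NAME" -b "$BATCH_SIZE" -e "$EPOCHS" -s "$SAVE_EVERY" --normalize --trim

# Append the JP-Extra flag only when requested.
if [ "$USE_JP_EXTRA" -eq 1 ]; then
  set -- "$@" --use_jp_extra
fi

# Print the command for review.
echo "$@"
```

To actually run the command, execute `"$@"` from the repository root instead of echoing it.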

## 3. Train

Training settings are loaded automatically from the preprocessing step above.

If NOT using JP-Extra model:
```bash
python train_ms.py [--repo_id <username>/<repo_name>]
```

If using JP-Extra model:
```bash
python train_ms_jp_extra.py [--repo_id <username>/<repo_name>] [--skip_default_style]
```

Optional:
- `--repo_id`: Hugging Face repository ID to upload the trained model to. You should have logged in using `huggingface-cli login` before running this command.
- `--skip_default_style`: Skip making the default style vector. Use this if you want to resume training (since the default style vector is already made).
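
The choice between the two training scripts can be sketched as a small wrapper. The variables `USE_JP_EXTRA`, `RESUME`, and `REPO_ID` are hypothetical names introduced here; only the script names and flags come from the commands above.

```bash
USE_JP_EXTRA=1   # hypothetical switch: 1 if preprocessing used --use_jp_extra
RESUME=0         # hypothetical switch: 1 when resuming an earlier run
REPO_ID=""       # "<username>/<repo_name>", or empty to skip the upload

if [ "$USE_JP_EXTRA" -eq 1 ]; then
  set -- python train_ms_jp_extra.py
  # When resuming, the default style vector already exists, so skip making it.
  if [ "$RESUME" -eq 1 ]; then
    set -- "$@" --skip_default_style
  fi
else
  set -- python train_ms.py
fi

# Upload to Hugging Face only when a repository ID is given.
if [ -n "$REPO_ID" ]; then
  set -- "$@" --repo_id "$REPO_ID"
fi

# Print the command for review.
echo "$@"
```

With the values shown, this prints `python train_ms_jp_extra.py`. To start training, execute `"$@"` from the repository root instead of echoing it.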