---
tags:
- espnet
- audio
- speech-enhancement-recognition
language: en
datasets:
- chime4
license: cc-by-4.0
---
## ESPnet2 EnhS2T model |
### `Yoshiki/chime4_enh_asr1_wpd_wavlm_conformer` |
This model was trained by Yoshiki using the chime4 recipe in [espnet](https://github.com/espnet/espnet/).
### Demo: How to use in ESPnet2 |
Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html) |
if you haven't done that already. |
```bash
cd espnet
# Pin the ESPnet revision given for this model
git checkout 8ed83f45d5aa2ca6b3635e44b9c29afb9b5fb600
pip install -e .
cd egs2/chime4/enh_asr1
# Skip training and download the pretrained model instead
./run.sh --skip_data_prep false --skip_train true --download_model Yoshiki/chime4_enh_asr1_wpd_wavlm_conformer
```
<!-- Generated by scripts/utils/show_asr_result.sh --> |
## Environments |
- date: `Tue Oct 11 02:40:53 UTC 2022` |
- python version: `3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]` |
- espnet version: `espnet 202207` |
- pytorch version: `pytorch 1.10.1+cu111` |
- Git hash: `` |
- Commit date: `` |
## enh_asr_train_enh_asr_wpd_init_noenhloss_wavlm_conformer_raw_en_char |
### WER |
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| |
|---|---|---|---|---|---|---|---|---| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_real_isolated_6ch_track|1640|27119|98.8|0.9|0.2|0.2|1.3|16.2| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_simu_isolated_6ch_track|1640|27120|98.9|0.9|0.2|0.1|1.3|15.2| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_real_isolated_6ch_track|1320|21409|98.4|1.4|0.2|0.2|1.8|20.6| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_simu_isolated_6ch_track|1320|21416|98.9|1.0|0.2|0.1|1.2|15.2| |
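In the table above, `Err` is the sum of the `Sub`, `Del` and `Ins` percentages, up to rounding (for example, 0.9 + 0.2 + 0.2 = 1.3 for `dt05_real`). The following is a minimal sketch of how such counts can be derived from a reference and a hypothesis with a standard edit-distance alignment. It is not the scoring tool used to produce these numbers, and the function names `edit_ops` and `error_rate` are hypothetical.

```python
from typing import Sequence, Tuple


def edit_ops(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of one minimum-cost alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dp[i][j] = (total cost, subs, dels, ins) for ref[:i] aligned with hyp[:j]
    dp = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, s, d, n)          # correct token
            else:
                diagonal = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Tuples compare by total cost first, so the minimum cost always wins.
            dp[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Error rate in percent: 100 * (Sub + Del + Ins) / number of reference tokens."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    subs, dels, ins = edit_ops(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sit on mat".split()
    print(edit_ops(reference, hypothesis), error_rate(reference, hypothesis))
```

Passing word lists gives a word-level rate; passing `list(text)` instead gives a character-level rate computed the same way.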
### CER |
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| |
|---|---|---|---|---|---|---|---|---| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_real_isolated_6ch_track|1640|160390|99.7|0.1|0.2|0.2|0.5|16.2| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_simu_isolated_6ch_track|1640|160400|99.7|0.1|0.2|0.1|0.5|15.2| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_real_isolated_6ch_track|1320|126796|99.5|0.2|0.3|0.2|0.7|20.6| |
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_simu_isolated_6ch_track|1320|126812|99.7|0.2|0.2|0.1|0.5|15.2| |
### TER |
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| |
|---|---|---|---|---|---|---|---|---| |
## EnhS2T config |
<details><summary>expand</summary> |
``` |
config: conf/tuning/train_enh_asr_wpd_init_noenhloss_wavlm_conformer.yaml |
print_config: false |
log_level: INFO |
dry_run: false |
iterator_type: sequence |
output_dir: exp/enh_asr_train_enh_asr_wpd_init_noenhloss_wavlm_conformer_raw_en_char |
ngpu: 1 |
seed: 0 |
num_workers: 1 |
num_att_plot: 3 |
dist_backend: nccl |
dist_init_method: env:// |
dist_world_size: null |
dist_rank: null |
local_rank: 0 |
dist_master_addr: null |
dist_master_port: null |
dist_launcher: null |
multiprocessing_distributed: false |
unused_parameters: true |
sharded_ddp: false |
cudnn_enabled: true |
cudnn_benchmark: false |
cudnn_deterministic: true |
collect_stats: false |
write_collected_feats: false |
max_epoch: 31 |
patience: 10 |
val_scheduler_criterion: |
- valid |
- loss |
early_stopping_criterion: |
- valid |
- loss |
- min |
best_model_criterion: |
- - valid |
- acc |
- max |
- - train |
- loss |
- min |
keep_nbest_models: 10 |
nbest_averaging_interval: 0 |
grad_clip: 1 |
grad_clip_type: 2.0 |
grad_noise: false |
accum_grad: 2 |
no_forward_run: false |
resume: true |
train_dtype: float32 |
use_amp: false |
log_interval: null |
use_matplotlib: true |
use_tensorboard: true |
create_graph_in_tensorboard: false |
use_wandb: false |
wandb_project: null |
wandb_id: null |
wandb_entity: null |
wandb_name: null |
wandb_model_log_interval: -1 |
detect_anomaly: false |
pretrain_path: null |
init_param: |
- ../enh1/exp/enh_train_enh_beamformer_wpd_ci_sdr_shorttap_raw/valid.loss.best.pth:separator:enh_model.separator |
- ../asr1/exp/asr_train_asr_conformer_wavlm2_raw_en_char/valid.acc.best.pth:frontend:s2t_model.frontend |
- ../asr1/exp/asr_train_asr_conformer_wavlm2_raw_en_char/valid.acc.best.pth:preencoder:s2t_model.preencoder |
- ../asr1/exp/asr_train_asr_conformer_wavlm2_raw_en_char/valid.acc.best.pth:encoder:s2t_model.encoder |
- ../asr1/exp/asr_train_asr_conformer_wavlm2_raw_en_char/valid.acc.best.pth:ctc:s2t_model.ctc |
- ../asr1/exp/asr_train_asr_conformer_wavlm2_raw_en_char/valid.acc.best.pth:decoder:s2t_model.decoder |
ignore_init_mismatch: false |
freeze_param: |
- s2t_model.frontend.upstream |
num_iters_per_epoch: null |
batch_size: 16 |
valid_batch_size: null |
batch_bins: 1000000 |
valid_batch_bins: null |
train_shape_file: |
- exp/enh_asr_stats_raw_en_char/train/speech_shape |
- exp/enh_asr_stats_raw_en_char/train/speech_ref1_shape |
- exp/enh_asr_stats_raw_en_char/train/text_spk1_shape.char |
valid_shape_file: |
- exp/enh_asr_stats_raw_en_char/valid/speech_shape |
- exp/enh_asr_stats_raw_en_char/valid/speech_ref1_shape |
- exp/enh_asr_stats_raw_en_char/valid/text_spk1_shape.char |
batch_type: folded |
valid_batch_type: null |
fold_length: |
- 80000 |
- 80000 |
- 150 |
sort_in_batch: descending |
sort_batch: descending |
multiple_iterator: false |
chunk_length: 500 |
chunk_shift_ratio: 0.5 |
num_cache_chunks: 1024 |
train_data_path_and_name_and_type: |
- - dump/raw/tr05_multi_isolated_6ch_track/wav.scp |
- speech |
- sound |
- - dump/raw/tr05_multi_isolated_6ch_track/spk1.scp |
- speech_ref1 |
- sound |
- - dump/raw/tr05_multi_isolated_6ch_track/text_spk1 |
- text_spk1 |
- text |
valid_data_path_and_name_and_type: |
- - dump/raw/dt05_multi_isolated_6ch_track/wav.scp |
- speech |
- sound |
- - dump/raw/dt05_multi_isolated_6ch_track/spk1.scp |
- speech_ref1 |
- sound |
- - dump/raw/dt05_multi_isolated_6ch_track/text_spk1 |
- text_spk1 |
- text |
allow_variable_data_keys: false |
max_cache_size: 0.0 |
max_cache_fd: 32 |
valid_max_cache_size: null |
optim: sgd |
optim_conf: |
lr: 0.001 |
momentum: 0.9 |
scheduler: null |
scheduler_conf: {} |
token_list: data/en_token_list/char/tokens.txt |
src_token_list: null |
init: xavier_uniform |
input_size: null |
ctc_conf: |
dropout_rate: 0.0 |
ctc_type: builtin |
reduce: true |
ignore_nan_grad: null |
zero_infinity: true |
enh_criterions: |
- name: ci_sdr |
conf: |
filter_length: 512 |
wrapper: fixed_order |
wrapper_conf: |
weight: 0.1 |
diar_num_spk: null |
diar_input_size: null |
enh_model_conf: |
stft_consistency: false |
loss_type: mask_mse |
mask_type: null |
asr_model_conf: |
ctc_weight: 0.3 |
lsm_weight: 0.1 |
length_normalized_loss: false |
extract_feats_in_collect_stats: false |
st_model_conf: |
stft_consistency: false |
loss_type: mask_mse |
mask_type: null |
diar_model_conf: |
diar_weight: 1.0 |
attractor_weight: 1.0 |
subtask_series: |
- enh |
- asr |
model_conf: |
calc_enh_loss: false |
bypass_enh_prob: 0.0 |
use_preprocessor: true |
token_type: char |
bpemodel: null |
src_token_type: bpe |
src_bpemodel: null |
non_linguistic_symbols: data/nlsyms.txt |
cleaner: null |
g2p: null |
text_name: |
- text_spk1 |
enh_encoder: stft |
enh_encoder_conf: |
n_fft: 512 |
win_length: 400 |
hop_length: 128 |
use_builtin_complex: false |
enh_separator: wpe_beamformer |
enh_separator_conf: |
num_spk: 1 |
loss_type: spectrum |
use_wpe: false |
wnet_type: blstmp |
wlayers: 3 |
wunits: 512 |
wprojs: 512 |
wdropout_rate: 0.0 |
taps: 3 |
delay: 3 |
use_dnn_mask_for_wpe: true |
use_beamformer: true |
bnet_type: blstmp |
blayers: 3 |
bunits: 512 |
bprojs: 512 |
badim: 320 |
ref_channel: 4 |
use_noise_mask: true |
beamformer_type: wpd_souden |
bdropout_rate: 0.0 |
enh_decoder: stft |
enh_decoder_conf: |
n_fft: 512 |
win_length: 400 |
hop_length: 128 |
enh_mask_module: multi_mask |
enh_mask_module_conf: {} |
frontend: s3prl |
frontend_conf: |
frontend_conf: |
upstream: wavlm_large |
download_dir: ./hub |
multilayer_feature: true |
fs: 16k |
specaug: specaug |
specaug_conf: |
apply_time_warp: true |
time_warp_window: 5 |
time_warp_mode: bicubic |
apply_freq_mask: true |
freq_mask_width_range: |
- 0 |
- 100 |
num_freq_mask: 4 |
apply_time_mask: true |
time_mask_width_range: |
- 0 |
- 40 |
num_time_mask: 2 |
normalize: utterance_mvn |
normalize_conf: {} |
asr_preencoder: linear |
asr_preencoder_conf: |
input_size: 1024 |
output_size: 80 |
asr_encoder: conformer |
asr_encoder_conf: |
output_size: 256 |
attention_heads: 4 |
linear_units: 2048 |
num_blocks: 12 |
dropout_rate: 0.1 |
positional_dropout_rate: 0.1 |
attention_dropout_rate: 0.0 |
input_layer: conv2d2 |
normalize_before: true |
macaron_style: true |
pos_enc_layer_type: rel_pos |
selfattention_layer_type: rel_selfattn |
activation_type: swish |
use_cnn_module: true |
cnn_module_kernel: 15 |
asr_postencoder: null |
asr_postencoder_conf: {} |
asr_decoder: transformer |
asr_decoder_conf: |
input_layer: embed |
attention_heads: 4 |
linear_units: 2048 |
num_blocks: 6 |
dropout_rate: 0.1 |
positional_dropout_rate: 0.1 |
self_attention_dropout_rate: 0.0 |
src_attention_dropout_rate: 0.0 |
st_preencoder: null |
st_preencoder_conf: {} |
st_encoder: rnn |
st_encoder_conf: {} |
st_postencoder: null |
st_postencoder_conf: {} |
st_decoder: rnn |
st_decoder_conf: {} |
st_extra_asr_decoder: rnn |
st_extra_asr_decoder_conf: {} |
st_extra_mt_decoder: rnn |
st_extra_mt_decoder_conf: {} |
diar_frontend: default |
diar_frontend_conf: {} |
diar_specaug: null |
diar_specaug_conf: {} |
diar_normalize: utterance_mvn |
diar_normalize_conf: {} |
diar_encoder: transformer |
diar_encoder_conf: {} |
diar_decoder: linear |
diar_decoder_conf: {} |
label_aggregator: label_aggregator |
label_aggregator_conf: {} |
diar_attractor: null |
diar_attractor_conf: {} |
required: |
- output_dir |
version: '202207' |
distributed: false |
``` |
</details> |
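Each `init_param` entry in the config above follows ESPnet2's `<file_path>:<src_key>:<dst_key>:<exclude_keys>` format, where trailing fields may be omitted. Here the separator is initialized from an `enh1` checkpoint and the ASR modules from an `asr1` checkpoint. The sketch below only splits such a string into its fields to make the mapping explicit; `InitParam` and `parse_init_param` are hypothetical names, not ESPnet functions, and treating the fourth field as a comma-separated list is an assumption.

```python
from typing import List, NamedTuple, Optional


class InitParam(NamedTuple):
    path: str                # checkpoint file to load
    src_key: Optional[str]   # sub-module prefix inside the checkpoint
    dst_key: Optional[str]   # attribute of the new model to initialize
    exclude_keys: List[str]  # state keys to skip (assumed comma-separated)


def parse_init_param(spec: str) -> InitParam:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))  # pad the omitted trailing fields
    path, src_key, dst_key, excludes = parts
    return InitParam(
        path=path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        exclude_keys=[key for key in excludes.split(",") if key],
    )


if __name__ == "__main__":
    spec = (
        "../enh1/exp/enh_train_enh_beamformer_wpd_ci_sdr_shorttap_raw/"
        "valid.loss.best.pth:separator:enh_model.separator"
    )
    print(parse_init_param(spec))
```

Read this way, the first entry copies the `separator` weights of the enhancement checkpoint into `enh_model.separator`, and the remaining entries copy `frontend`, `preencoder`, `encoder`, `ctc` and `decoder` into the corresponding `s2t_model` attributes.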
### Citing ESPnet |
```bibtex
@inproceedings{watanabe2018espnet, |
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai}, |
title={{ESPnet}: End-to-End Speech Processing Toolkit}, |
year={2018}, |
booktitle={Proceedings of Interspeech}, |
pages={2207--2211}, |
doi={10.21437/Interspeech.2018-1456}, |
url={http://dx.doi.org/10.21437/Interspeech.2018-1456} |
} |
``` |
or arXiv: |
```bibtex |
@misc{watanabe2018espnet, |
title={ESPnet: End-to-End Speech Processing Toolkit}, |
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai}, |
year={2018}, |
eprint={1804.00015}, |
archivePrefix={arXiv}, |
primaryClass={cs.CL} |
} |
``` |