Model Description
Pre-Training Settings:
166k samples from Common Voice 13.0 were recognized by Whisper tiny.en.
1,000 random samples were selected as the test set, and the rest were used for training and validation with an 80%/20% split (see the data-preparation sketch below).
Batch size: 256
Initial learning rate: 1e-5
Adam optimizer
30 epochs
Cross-entropy loss
Best checkpoint saved based on WER as the evaluation metric
Decoding is performed using beam search with a beam size of 5 (see the training sketch below)
S2S backbone model adopted from "Exploring data augmentation for code generation tasks".
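
A minimal sketch of the data preparation described above, assuming the Common Voice 13.0 clips are available locally and using the openai-whisper package; the `load_common_voice` helper and file layout are hypothetical placeholders, not part of the released code.

```python
# Hedged sketch of the data preparation: transcribe Common Voice 13.0 clips with
# Whisper tiny.en, hold out 1,000 random test samples, and split the rest 80/20.
# `load_common_voice` is a hypothetical loader returning (audio_path, gold_text) pairs.
import random
import whisper

def prepare_data(samples):
    model = whisper.load_model("tiny.en")
    # Pair each Whisper hypothesis with its gold transcript for correction training.
    pairs = [(model.transcribe(path)["text"].strip(), gold) for path, gold in samples]

    random.seed(42)
    random.shuffle(pairs)
    test = pairs[:1000]                 # 1,000 random samples held out as the test set
    rest = pairs[1000:]
    cut = int(0.8 * len(rest))          # remaining data: 80% train / 20% validation
    return rest[:cut], rest[cut:], test

# train, valid, test = prepare_data(load_common_voice("cv-corpus-13.0"))
```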
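
A PyTorch-style sketch of the pre-training loop under the hyperparameters listed above; `model`, the dataloaders, `pad_id`, `compute_wer`, and the `generate` beam-search interface are assumed placeholders for components not shown in this section.

```python
# Hedged sketch of the pre-training loop: Adam, initial LR 1e-5, cross-entropy loss,
# 30 epochs, batch size 256, best checkpoint kept by validation WER.
import torch

def pretrain(model, train_loader, valid_loader, pad_id, compute_wer):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=pad_id)

    best_wer = float("inf")
    for epoch in range(30):
        model.train()
        for src, tgt in train_loader:            # batches of 256 hypothesis/gold pairs
            logits = model(src, tgt[:, :-1])     # teacher-forced decoding
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Decode the validation set with beam search (beam size 5) and track WER.
        model.eval()
        with torch.no_grad():
            hyps, refs = [], []
            for src, tgt in valid_loader:
                hyps.extend(model.generate(src, num_beams=5))  # assumed beam-search API
                refs.extend(tgt)
            wer = compute_wer(hyps, refs)

        if wer < best_wer:                       # save the best checkpoint by WER
            best_wer = wer
            torch.save(model.state_dict(), "best_checkpoint.pt")
    return best_wer
```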
Continue-Training Setting:
- 2 epochs on gold-gold pairs to prevent the over-correction problem on TED talk data (see the sketch below)
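
A short sketch of the continue-training stage, assuming "gold-gold" denotes pairs in which the gold transcript serves as both source and target, so the model learns to leave already-correct input untouched; `make_loader` and the reuse of the pre-training optimizer and loss are assumptions, not the authors' code.

```python
# Hedged sketch of continue-training on gold-gold pairs for only 2 epochs,
# to limit over-correction on TED talk data; `make_loader` is a placeholder.
def continue_train(model, optimizer, criterion, gold_texts, make_loader):
    gold_gold_loader = make_loader([(g, g) for g in gold_texts], batch_size=256)
    for epoch in range(2):                       # only 2 epochs
        model.train()
        for src, tgt in gold_gold_loader:
            logits = model(src, tgt[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```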