ComSpeech

Authors: Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng*

Code for ACL 2024 paper "Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?".

🎧 Listen to ComSpeech's translated speech 🎧

💡 Highlights

ComSpeech is a general composite S2ST model architecture, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed.
With our proposed training strategy ComSpeech-ZS, we achieve performance comparable to supervised training without using any parallel speech data.

🔥 Quick Start

Requirements

python==3.8, torch==2.1.2
Install fairseq:
```
cd fairseq
pip install -e .
```

Data Preparation

This section is under construction and will be updated within 3 days.

ComSpeech (Supervised Learning)

The following scripts use 4 RTX 3090 GPUs by default. You can adjust --update-freq, --max-tokens-st, --max-tokens, and --batch-size-tts depending on your available GPUs.
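As a rough guide for that adjustment: in fairseq, `--update-freq` accumulates gradients over several batches, so the effective batch size scales with (number of GPUs) × (`--update-freq`) as long as the per-GPU limits stay unchanged. The sketch below computes an `--update-freq` that keeps that product constant on a different GPU count. The helper name `scaled_update_freq` and the example value of 2 are assumptions for illustration, not values taken from the provided scripts.

```shell
#!/usr/bin/env bash
# Scale --update-freq so that (GPU count x update-freq) stays constant.
# Arguments: reference GPU count, reference update-freq, available GPU count.
scaled_update_freq() {
    local ref_gpus="$1" ref_freq="$2" new_gpus="$3"
    local total=$(( ref_gpus * ref_freq ))
    # Round up so the effective batch never drops below the reference.
    echo $(( (total + new_gpus - 1) / new_gpus ))
}

# Example: a script tuned for 4 GPUs with a hypothetical --update-freq of 2,
# now run on a single GPU.
scaled_update_freq 4 2 1
```

If GPU memory is also smaller than on an RTX 3090, the per-GPU limits (`--max-tokens-st`, `--max-tokens`, `--batch-size-tts`) would need to shrink as well, with `--update-freq` raised further to compensate.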

In the supervised learning scenario, we first pretrain the S2TT and TTS models on the S2TT data and TTS data respectively, and then finetune the entire model on the S2ST data. The following scripts are an example for the CVSS Fr-En dataset. For the De-En and Es-En directions, you only need to change the source language in the scripts.

Pretrain the S2TT model, and the best checkpoint will be saved at ComSpeech/checkpoints/st.cvss.fr-en/checkpoint_best.pt.

bash ComSpeech/train_scripts/st/train.st.cvss.fr-en.sh

Pretrain the TTS model, and the best checkpoint will be saved at ComSpeech/checkpoints/tts.fastspeech2.cvss-fr-en/checkpoint_best.pt.

bash ComSpeech/train_scripts/tts/train.tts.fastspeech2.cvss-fr-en.sh

Finetune the entire model using the S2ST data, and the checkpoints will be saved at ComSpeech/checkpoints/s2st.fr-en.comspeech.

bash ComSpeech/train_scripts/s2st/train.s2st.fr-en.comspeech.sh

Average the 5 best checkpoints and evaluate the averaged model on the test set.

bash ComSpeech/test_scripts/generate.fr-en.comspeech.sh

To run inference, you need to download the pretrained HiFi-GAN vocoder from this link and place it in the hifi-gan/ directory.
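The four supervised steps above can be chained in a small driver script. The following is a minimal sketch and not part of the repository: `run_step` and `supervised_fr_en` are hypothetical helper names, and it assumes you launch it from the directory that contains both ComSpeech/ and hifi-gan/. It stops at the first failing step and checks that each pretraining step produced the checkpoint named above.

```shell
#!/usr/bin/env bash
# Run one training script, then confirm that the expected checkpoint exists.
run_step() {
    local script="$1" expected="$2"
    bash "$script" || { echo "failed: $script" >&2; return 1; }
    [ -e "$expected" ] || { echo "missing: $expected" >&2; return 1; }
}

# Supervised Fr-En pipeline, using the script and checkpoint paths listed above.
supervised_fr_en() {
    run_step ComSpeech/train_scripts/st/train.st.cvss.fr-en.sh \
        ComSpeech/checkpoints/st.cvss.fr-en/checkpoint_best.pt || return 1
    run_step ComSpeech/train_scripts/tts/train.tts.fastspeech2.cvss-fr-en.sh \
        ComSpeech/checkpoints/tts.fastspeech2.cvss-fr-en/checkpoint_best.pt || return 1
    bash ComSpeech/train_scripts/s2st/train.s2st.fr-en.comspeech.sh || return 1
    # Inference needs the HiFi-GAN vocoder; this only checks that the
    # directory is non-empty (its assumed location is ./hifi-gan).
    [ -n "$(ls -A hifi-gan 2>/dev/null)" ] || { echo "hifi-gan/ is empty" >&2; return 1; }
    bash ComSpeech/test_scripts/generate.fr-en.comspeech.sh
}

# Uncomment to run the whole pipeline:
# supervised_fr_en
```

Checking for the checkpoint after each pretraining step catches a script that exits cleanly without writing anything, which would otherwise only surface when the finetuning step tries to load it.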

ComSpeech-ZS (Zero-shot Learning)

In the zero-shot learning scenario, we first pretrain the S2TT model using CVSS Fr/De/Es-En S2TT data, and pretrain the TTS model using CVSS X-En TTS (X∉{Fr,De,Es}) data. Then, we finetune the entire model in two stages using these two parts of the data.

Pretrain the S2TT model, and the best checkpoint will be saved at ComSpeech/checkpoints/st.cvss.fr-en/checkpoint_best.pt.

bash ComSpeech/train_scripts/st/train.st.cvss.fr-en.sh

Pretrain the TTS model, and the best checkpoint will be saved at ComSpeech/checkpoints/tts.fastspeech2.cvss-x-en/checkpoint_best.pt (note: this checkpoint is used for experiments on all language pairs in the zero-shot learning scenario).

bash ComSpeech/train_scripts/tts/train.tts.fastspeech2.cvss-x-en.sh

Finetune the S2TT model and the vocabulary adaptor using S2TT data (stage 1), and the best checkpoint will be saved at ComSpeech/checkpoints/st.cvss.fr-en.ctc/checkpoint_best.pt.

bash ComSpeech/train_scripts/st/train.st.cvss.fr-en.ctc.sh

Finetune the entire model using both S2TT and TTS data (stage 2), and the checkpoints will be saved at ComSpeech/checkpoints/s2st.fr-en.comspeech-zs.

bash ComSpeech/train_scripts/s2st/train.s2st.fr-en.comspeech-zs.sh

Average the 5 best checkpoints and evaluate the averaged model on the test set.

bash ComSpeech/test_scripts/generate.fr-en.comspeech-zs.sh
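For readers who want to see what the averaging step involves: fairseq ships `scripts/average_checkpoints.py`, which takes a list of checkpoints via `--inputs` and writes their parameter-wise average to `--output`. The sketch below only prints such a command for inspection. It rests on several assumptions: that the best checkpoints are stored as `checkpoint.best_*.pt` (the naming fairseq uses with `--keep-best-checkpoints`), that the bundled fairseq lives at `fairseq/`, and the helper name `average_best_cmd` is hypothetical. Whether the provided generate script averages checkpoints in exactly this way is not confirmed here.

```shell
#!/usr/bin/env bash
# Print (do not run) an average_checkpoints.py command covering every
# checkpoint.best_*.pt file in a directory.
average_best_cmd() {
    local ckpt_dir="$1" output="$2"
    # Expand the glob into the positional parameters of this function.
    set -- "$ckpt_dir"/checkpoint.best_*.pt
    # An unmatched glob stays literal, so check that the first entry exists.
    if [ ! -e "$1" ]; then
        echo "no checkpoint.best_*.pt files in $ckpt_dir" >&2
        return 1
    fi
    echo python fairseq/scripts/average_checkpoints.py \
        --inputs "$@" --output "$output"
}

# Example (prints the command if matching files exist):
# average_best_cmd ComSpeech/checkpoints/s2st.fr-en.comspeech-zs avg_best_5.pt
```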

Checkpoints

We have released the checkpoints for each of the above steps. You can download them from 🤗HuggingFace.

Supervised Learning

Directions	S2TT Pretrain	TTS Pretrain	ComSpeech
Fr-En	[download]	[download]	[download]
De-En	[download]	[download]	[download]
Es-En	[download]	[download]	[download]

Zero-shot Learning

Directions	S2TT Pretrain	TTS Pretrain	Stage 1 Finetune	Stage 2 Finetune
Fr-En	[download]	[download]	[download]	[download]
De-En	[download]	[download]	[download]	[download]
Es-En	[download]	[download]	[download]	[download]

🖋 Citation

If you have any questions, please feel free to submit an issue or contact fangqingkai21b@ict.ac.cn.

If our work is useful for you, please cite as:

@inproceedings{fang-etal-2024-can,
    title = {Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?},
    author = {Fang, Qingkai and Zhang, Shaolei and Ma, Zhengrui and Zhang, Min and Feng, Yang},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
    year = {2024},
}