Confusion about the use of the Encodec model
#4, opened by xtluo
In your published paper, the Encodec model is used as the final acoustic teacher, but the pseudocode is:
y_VQ = embedding(x_acoustic_labels)
z = MERT(x_noised)
loss_acoustic = Cross_Entropy(z[mask_idx], y_VQ[mask_idx])
So, my questions are:
- How is the cross-entropy calculated between z and the embedding y_VQ?
- I didn't find any encodec-related information in your open-sourced code
You may refer to our open-sourced code: https://github.com/yizhilll/MERT
The EnCodec codes are pre-extracted from the audio and used directly as labels.
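
To make the first question concrete, here is a minimal sketch of how a cross-entropy against discrete codec labels can be computed. It is not taken from the MERT repository. It assumes a HuBERT-style setup in which each frame output is scored against every entry of a learnable code-embedding table and the pre-extracted codec index is the integer class target. The names masked_codec_cross_entropy and codebook, and the toy values, are hypothetical.

```python
import math


def masked_codec_cross_entropy(z, codebook, labels, mask_idx):
    """Mean cross-entropy over masked frames (illustrative only).

    z        : T x D list of frame outputs from the model
    codebook : K x D list of code embeddings (hypothetical learnable table)
    labels   : T integer codec indices, pre-extracted from the audio
    mask_idx : positions of the frames that were masked
    """
    total = 0.0
    for t in mask_idx:
        # Logit for class k = dot product of the frame output with embedding k.
        logits = [sum(a * b for a, b in zip(z[t], e)) for e in codebook]
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the true codec index.
        total += log_norm - logits[labels[t]]
    return total / len(mask_idx)


# Toy example: 3 frames, 2-dimensional outputs, codebook of size 2.
z = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
codebook = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1, 0]
mask_idx = [0, 1]  # only frames 0 and 1 contribute to the loss

loss = masked_codec_cross_entropy(z, codebook, labels, mask_idx)
print(round(loss, 6))
```

Under this reading, the embedding only supplies the class vectors used to form the logits; the targets themselves remain the integer codec indices, which is consistent with the labels being pre-extracted rather than computed by an EnCodec call inside the training code.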