The Chinese training data of the model is contaminated
#165 by bookwoods123 - opened
I have tested many audio recordings that are each over half an hour long. The transcripts contain many occurrences of the following phrases, which are not present in the original audio:
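请不吝点赞 订阅 转发 打赏支持明镜与点点栏目 ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs")
字幕志愿者 杨茜茜优优独播剧场 ("Subtitle volunteer Yang Qianqian, YoYo Exclusive Theater")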
This happens with both openai/whisper-large-v3 and openai/whisper-large-v3-turbo. I am certain that my audio does not contain these words.
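For anyone who wants to check their own transcripts for the same strings, here is a minimal sketch in plain Python. It only counts how often the two phrases reported above show up in a transcript string; it does not call the model and does not fix anything. The names `REPORTED_PHRASES` and `find_reported_phrases` are made up for this illustration, and ignoring whitespace when matching is an assumption about how the output may vary.

```python
import re

# The two strings reported in this thread, copied exactly as posted.
REPORTED_PHRASES = [
    "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目",
    "字幕志愿者 杨茜茜优优独播剧场",
]


def _squash(text: str) -> str:
    """Remove all whitespace so spacing differences do not hide a match."""
    return re.sub(r"\s+", "", text)


def find_reported_phrases(transcript: str) -> dict[str, int]:
    """Return {phrase: count} for each reported phrase found in the transcript.

    Hypothetical helper: matching ignores whitespace (an assumption),
    and phrases that never occur are left out of the result.
    """
    squashed = _squash(transcript)
    counts: dict[str, int] = {}
    for phrase in REPORTED_PHRASES:
        n = squashed.count(_squash(phrase))
        if n:
            counts[phrase] = n
    return counts
```

Calling `find_reported_phrases(text)` on the transcript of each recording gives a rough count of how often the spurious phrases appear, which may help when comparing the two checkpoints or reporting the problem.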
This is outrageous.
Same here.