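Speech and vision models have their own pretraining objectives, for example masked prediction of audio units in speech (as in HuBERT) and masked patch reconstruction or contrastive learning in vision (as in MAE and SimCLR).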