RuntimeError: CUDA out of memory.

#73
by allenxiao - opened

Thanks for providing this amazing model again.
When I modified the pretraining example's default parameters to the 12-layer version (i.e., num_embed_dim = 512, num_attn_heads = 8, and num_layers = 12), I got RuntimeError: CUDA out of memory. I believe you may have faced this problem before, so could you please provide some suggestions for avoiding the error?

Thank you for your interest in Geneformer. We trained the 12 layer model with the same resources and distributed training algorithms discussed in our manuscript for the 6 layer model. Both models were trained at the same time, over 2 years ago, so there have been advances in efficient training algorithms since then that you could consider implementing (e.g. FlashAttention).
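In case it helps later readers, here is a minimal sketch of the kind of memory-saving changes mentioned above, assuming a standard Hugging Face Transformers BERT masked-language-model setup rather than the exact Geneformer pretraining script; parameter values other than the three from the question are illustrative assumptions:

```python
# A minimal sketch, not the exact Geneformer pretraining script: it shows how the
# 12-layer parameters from the question map onto a Hugging Face BertConfig and
# how activation (gradient) checkpointing can trade compute for GPU memory.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    hidden_size=512,                # num_embed_dim in the pretraining example
    num_attention_heads=8,          # num_attn_heads
    num_hidden_layers=12,           # num_layers
    max_position_embeddings=2048,   # assumed input length; match your tokenized data
)
model = BertForMaskedLM(config)

# Recompute activations during the backward pass instead of storing them;
# this substantially reduces memory at the cost of extra compute.
model.gradient_checkpointing_enable()

# Newer transformers releases may also offer more efficient attention kernels
# (e.g. attn_implementation="sdpa" when instantiating the model); check your version.
```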

Because we expected that users may face memory limitations, we focused on the 6 layer model in our manuscript so that it would be accessible to more researchers. However, we also release the pretrained 12 layer model here for users with resources capable of working with a model of this size. Of note, fine-tuning the 12 layer model we have already pretrained will of course be much less resource-intensive than repeating the pretraining from scratch.

ctheodoris changed discussion status to closed

If I reduce the batch_size parameter to avoid OOM, would a smaller batch_size weaken the model's performance? In other words, does batch_size play an important role in Geneformer's performance?
Thank you for your patience.

I want to know that, too.

Thank you for your question. Deep learning models in general can be critically affected by learning hyperparameters, which is why we recommend always optimizing learning hyperparameters when fine-tuning the model, for example. Pretraining is more computationally intensive, so it may not be feasible to optimize hyperparameters to the same degree, but we certainly recommend optimizing them as much as your available resources allow. Changing the batch size can definitely affect the model's training, but reducing it will not necessarily affect it in a negative way. One strategy is to set the maximum batch size allowed by your resources to facilitate more efficient training and then optimize the remaining hyperparameters with this fixed batch size. Of note, the hyperparameters we used to train the 12 layer model were different from those used for the 6 layer model.
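For reference, one common way to reduce the per-step memory footprint without shrinking the effective batch size is gradient accumulation. Below is a hedged sketch using Hugging Face TrainingArguments; the numeric values are illustrative assumptions, not the hyperparameters used to pretrain Geneformer:

```python
# Illustrative values only, not the Geneformer pretraining hyperparameters.
# Gradient accumulation lets a smaller per-device batch reproduce the effective
# batch size the optimizer sees, which often avoids OOM; the remaining
# hyperparameters should still be re-tuned for the chosen batch size.
from transformers import TrainingArguments

effective_batch_size = 64          # assumed target effective batch size
per_device_batch_size = 8          # largest size that fits in GPU memory
accumulation_steps = effective_batch_size // per_device_batch_size

training_args = TrainingArguments(
    output_dir="output",                           # hypothetical path
    per_device_train_batch_size=per_device_batch_size,
    gradient_accumulation_steps=accumulation_steps,
    fp16=True,                                     # mixed precision further reduces memory
    learning_rate=1e-3,                            # re-optimize with the fixed batch size
    warmup_steps=10_000,                           # likewise
)
```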
