Need help to debug my training process
Hello everyone,
My friend and I are fine-tuning the model on our own dataset. The task is too heavy for our PCs, so we moved to SageMaker. We have a few questions:
- Firstly, is it normal for training to take 5 h on an ml.g5.24xlarge instance? We ask mainly because, for testing, we're using a very small dataset (ten audio files).
- Do we really need all the demo files? And how can we better understand the parameters in demo_cfg? (Our current reading of it is in the sketch right after this list.)
- Is there any step we did that isn't necessary and that might be causing the heavy computation? Batch size, GPUs, CUDA setup, etc.
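For reference, this is how we currently read the demo block of model_config.json, written out as a Python dict. The key names and values below are pieced together from example configs we found, so they are guesses rather than verified facts; please correct us if they're off:

```python
# Our rough understanding of the "demo" section of model_config.json
# (key names and values are guesses from example configs, not verified):
demo_settings = {
    "demo_every": 2000,      # render demo audio every N training steps
    "demo_steps": 250,       # sampler steps used for each demo
    "num_demos": 4,          # number of clips rendered per demo pass
    "demo_cond": [           # conditioning used for the demo clips
        {"prompt": "example prompt", "seconds_start": 0, "seconds_total": 30},
    ],
    "demo_cfg_scales": [3, 6, 9],  # classifier-free guidance scales to try
}
```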
We're attaching our whole training process to help with the collective debugging:
a) the model architecture
In Jupyter notebooks:
b) first imports
c) model loading
d) CUDA import and training command
OUR TRAINING COMMAND:
!python stable-audio-tools-sam/train.py --model-config stable_open_model_files/model_config.json --dataset-config stable_open_model_files/dataset_config.json --name rayan-training --save-dir checkpoints --pretrained-ckpt-path stable_open_model_files/model.safetensors --batch-size 16 --num-gpus 4 --strategy deepspeed
Outputs:
e) Models loaded
f) Some warnings and CUDA loading
g) Training in action
h) After 5 h without the run finishing, our keyboard interrupt...
We can also post the SageMaker logs here if that helps.
Thanks in advance!
5 hours? Hmm, with 4 GPUs... I heard they put something like 16,000 GPU-hours just into the VAE.
https://github.com/yukara-ikemiya/friendly-stable-audio-tools
Give that an eyeball maybe.
For what it's worth, your code seemed okay.
After 10k steps it should drop out a ckpt, keep the best 2 of those models (IIRC), and do that every 10k steps.
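For context, this is roughly how that behaviour would be wired up with Lightning's checkpoint callback. It's my assumption of the setup, not the repo's exact code, and the monitored metric name is a guess:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

ckpt_callback = ModelCheckpoint(
    dirpath="checkpoints",        # matches the --save-dir argument
    every_n_train_steps=10_000,   # write a .ckpt every 10k training steps
    monitor="train/loss",         # rank checkpoints by this logged metric (guess)
    save_top_k=2,                 # keep only the best two of them
)

trainer = pl.Trainer(callbacks=[ckpt_callback])
```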
> After 5 h without the run finishing, our keyboard interrupt...
It will NEVER conclude on its own.
The code sets max epochs to 100000 (or something equally insane). You're supposed to stop it yourself AFTER one of those 10k moments (10k, 100k, 200k, however many steps you want),
and right after it has JUST spat out a ckpt is my preferred time.
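If you'd rather have it stop cleanly on its own instead of babysitting it, one option is to cap the Trainer by step count. This assumes you're willing to edit the Trainer construction in train.py; as far as I know it isn't an existing CLI flag, so treat this as a sketch:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_steps=100_000,  # stop cleanly after 100k optimizer steps
    max_epochs=-1,      # -1 = no epoch limit, so only max_steps decides
    # ...keep the repo's other Trainer arguments (devices, strategy, etc.) as-is
)
```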
If you exit early, you can try something like this:

    import os

    # Define the path where you want to save the model
    model_save_path = os.path.join(OUTPUT_DIR, 'final_model_checkpoint.ckpt')

    # Save the model using the trainer
    trainer.save_checkpoint(model_save_path)
    print("Model saved successfully at:", model_save_path)
I believe that can, in a pinch, rip out the model trained so far.
But you're better off doing what I said before (since this is just some last-ditch code I made up one time).
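If you do end up with a .ckpt (either from the callback or from trainer.save_checkpoint), it's just a regular Lightning checkpoint, so you can peek inside it like this; the path below is whatever you saved. I think the repo also ships an unwrap_model.py script that's the cleaner way to turn a training checkpoint into plain model weights:

```python
import torch

# A Lightning .ckpt is a plain dict; the trained weights live under "state_dict".
ckpt = torch.load(
    "checkpoints/final_model_checkpoint.ckpt",
    map_location="cpu",
    weights_only=False,  # the checkpoint holds more than raw tensors
)
state_dict = ckpt["state_dict"]
print(f"{len(state_dict)} tensors, trained for {ckpt.get('global_step')} steps")
```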
Hello,
Indeed, I set a lower number for max_epochs and I'm getting the model after a keyboard interrupt. Your solution seems similar to mine, so thanks for confirming my intuition!