Job disconnected after 1000 steps

#1
by alielfilali01 - opened

Hi @osanseviero, @nateraw and @azzr
Today I launched a CPT script using llama-factory (based on HF's TRL) on 4 L4 GPUs, and after about 9 hours (542 minutes) and exactly 1000 steps (logged to wandb) the JupyterLab session just got disconnected! I don't know if this is because of the JupyterLab Space itself, or if the Space's GPUs have a 9-hour allocation limit to prevent high usage or something?
Btw, this is not the Space that got disconnected; I just created it in order to open this discussion.
I'm cc'ing @julien-c as well, since he might have insights on GPU allocation within Spaces.
Thank you, I really appreciate your help in advance.
