Job disconnected after 1000 steps
#1, opened by alielfilali01
Hi @osanseviero, @nateraw and @azzr,
Today I launched a CPT script using llama-factory (based on HF's TRL) on 4 L4 GPUs. After about 9 hours (542 minutes) and exactly 1000 steps (logged to wandb), JupyterLab just got disconnected! I don't know whether this is because of the JupyterLab Space itself, or whether the Space's GPUs have a 9-hour allocation limit to prevent high usage or something similar.
Btw, this is not the Space that got disconnected; I just created it in order to open this discussion.
I'm cc'ing @julien-c as well, since he might have insights on GPU allocation within Spaces.
Thank you, I really appreciate your help in advance.