Continue a failed job? and a possible bug report

xzhang2017 · October 6, 2019, 5:38am

Hi,

I ran Patch Motion Correction on >2000 micrographs using 2 GPUs, but the job failed after a couple of hours, and it looks like I have to start over. Is it possible to continue from where it failed? Also, after it failed, the resource manager showed no job running, but nvidia-smi showed a process still running on a GPU that had not been killed. Thanks a lot.

Bests,
Xing

stephan · October 7, 2019, 2:53pm

Hi @xzhang2017,

This is a situation we’ve noticed occurring more often now. We are implementing new features inside the preprocessing jobs that let them skip failed exposures and place them into an “exposures_failed” output result group, so that users can diagnose the issue later while continuing to process the exposures that completed successfully. I will update this thread as soon as the feature is released. Also, the failure could mean that the job had become orphaned, which may explain the discrepancy you see between the resource manager and nvidia-smi.
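To check whether a process from a failed job is still holding a GPU, one option is to query nvidia-smi and inspect the result. The following is only a minimal sketch: the helper names `parse_compute_apps` and `list_gpu_processes` are made up for illustration and are not part of cryoSPARC, and the script only lists processes; deciding whether a given PID belongs to the orphaned job (and killing it) is left to the user.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into a list of dicts.

    Hypothetical helper: expects the output of
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
    """
    apps = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or malformed rows
        pid_text, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        try:
            pid = int(pid_text.strip())
        except ValueError:
            continue  # skip rows without a numeric PID
        apps.append({"pid": pid, "name": name.strip(), "used_memory": mem.strip()})
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)


# Usage on a GPU node (requires nvidia-smi on PATH):
#   for app in list_gpu_processes():
#       print(app["pid"], app["name"], app["used_memory"])
```

If a listed PID is not associated with any job that the resource manager knows about, it is a candidate for an orphaned process.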

orangeboomerang · March 19, 2020, 7:56pm

I had this happen to me too. In general, it would be good to be able to resume jobs of all kinds.

stephan · April 1, 2020, 2:44pm

Hey @orangeboomerang,

What version are you running? This functionality has existed for preprocessing jobs since v2.13.0.

Luca · September 14, 2020, 12:52pm

Hello,
I was wondering if there had been progress on restarting failed jobs? My Patch Motion Correction job failed in the middle because of one dodgy image, and it would be nice to be able to continue from the next image, without having to restart from scratch.
Kind regards,
Luca

olibclarke · September 14, 2020, 6:40pm

You can do this already - just mark it as completed, and start a new job using the incomplete exposures as input. Then once that is done you can just combine the outputs of both jobs for CTF etc.

Cheers
Oli
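The bookkeeping behind this workaround is a set difference: the exposures still to be processed are the original inputs minus the ones the failed job already completed. Inside cryoSPARC this is handled by connecting the incomplete exposures output to the new job; the sketch below only illustrates the idea outside the GUI. The function name `remaining_exposures` and the file names are hypothetical, and it assumes exposures can be matched by file stem.

```python
from pathlib import Path


def remaining_exposures(all_inputs, completed):
    """Return the inputs not yet processed, preserving the input order.

    Assumption: an exposure is identified by its file stem
    (file name without directory or extension).
    """
    done = {Path(p).stem for p in completed}
    return [p for p in all_inputs if Path(p).stem not in done]


# Hypothetical example: three micrographs, one already motion-corrected.
inputs = ["raw/mic_001.tif", "raw/mic_002.tif", "raw/mic_003.tif"]
finished = ["raw/mic_001.tif"]
todo = remaining_exposures(inputs, finished)
print(todo)
```

Once the second job has processed the remaining exposures, the outputs of both jobs together cover the full set of inputs.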

GYADAV · October 21, 2020, 5:54pm

Hello Oli,
I have the same issue, and when I rerun the job it complains: "waiting because inputs are not ready".
What should I do with the failed job to make the inputs ready for the next job?

Thanks
Gaya

GYADAV · October 21, 2020, 5:55pm

Just mark it complete.