Continue a failed job? and a possible bug report

Hi,

I ran Patch Motion Correction on more than 2000 micrographs using 2 GPUs, but the job failed after a couple of hours, and it looks like I have to start over again. Is it possible to continue from where it failed? Also, after the failure the resource manager shows no job running, but nvidia-smi shows a process still running on one GPU that was not killed. Thanks a lot.

Bests,
Xing

Hi @xzhang2017,

This is a situation we’ve noticed occurs more often now. We are implementing new features inside the preprocessing jobs that let them ignore failed exposures and place them into an “exposures_failed” output result group, so that users can diagnose the issue later while still continuing to process the exposures that completed successfully. I will update this thread as soon as the feature is released. Also, the failure could mean that the job had become orphaned, which may explain the discrepancy you see between the resource manager and nvidia-smi.
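For anyone wanting to check for a leftover GPU process like the one described above, here is a minimal Python sketch that lists the compute processes `nvidia-smi` reports. It relies only on the `--query-compute-apps` and `--format` options of `nvidia-smi`. The helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example and are not part of cryoSPARC. Whether a listed PID really belongs to the failed job is something you would need to confirm yourself before terminating anything.

```python
import csv
import io
import subprocess

# Ask nvidia-smi for one CSV row per compute process: PID, name, memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Turn the CSV text printed by the query above into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, mem = (f.strip() for f in fields)
        # Memory is kept as a string because some drivers report "[N/A]".
        rows.append({"pid": int(pid), "name": name, "used_memory": mem})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes currently holding a GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_processes()` on the worker node returns one entry per process; comparing those PIDs against what the resource manager thinks is running would show whether something was left behind.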

I had this happen to me too. In general, it would be good to be able to resume jobs of all kinds.

Hey @orangeboomerang,

What version are you running? This functionality has existed for preprocessing jobs since v2.13.0.

Hello,
I was wondering if there had been progress on restarting failed jobs? My Patch Motion Correction job failed in the middle because of one dodgy image, and it would be nice to be able to continue from the next image, without having to restart from scratch.
Kind regards,
Luca

You can do this already - just mark the failed job as completed, and start a new job using the incomplete exposures as input. Then, once that is done, you can combine the outputs of both jobs for CTF etc.

Cheers
Oli
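The bookkeeping behind this workaround can be pictured as a set difference followed by a merge. The sketch below is plain Python and does not use any cryoSPARC API; the function names and the micrograph file names are made up purely for illustration. In practice the inputs and outputs would be connected through the cryoSPARC interface.

```python
def remaining_exposures(all_inputs, completed):
    """Return the inputs that the failed job never finished, in input order."""
    done = set(completed)
    return [name for name in all_inputs if name not in done]


def combine_outputs(first_job, second_job):
    """Merge the outputs of both jobs, keeping order and dropping duplicates."""
    seen = set()
    merged = []
    for name in list(first_job) + list(second_job):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


# Hypothetical example: four micrographs, the first job finished two of them.
all_inputs = ["mic_001.tif", "mic_002.tif", "mic_003.tif", "mic_004.tif"]
finished_first = ["mic_001.tif", "mic_002.tif"]

todo = remaining_exposures(all_inputs, finished_first)  # input for the new job
combined = combine_outputs(finished_first, todo)        # what goes on to CTF
```

Here `todo` holds the two micrographs the first job did not reach, and `combined` covers all four once the second job has processed them.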
