Continue a failed job? And a possible bug report




I ran patch motion correction on >2000 micrographs using 2 GPUs, but the job failed after a couple of hours, and it looks like I have to start over again. Is it possible to continue from where it failed? Also, after it failed, the resource manager shows no job running, but nvidia-smi shows a process still running on one GPU that was not killed. Thanks a lot.



Hi @xzhang2017,

This is a situation we’ve noticed occurring more often now. We are implementing new features in the preprocessing jobs that allow them to skip failed exposures and place them into an “exposures_failed” output result group, so that users can diagnose the issue later while still continuing to process the exposures that completed successfully. I will update this thread as soon as the feature is released. Also, the failure could mean that the job had become orphaned, which may explain the discrepancy you see between the resource manager and nvidia-smi.
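In the meantime, here is a minimal sketch for listing the compute processes that nvidia-smi still reports, which can help spot a leftover process from an orphaned job. It assumes nvidia-smi is on the PATH and uses its standard `--query-compute-apps` CSV output; the function names `parse_compute_apps` and `list_gpu_processes` are hypothetical helpers written for this example, not part of any existing tool.

```python
import subprocess

# Standard nvidia-smi query for running compute processes, one CSV row each.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    procs = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line.count(",") < 2:
            continue  # skip blank or malformed rows
        # Split pid off the front and memory off the back, so a process
        # name that happens to contain a comma is kept intact.
        pid_text, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        try:
            pid = int(pid_text.strip())
        except ValueError:
            continue  # pid not reported as a number
        procs.append({"pid": pid, "name": name.strip(), "used_memory": mem.strip()})
    return procs


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in list_gpu_processes():
        print(f"{proc['pid']}\t{proc['used_memory']}\t{proc['name']}")
```

Any PID listed here while the resource manager shows no running job is a candidate for the leftover process. Confirm that it really belongs to the failed job before terminating it by hand.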


I had this happen to me too. In general, it would be good to be able to resume jobs of all kinds.