I ran Patch Motion Correction on more than 2000 micrographs using 2 GPUs, but the job failed after a couple of hours, and it looks like I have to start over again. Is it possible to continue from where it failed? Also, after the failure, the resource manager shows no job running, but nvidia-smi shows a process still running on a GPU that was not killed. Thanks a lot.
This is a situation we’ve noticed occurring more often now. We are implementing new features inside the preprocessing jobs that allow them to ignore failed exposures and place them into an “exposures_failed” output result group, so that users can diagnose the issue later but still continue processing the exposures that completed successfully. I will update this thread as soon as the feature is released. Also, the failure could mean that the job had become orphaned, which may explain the discrepancy you see between the resource manager and nvidia-smi.
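To check whether a leftover process is still holding a GPU after a job has failed, something like the following may help. It is only a sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example. It lists the PIDs so you can compare them with what the resource manager reports; it does not kill anything.

```python
import subprocess


def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' CSV lines into tuples."""
    apps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The PID is the first field and the memory is the last one;
        # whatever sits in between is treated as the process name.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        apps.append((int(pid), name.strip(), mem.strip()))
    return apps


def list_gpu_processes():
    """Ask nvidia-smi for the compute processes currently on the GPUs."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    for pid, name, mem in list_gpu_processes():
        print(pid, name, mem)
```

Any PID listed here that does not belong to a job the resource manager knows about is a candidate orphan, and should be inspected before you decide to terminate it.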
Hello,
I was wondering if there had been progress on restarting failed jobs? My Patch Motion Correction job failed in the middle because of one dodgy image, and it would be nice to be able to continue from the next image, without having to restart from scratch.
Kind regards,
Luca
You can do this already - just mark it as completed, and start a new job using the incomplete exposures as input. Then once that is done you can just combine the outputs of both jobs for CTF etc.
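If you want to double-check which exposures still need processing before launching the second job, the bookkeeping is just a set difference. A minimal sketch, assuming you have exported two plain lists of file names yourself (the function name `remaining_exposures` and the example file names are hypothetical, and this does not call any cryoSPARC API):

```python
from pathlib import Path


def remaining_exposures(all_paths, completed_paths):
    """Return the entries of all_paths whose file name is not in completed_paths.

    Only the file name is compared, so the two lists may use different
    directories. The order of all_paths is preserved.
    """
    done = {Path(p).name for p in completed_paths}
    return [p for p in all_paths if Path(p).name not in done]


if __name__ == "__main__":
    all_movies = ["raw/mic_001.tif", "raw/mic_002.tif", "raw/mic_003.tif"]
    finished = ["mic_001.tif"]
    for path in remaining_exposures(all_movies, finished):
        print(path)
```

The resulting list should match what ends up in the incomplete exposures output of the failed job, which is what the second job takes as input.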
Hello Oli,
I have the same issue, and when I rerun the job it complains that it is "waiting because inputs are not ready".
What should I do with the failed job to make the inputs ready for the next job?