I ran Patch Motion Correction on >2000 micrographs using 2 GPUs, but the job failed after a couple of hours, and it looks like I have to start over again. Is it possible to continue from where it failed? Also, when it failed, the resource manager showed no job running, but nvidia-smi showed a process still running on a GPU that had not been killed. Thanks a lot.
This is a situation we’ve noticed occurring more often now. We are implementing new features inside the preprocessing jobs that allow them to skip failed exposures and place them into an “exposures_failed” output result group, so that users can diagnose the issue later while still continuing to process the exposures that completed successfully. I will update this thread as soon as the feature is released. Also, the failure could mean that the job had become orphaned, which may explain the discrepancy you see between the resource manager and nvidia-smi.
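To illustrate the idea of setting failed exposures aside instead of failing the whole job, here is a minimal Python sketch. It is not the actual implementation inside the preprocessing jobs; the names `process_exposures` and `fake_motion_correct` and the file names are hypothetical and only stand in for the real per-exposure processing.

```python
from typing import Callable, Iterable, List, Tuple


def process_exposures(
    exposures: Iterable[str],
    process_one: Callable[[str], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Process each exposure independently and collect failures separately.

    Returns (completed, failed): completed holds (exposure, result) pairs,
    failed holds (exposure, error message) pairs for later diagnosis.
    """
    completed: List[Tuple[str, str]] = []
    failed: List[Tuple[str, str]] = []
    for exposure in exposures:
        try:
            completed.append((exposure, process_one(exposure)))
        except Exception as err:
            # One bad exposure is recorded and skipped; the loop continues.
            failed.append((exposure, str(err)))
    return completed, failed


def fake_motion_correct(path: str) -> str:
    """Hypothetical stand-in for motion correction of one micrograph."""
    if "bad" in path:
        raise ValueError(f"could not read {path}")
    return path.replace(".tif", "_aligned.mrc")


completed, failed = process_exposures(
    ["mic_001.tif", "mic_bad.tif", "mic_003.tif"],
    fake_motion_correct,
)
print("completed:", completed)
print("exposures_failed:", failed)
```

In this sketch the failed list plays the role of the “exposures_failed” group: the bad micrograph is kept with its error message, and the remaining micrographs are still processed.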
I was wondering if there has been any progress on restarting failed jobs? My Patch Motion Correction job failed partway through because of one dodgy image, and it would be nice to be able to continue from the next image without having to restart from scratch.