I just had a very annoying issue with Live where 147 of 14,108 exposures had no motion correction output files, and a smaller number (7) had no CTF estimation output files. The missing files caused downstream jobs to fail. The only solution I found was to identify these exposures and manually reject them, then use exposure sets to isolate them and run the missed processing steps. The new outputs can apparently be combined with the rest of the data, either with Curate Exposures or by copying the output files into the Live directories.
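For anyone hitting the same thing, here is a minimal sketch of how the affected exposures could be located by comparing a list of exposure names against the files actually present in an output directory. It is only an illustration: the function name, the directory path, and the `_aligned.mrc` suffix are placeholders I made up, not the real Live file naming, so adjust them to whatever your session actually writes. Zero-byte files are also counted as missing, on the assumption that a full filesystem can leave empty outputs behind.

```python
from pathlib import Path


def find_missing_outputs(exposure_stems, output_dir, suffix, min_bytes=1):
    """Return the exposure stems with no usable `<stem><suffix>` file in output_dir.

    A file smaller than `min_bytes` is treated as missing, since an
    interrupted write can leave an empty file behind.
    All names here are hypothetical; nothing is specific to any one package.
    """
    output_dir = Path(output_dir)
    # Map file name -> size for every regular file in the output directory.
    sizes = {p.name: p.stat().st_size for p in output_dir.iterdir() if p.is_file()}
    missing = []
    for stem in exposure_stems:
        expected = f"{stem}{suffix}"
        if sizes.get(expected, 0) < min_bytes:
            missing.append(stem)
    return missing


if __name__ == "__main__":
    # Placeholder inputs: replace with your own exposure names and paths.
    stems = ["exposure_0001", "exposure_0002"]
    print(find_missing_outputs(stems, ".", "_aligned.mrc"))
```

Running the same check with a second suffix for the CTF outputs would give the shorter list, and the resulting names can then be used to pick out the exposures to reject and reprocess.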
I am not sure why this happened. My best guess is that it is because I used 12 workers to reprocess the whole dataset quickly in Live, and some of those workers were never allowed to run by the cluster queue. Could that be the problem?
The Live session had also run out of filesystem space earlier, but after that was fixed the session was cleared and all the data was reprocessed. I used Live for the reprocessing because the offline multi-GPU jobs are constrained by the node configuration (4 GPUs), while the Live workers are independent and any number of them can be used.