Edge case with extra live workers?

I just had a very annoying issue with Live where a number of exposures (147 of 14108) had no motion correction output files, and a smaller number (7) had no CTF estimation output files. The missing files caused downstream jobs to fail, and the only solution I found was to identify these exposures, manually reject them, and then use exposure sets to isolate them and rerun the missed processing steps. It seems the new outputs can be successfully combined with the rest of the data either by using Curate Exposures or by copying the output files into the Live directories.

I am not sure why this happened. My best guess is that it is related to my using 12 workers to rapidly reprocess the whole dataset in Live, some of which were never allowed to run by the cluster queue. Is that possibly the problem?

The Live session also ran out of filesystem space previously, but after that was fixed the session was cleared and all the data was reprocessed. I used Live to reprocess because the offline multi-GPU jobs are constrained by the node configurations (4 GPUs), while the Live workers are independent and any number can be used.

Did you run these missed steps in a Live session or inside “standard” CryoSPARC? Did you observe anything unusual for these exposures?

We are not sure. If you would like us to investigate, could you please send us:

  1. the filenames of exposures that had missing outputs, each annotated with the step (motion correction or CTF estimation) for which output was missing.
  2. the exposure indices for exposures you manually rejected, for example 3 in the attached screenshot.
  3. for each Live Worker job in the session workspace, the output of the command
    for job in 4 5 6 7; do
        cryosparcm cli "get_job_streamlog('P123', 'J${job}')" >> /tmp/job_events.out
    done

    where you would substitute the actual project UID and the numeric portions of the Live Worker job UIDs.
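
For item 1, the list of affected filenames could be assembled with something like the rough sketch below. It assumes you have a text file with one exposure base name per line and that each processing step writes one file per exposure whose name contains that base name and ends in a known suffix. The function name, the list file, the directory, and the suffix are all hypothetical placeholders, not CryoSPARC conventions, so check them against what is actually on disk.

```shell
# Hypothetical helper (not part of CryoSPARC): print every exposure name in
# a list file that has no matching output file in a given directory.
# The directory layout and the suffix are assumptions; adjust both to the
# files actually present in your session.
list_missing_outputs() {
    list_file=$1   # text file, one exposure base name per line
    out_dir=$2     # directory expected to hold the per-exposure outputs
    suffix=$3      # placeholder output suffix for the step being checked
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue   # skip empty lines
        # Glob around the base name so that any added prefix still matches.
        set -- "$out_dir"/*"$name"*"$suffix"
        # With no match the pattern stays literal, so the -e test fails.
        [ -e "$1" ] || printf '%s\n' "$name"
    done < "$list_file"
}

# Example call with placeholder arguments:
#   list_missing_outputs exposures.txt /path/to/outputs _suffix.mrc > missing.txt
```

Running it once per step, each time with the appropriate directory and suffix, would give one list per step, which covers the annotation requested in item 1.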

I will contact you in a direct message regarding where to send the information.