Full-Frame / Patch Motion Correction Failing

Hello, our lab is running full-frame motion correction on 15 groups of 1,000 movies each (imported from .tiff files). Some of the 15 groups are fine, while others fail with this error: “Job is Unresponsive - no heartbeat in 30 seconds”. Up until that error, the job appears to run normally. A failed job has also made our entire cryoSPARC system unresponsive, which we could only recover by restarting the local computer and then restarting cryoSPARC. Has anyone seen an issue like this before? Thanks so much!

Welcome to the forum @emmarl25.

Could you please describe your cryoSPARC instance:
is it a “single workstation”, “master and separate worker(s)”, or a “cluster”?
Could the failing jobs have coincided with other workloads that had been initiated independently from the cryoSPARC scheduler?

I believe it is “master and separate workers”. It is unlikely that other jobs were at fault since this happened multiple times overnight, when there was no workload.

With reference to the “job logs” (see the guide), I would search for a pattern in which jobs succeeded and which failed, for example:

  • do success/failure depend on which worker node was allocated to the job?
  • were jobs more likely to succeed/fail depending on when they started running?
  • do failed groups of movies have anything in common?
  • could there be intermittent network problems preventing reliable access to the (presumably shared between worker nodes) data storage? Places to check (see the command sketch after this list):
    • system logs on worker nodes
    • output of cryosparcm log command_core
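
For the last two items, a rough sketch of what you might run (assuming systemd-based Linux worker nodes and NFS-mounted shared storage; the dates, time window and grep patterns below are placeholders to adjust to your setup):

    # On each worker node: look for storage/network complaints around the
    # time a job lost its heartbeat (times below are placeholders)
    sudo journalctl --since "2024-05-14 00:00" --until "2024-05-14 08:00" | grep -iE 'nfs|timed out|link is down'
    sudo dmesg -T | grep -iE 'nfs|timed out'

    # On the master node: inspect the cryoSPARC scheduler/command log
    cryosparcm log command_core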

All of the jobs were queued onto the master node, and some jobs failed when they were the only ones in the queue. Is there any way to skip to a certain day when using cryosparcm log command_core? The log goes minute by minute and it’s taking a while to check each job.

And when looking up specific jobs using cryosparcm joblog, the output for the failed jobs just says “searching for heartbeat”.

Recent versions of cryoSPARC support log filtering, including filtering based on date.
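
I am not sure off-hand which filter options your cryoSPARC version exposes (the guide documents them), but as a version-independent fallback you can pipe the output through standard shell tools and match on the date stamp. A minimal sketch, assuming the log lines begin with an ISO-style date such as 2024-05-14 (adjust the pattern to whatever timestamp format you actually see):

    # Keep only entries from one day
    cryosparcm log command_core | grep '2024-05-14'

    # Narrow further to a specific job, e.g. a hypothetical J123
    cryosparcm log command_core | grep '2024-05-14' | grep 'J123'

If the command keeps following the log rather than exiting, you can instead grep the command_core log file in your cryoSPARC master installation's run directory.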

Thanks so much! However, there does not seem to be anything abnormal in the output - there are usually just four jobs in the queue at a time, and there are no apparent errors.

One possibility: the computer is running out of DRAM, causing some jobs to “miss” their heartbeat deadline.
What is the RAM configuration on the computer (free -g)?
How many jobs could end up running simultaneously, based on CPU, GPU resources and job specifications?
How much RAM do the jobs use? RAM usage is shown at the beginning of some lines in a job’s Overview tab.
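
To get concrete numbers while the motion correction jobs are running, something like the following (standard Linux tools, run on the node that executes the jobs) can help:

    # Total / used / free RAM and swap, in GiB
    free -g

    # Largest processes by resident memory (RSS, in kB)
    ps -eo pid,rss,comm --sort=-rss | head -n 15

    # Refresh the memory summary every 5 seconds while jobs run
    watch -n 5 free -g

If used memory plus swap approaches the total whenever several jobs run at once, that would support the out-of-memory explanation.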


Oh, I think that’s what it was. At the time, there were many jobs waiting in the master queue. When I checked the job RAM usage at the top, the array was [0, 1]. We now queue the jobs in batches of 2 on the worker node, and that seems to be fine!