V3.3.2 : Job marked as FAILED still running in SLURM and Cryosparc web

Hi people,

This is quite strange, and I will try to explain it as best I can. :thinking:
Context: we have a heterogeneous SLURM cluster where several CryoSPARC instances are running for different users, each instance on its own subset of ports on the same server. We also run different versions, and each user has their own processing lanes.

Problem: one of the users, on v3.3.2, is launching a batch of jobs (around 60, mostly “Local Refinement (NEW!)” jobs), and some of them (around 10) end up marked as FAILED, but if you check, they are still running in SLURM.
Sample SLURM output (blurred for privacy):

```
JOBID queue    user@instance_lane_project_job     user  RUNNING      5:29:37      1      8        96G     (null)       gres:gpu:A40:1     290107 slurmnodename
```

Even worse, while in this state the jobs keep being updated on the web interface, to the point that jobs marked FAILED later magically end up as completed. In fact, almost all of them eventually show as completed, and according to the user, the jobs converted from FAILED to completed even show nice results. :nerd_face:

So the complaint from the user is now:
“How can I know that a job marked as FAILED in the CryoSPARC web interface really failed?” :sweat:
And my question is: is there any way to avoid or patch this confusing behaviour? Maybe by increasing the heartbeat interval for this particular user’s instance, or something similar?
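In the meantime, one way to answer the user’s question is to cross-check the scheduler’s own record before trusting the web UI. The sketch below is my own workaround, not part of CryoSPARC: it shells out to SLURM’s `sacct` to ask for a job’s real state, given the cluster job ID that SLURM assigned (e.g. 290107 in the sample output above).

```python
import subprocess

def parse_sacct_state(sacct_output: str) -> str:
    """Extract the job state token from `sacct --format=State --noheader -X` output."""
    text = sacct_output.strip()
    if not text:
        return "UNKNOWN"
    # First line is the job's top-level record; first field is the state
    # (e.g. RUNNING, COMPLETED, FAILED, or "CANCELLED by <uid>").
    return text.splitlines()[0].split()[0]

def slurm_state(jobid: str) -> str:
    """Ask SLURM directly what state a job is in, bypassing the CryoSPARC UI."""
    out = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=False,
    ).stdout
    return parse_sacct_state(out)
```

Running `slurm_state("290107")` on a submit node would then report the scheduler’s view of the sample job above, regardless of what the web interface claims.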

Thank you for your time in advance!

May I suggest first updating this CryoSPARC instance? It is possible that updates to heartbeat-related code after v3.3.2 resolve this issue.

Hi @wtempel, thanks for the comment. I asked the user to update and test with the new setup.
So far, all the “failed” jobs have ended up completing successfully on v3.3.2, so I will need to wait until the user launches something similar on the updated instance.
I will add more info to this post if I see something similar happening with the newest CS.
Thanks also for your quick response!