This is quite strange, and I will try to explain it as best I can.
Context: we have a heterogeneous SLURM cluster where several CryoSPARC instances run for different users, each instance on its own subset of ports on the same server. We also run several CryoSPARC versions, and each user has their own processing lanes.
Problem: one of the users, on v3.3.2, launched a batch of about 60 jobs (mostly “Local Refinement (NEW!)”). Around 10 of them end up marked as FAILED in the web interface, but if you check SLURM they are still running.
Sample squeue output (anonymized for privacy):
JOBID queue user@instance_lane_project_job user RUNNING 5:29:37 1 8 96G (null) gres:gpu:A40:1 290107 slurmnodename
Even worse, jobs in this state keep being updated on the web interface, to the point that FAILED jobs later magically turn into COMPLETED; in the end almost all of them end up marked as completed. According to the user, the jobs that flipped from FAILED to COMPLETED even show nice results.
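For now, the only reliable check we have is querying SLURM directly and comparing against what the web interface says. Below is a minimal sketch of how we script that cross-check; it assumes `sacct` is on the PATH, and the job ID is just the one from the squeue sample above, used as an example:

```python
#!/usr/bin/env python3
"""Cross-check CryoSPARC-reported job states against SLURM via sacct.

Sketch only: parses the pipe-separated output of
`sacct -nP --format=JobID,State` into a {jobid: state} map.
"""
import subprocess


def parse_sacct(output: str) -> dict:
    """Map top-level SLURM job IDs to their reported state."""
    states = {}
    for line in output.strip().splitlines():
        jobid, state = line.split("|", 1)
        # Skip sub-steps such as "290107.batch"; keep only the parent job.
        if "." not in jobid:
            # Strip suffixes like "CANCELLED by 1000" down to "CANCELLED".
            states[jobid] = state.split()[0]
    return states


def slurm_states(jobids):
    """Query sacct for the given job IDs (requires a SLURM environment)."""
    out = subprocess.run(
        ["sacct", "-nP", "--format=JobID,State", "-j", ",".join(jobids)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sacct(out)


if __name__ == "__main__":
    # 290107 is the job ID from the squeue sample; replace with real IDs.
    print(slurm_states(["290107"]))
```

If `slurm_states` reports RUNNING for a job the web interface shows as FAILED, we know the FAILED status is not to be trusted yet.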
So the complaint from the user is now:
“How can I know that a job marked as FAILED in the CryoSPARC web interface really failed?”
And my question is: is there any way to avoid or patch this confusing behaviour? Perhaps by increasing the heartbeat interval for this particular user's instance, or something similar?
Thank you for your time in advance!