So we are running CryoSPARC on our HPC system. We recently gained exclusive access to a single A100 with 128 GB of memory. The HPC system uses SLURM for job control. When submitting jobs I have noticed that a job goes through a significant number of tries before running, and neither I nor our HPC folks are sure what is going on. Has anyone else run into something like this? As an example, a Blob Picker job took, I would guess, 5000 tries before actually running. Nothing else is running on the card, and the time is still better than the days of waiting on a shared card, but it is puzzling.
I have run into this issue as well. An Ab-initio job that I have queued has undergone 2900+ tries already. I thought it was a cluster issue, but since your HPC folks have no idea either, I would like to follow this up. We have A100 cards with 48 GB of memory. I would like to know more about this.
Here are the relevant lines from the event log:
[2023-03-30 15:26:04.29]
-------- Cluster job status at 2023-03-31 19:15:04.761925 (7922 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14772352 vr_cryo cryospar lenthoma R 22:48:11 1 c833
As you can see, at this point there have been 7922 retries; this is an Extract From Micrographs job. As to what we have done: nothing. The Blob Picker job eventually ran about 15 hours later, which is faster than the days of waiting on the common cards due to overall HPC user demand. Still, it is puzzling.
If you run the job status command on the HPC system, it just indicates that the job is running.
When a CryoSPARC job is submitted to a cluster, a bash submission script is created in the job directory. This submission script is submitted to the cluster scheduler, at which point the cluster management software, in this case SLURM, takes over responsibility for running the CryoSPARC job code.
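For concreteness, a minimal submission script of the kind CryoSPARC generates might look roughly like the sketch below. The real queue_sub_script.sh is rendered from your cluster lane's script template, so the partition, resource requests, and worker command line here are placeholders rather than what CryoSPARC actually writes on your system.

#!/usr/bin/env bash
# Hypothetical example only: the real script is generated from the cluster
# configuration template and will differ on every installation.
#SBATCH --job-name=cryosparc_P10_J13   # project/job IDs borrowed from this thread
#SBATCH --partition=vr_cryo            # partition name taken from the squeue output above
#SBATCH --gres=gpu:1                   # request one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --output=slurm-%j.out          # stdout file
#SBATCH --error=slurm-%j.err           # stderr file

# The body invokes the CryoSPARC worker for the specific project and job
# (the exact command line is an assumption here).
/path/to/cryosparc_worker/bin/cryosparcw run --project P10 --job J13 \
    --master_hostname master.example.org --master_command_core_port 39002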
In CryoSPARC, when a job has been submitted to a cluster, the event log shows the status of the submission after the submission script has been run. The status is polled at a constant interval.
[2023-03-30 15:26:04.29]
-------- Cluster job status at 2023-03-31 19:15:04.761925 (7922 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14772352 vr_cryo cryospar lenthoma R 22:48:11 1 c833
In the event above, the SLURM job status is being displayed. The “7922 retries” is not the number of times the job has been submitted to SLURM, but rather the number of status checks that have occurred since the job was submitted. According to the Slurm Workload Manager squeue documentation, the status code “R” indicates the job has been assigned a node and is running.
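To make concrete what a single “retry” is, the status check is essentially the cluster status command (squeue here) rerun on a timer. A rough sketch of that polling, purely for illustration since the real command and interval come from your CryoSPARC cluster configuration:

#!/usr/bin/env bash
# Illustrative polling loop: print one status line per interval for a SLURM job,
# the way the CryoSPARC event log does. Job ID and interval are assumptions.
JOBID=14772352      # job ID from the event log above
INTERVAL=10         # seconds between checks; the real interval may differ
retries=0
while true; do
    echo "-------- Cluster job status at $(date) (${retries} retries)"
    squeue -j "${JOBID}" || break   # stop once SLURM no longer knows this job ID
    retries=$((retries + 1))
    sleep "${INTERVAL}"
done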
After the submission script is submitted, SLURM becomes responsible for assigning a node to the job and running it. If the job is not running immediately after submission, I recommend monitoring the status to see whether it is stuck in a waiting/pending state, and checking the job log tab for any output. You can also try manually submitting the script on a cluster node and monitoring the output of that, as in the sketch below. The submission command is printed in the event log; in @lmthomas 's case it is printed as -------- Submission command: sbatch /ourdisk/hpc/bsc/cbourne/dont_archive/CS-bourne/J13/queue_sub_script.sh
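If you want to reproduce this outside CryoSPARC, a hedged example of submitting and watching the generated script would be the following (substitute the queue_sub_script.sh path and the job ID from your own event log):

# Submit the script CryoSPARC generated; sbatch prints the SLURM job ID.
sbatch /ourdisk/hpc/bsc/cbourne/dont_archive/CS-bourne/J13/queue_sub_script.sh
# Check the job's state and, if it is pending, the reason SLURM gives for that.
squeue -j <jobid> --format="%i %P %T %R"   # a PD state plus the REASON column shows why it is stuck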
As for this event,
Cluster job status update for P10 J13 failed with exit code 1 (24444 retries)
slurm_load_jobs error: Invalid job id specified
SLURM will eventually drop job IDs after they have not been active for some time. This means that eventually, status retries will fail because the job ID being queried no longer refers to a valid SLURM job. The above event is expected if the SLURM job was submitted a long time ago and the CryoSPARC job is still running, but status updates no longer work for that ID because SLURM has dropped it.
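For what it is worth, once a job ID has aged out of squeue you can often still look it up in SLURM's accounting records, provided accounting is enabled on your cluster:

# An ID that squeue has dropped produces the error shown above:
squeue -j 14772352       # -> slurm_load_jobs error: Invalid job id specified
# sacct queries the accounting database instead and still knows completed jobs:
sacct -j 14772352 --format=JobID,JobName,State,Elapsed,ExitCode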
I’m now having this issue as well (exact same symptoms as described). I just updated to v4 and have jobs that aren’t starting to run immediately after submission. I never had this problem with v3, so while it is happening after the handoff to SLURM, I think it’s related to some difference between v3 and v4.
Any tips on how to get a job from this state to actually start running?
May I ask what happens after SLURM drops the job ID? Once the job ID is dropped, it is not possible to track the job status any more.
If the count of “retries” is still increasing even after SLURM has dropped the job ID, does that mean the job is still running? Roughly how many retries does it take for a GPU test or Launch test job to finish? Are there any examples you can share that show the number of retries for a GPU or Launch test?
If the SLURM job ID was correctly extracted from the sbatch output, but SLURM no longer “knows” about the corresponding job, then SLURM should have ensured that the job is no longer running, if it ever ran. Under those circumstances, you may want to investigate why the SLURM job exited prematurely. Relevant information may be found in:
the job log (under the Metadata|Log subtab)
the files specified by the #SBATCH -e/--error and -o/--output parameters (see the sketch at the end of this reply)
the SLURM logs (this may require help from your IT support)
… you may want to perform the CryoSPARC Kill Job action.
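As a rough illustration of the second item in the list above, you could check which files SLURM was asked to write and then inspect them; the paths and filenames below are placeholders, not the actual ones from this thread:

# Find which stdout/stderr files the generated script told SLURM to use:
grep -E '^#SBATCH (-o|--output|-e|--error)' /path/to/project/J13/queue_sub_script.sh
# Then look at the tail of those files for error messages (names are placeholders):
tail -n 50 slurm-14772352.out slurm-14772352.err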