So we are running CryoSPARC on our HPC system. We recently gained exclusive access to a single A100 with 128 GB of memory. The HPC system uses SLURM for job control. When submitting jobs I have noticed that a job goes through a significant number of tries before running, and neither I nor our HPC folks are sure what is going on. Has anyone else run into something like this? As an example, a Blob Picker job took, I would guess, 5000 tries before actually running. Nothing else is running on the card, and the time is still better than the days of waiting on a shared card, but it is puzzling.
I have run into this issue as well. An Ab-initio job that I have queued has undergone 2900+ tries already. I thought it was a cluster issue, but since your HPC folks have no idea either, I would like to follow this up. We have A100 cards with 48 GB of memory. I would like to know more about this.
Here are the relevant lines from the event log:
[2023-03-30 15:26:04.29]
-------- Cluster job status at 2023-03-31 19:15:04.761925 (7922 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14772352 vr_cryo cryospar lenthoma R 22:48:11 1 c833
As you can see, at this point there have been 7922 retries; this is an Extract From Micrographs job. As to what we have done: nothing. The Blob Picker job eventually ran about 15 hours later, which is faster than the days of waiting on the common cards due to overall HPC user demand. Still, it is puzzling.
If you run the job status command on the HPC system, it just indicates that the job is running.
When a CryoSPARC job is submitted to a cluster, a bash submission script is created in the job directory. This submission script is submitted to the cluster scheduler, at which point the cluster management software, in this case SLURM, takes over responsibility for running the CryoSPARC job code.
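For concreteness, a minimal submission script of the kind CryoSPARC generates might look roughly like the sketch below. The real queue_sub_script.sh is rendered from your cluster lane's script template, so the partition, resource requests, and worker command line here are placeholders rather than what CryoSPARC actually writes on your system.

#!/usr/bin/env bash
# Hypothetical example only: the real script is generated from the cluster
# configuration template and will differ on every installation.
#SBATCH --job-name=cryosparc_P10_J13   # project/job IDs borrowed from this thread
#SBATCH --partition=vr_cryo            # partition name taken from the squeue output above
#SBATCH --gres=gpu:1                   # request one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --output=slurm-%j.out          # stdout file
#SBATCH --error=slurm-%j.err           # stderr file

# The body invokes the CryoSPARC worker for the specific project and job
# (the exact command line is an assumption here).
/path/to/cryosparc_worker/bin/cryosparcw run --project P10 --job J13 \
    --master_hostname master.example.org --master_command_core_port 39002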
In CryoSPARC, when a job has been submitted to a cluster, the event log shows the status of the submission after the submission script has been run. The status is polled at a constant interval.
[2023-03-30 15:26:04.29]
-------- Cluster job status at 2023-03-31 19:15:04.761925 (7922 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14772352 vr_cryo cryospar lenthoma R 22:48:11 1 c833
In the event above, the SLURM job status is being displayed. The “7922 retries” is not the number of times the job has been submitted to SLURM, but rather the number of status checks that have occurred since the job was submitted. According to the Slurm Workload Manager squeue documentation, the status code “R” indicates the job has been assigned a node and is running.
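To make concrete what a single “retry” is, the status check is essentially the cluster status command (squeue here) rerun on a timer. A rough sketch of that polling, purely for illustration since the real command and interval come from your CryoSPARC cluster configuration:

#!/usr/bin/env bash
# Illustrative polling loop: print one status line per interval for a SLURM job,
# the way the CryoSPARC event log does. Job ID and interval are assumptions.
JOBID=14772352      # job ID from the event log above
INTERVAL=10         # seconds between checks; the real interval may differ
retries=0
while true; do
    echo "-------- Cluster job status at $(date) (${retries} retries)"
    squeue -j "${JOBID}" || break   # stop once SLURM no longer knows this job ID
    retries=$((retries + 1))
    sleep "${INTERVAL}"
done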
After the submission script is submitted, SLURM becomes responsible for assigning a node to the job and running it. If the job is not running immediately after submission, I recommend monitoring the status to see whether it is stuck in a waiting/pending state, and checking the job log tab for any output. You can also try manually submitting the script on a cluster node and monitoring the output of that, as in the sketch below. The submission command is printed in the event log; in @lmthomas 's case it is printed as -------- Submission command: sbatch /ourdisk/hpc/bsc/cbourne/dont_archive/CS-bourne/J13/queue_sub_script.sh
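If you want to reproduce this outside CryoSPARC, a hedged example of submitting and watching the generated script would be the following (substitute the queue_sub_script.sh path and the job ID from your own event log):

# Submit the script CryoSPARC generated; sbatch prints the SLURM job ID.
sbatch /ourdisk/hpc/bsc/cbourne/dont_archive/CS-bourne/J13/queue_sub_script.sh
# Check the job's state and, if it is pending, the reason SLURM gives for that.
squeue -j <jobid> --format="%i %P %T %R"   # a PD state plus the REASON column shows why it is stuck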
As for this event,
Cluster job status update for P10 J13 failed with exit code 1 (24444 retries)
slurm_load_jobs error: Invalid job id specified
SLURM will eventually drop job IDs after they have not been active for some time. This means that eventually, status retries will fail because the job ID being queried no longer refers to a valid SLURM job. The above event is expected if the SLURM job was submitted a long time ago and the CryoSPARC job is still running, but status updates no longer work for that ID because SLURM has dropped it.
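For what it is worth, once a job ID has aged out of squeue you can often still look it up in SLURM's accounting records, provided accounting is enabled on your cluster:

# An ID that squeue has dropped produces the error shown above:
squeue -j 14772352       # -> slurm_load_jobs error: Invalid job id specified
# sacct queries the accounting database instead and still knows completed jobs:
sacct -j 14772352 --format=JobID,JobName,State,Elapsed,ExitCode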
I’m now having this issue as well (exact same symptoms as described). I just updated to v4 and have jobs that aren’t starting to run immediately after submission. I never had this problem with v3, so while it is happening after the handoff to SLURM, I think it’s related to some difference between v3 and v4.
Any tips on how to get a job from this state to actually start running?
May I ask what happens after SLURM drops the job ID? Once the job ID is dropped, it is not possible to track the job status any more.
If the count of “retries” is still increasing even after SLURM has dropped the job ID, does that mean the job is still running? Roughly how many retries does it take for a GPU test or Launch test job to finish? Are there any examples you can share that show the number of retries for a GPU or Launch test?
If the SLURM job ID was correctly extracted from the sbatch output, but SLURM no longer “knows” about the corresponding job, then SLURM should have ensured that the job is no longer running, if it ever ran. Under those circumstances, you may want to investigate why the SLURM job exited prematurely. Relevant information may be found in:
the job log (under the Metadata|Log subtab)
the files specified by the #SBATCH -e/--error and -o/--output parameters (see the sketch at the end of this reply)
the SLURM logs (this may require help from your IT support)
… you may want to perform the CryoSPARC Kill Job action.
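As a rough illustration of the second item in the list above, you could check which files SLURM was asked to write and then inspect them; the paths and filenames below are placeholders, not the actual ones from this thread:

# Find which stdout/stderr files the generated script told SLURM to use:
grep -E '^#SBATCH (-o|--output|-e|--error)' /path/to/project/J13/queue_sub_script.sh
# Then look at the tail of those files for error messages (names are placeholders):
tail -n 50 slurm-14772352.out slurm-14772352.err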