We would like to report a potential bug in v5 that halts all jobs in the CryoSPARC queue before they are launched to the cluster. We found that running the v5 command “cryosparcm restart scheduler” brings the queued jobs back to active.
Due to restrictions in our HPC policy, we cannot have a dedicated VM to host the CryoSPARC master. Instead, both our CryoSPARC master and workers are launched as independent batch jobs per HPC user. This setup worked perfectly for older versions; however, starting from v4.4, queued jobs started to hang indefinitely, even after the input was ready. We previously found that running cryosparcm restart command_core would bring the queued jobs back to active.
For v5, every job now hangs in the queue, not just those waiting for unfinished input. Our workaround is a wrapper that restarts the v5 cryosparcm scheduler every 60 seconds. I’m wondering if you could look into this and possibly fix it in the code? Many thanks.
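For reference, a minimal sketch of the kind of wrapper described above, not our exact script. The function name `restart_scheduler_loop`, its arguments, and the `RESTART_CMD` override are hypothetical and only there for illustration; the only CryoSPARC command used is the `cryosparcm restart scheduler` call mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (names and arguments are assumptions, not part of CryoSPARC).
# Runs the restart command repeatedly, pausing between runs.
restart_scheduler_loop() {
  local interval="${1:-60}"   # seconds to wait between restarts
  local max_iter="${2:-0}"    # 0 means keep looping until the batch job ends
  # RESTART_CMD can be overridden for a dry run, e.g. RESTART_CMD='echo restarted'
  local restart_cmd="${RESTART_CMD:-cryosparcm restart scheduler}"
  local i=0
  while true; do
    # Word splitting of $restart_cmd is intentional (command plus its arguments).
    $restart_cmd || echo "scheduler restart failed (attempt $((i + 1)))" >&2
    i=$((i + 1))
    if [ "$max_iter" -gt 0 ] && [ "$i" -ge "$max_iter" ]; then
      break
    fi
    sleep "$interval"
  done
}

# Example: start the loop in the background from the master's batch script.
# restart_scheduler_loop 60 &
```

Running it with `RESTART_CMD='echo restarted' restart_scheduler_loop 0 3` gives a quick dry run without touching the master.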
post the sbatch or srun script and options for the CryoSPARC master.
send us the tgz file created with the command cryosparcm snaplogs
just after it is observed that
(for a master instance whose scheduler service is not automatically restarted every 60 seconds).
let us know how long after the scheduler restart jobs would begin “hanging” again
Thanks for the speedy reply. The .tgz file and the run_server.sh script that starts the master have been sent to: [redacted].
Regarding the third question, “let us know how long after the scheduler restart jobs would begin ‘hanging’ again”:
To put it simply: almost immediately. Strangely, when the master is freshly started, any job submitted into the queue launches successfully, until one of them is stopped, finished, or killed. After that, any jobs submitted to the queue are halted until the cryosparcm scheduler is restarted. Unlike a fresh master restart, the scheduler restart only seems to activate jobs already in the queue for a single instance; newly submitted jobs after that still get stuck. Hope that helps.
Thanks @morganyang422 for sending the information.
Resource-limiting #SBATCH options, such as --time=, --cpus-per-task= or --mem= (or their single-character equivalents) or similar limitations imposed on the slurm job might be causing some CryoSPARC master services to malfunction.
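One way to review which of these options are set is to list the matching directives in the submission script. The sketch below assumes the directives appear as `#SBATCH` lines at the start of a line; the function name `list_sbatch_limits` is hypothetical, and the pattern only covers the options named above (plus the `-t` and `-c` short forms), so other limits would need to be added by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (with line numbers) the #SBATCH directives in a
# script that set time, CPU, or memory limits.
list_sbatch_limits() {
  local script="$1"
  # Matches --time, --cpus-per-task, --mem (including --mem-per-cpu), -t and -c.
  grep -nE '^#SBATCH[[:space:]]+(--(time|cpus-per-task|mem)|-[tc])' "$script"
}

# Example, using the script name mentioned earlier in this thread:
# list_sbatch_limits run_server.sh
#
# Limits can also come from partition or site defaults rather than the script;
# from inside the running job they can be inspected with:
# scontrol show job "$SLURM_JOB_ID"
```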
Thanks for the suggestion, but could you be more specific? The ‘stuck in queue’ problem persists regardless of how much we increase resource allocation. The CryoSPARC internal scheduler worked fine in our environment before v4.4, but now every job gets stuck after upgrading to v5. It seems like something changed in your code that is not compatible with the cluster environment.
We understand this may not be an easy fix, but we wanted to bring to your attention that this issue persists and is becoming more serious for cluster setups.
@morganyang422 What is the output of the command cat /sys/kernel/mm/transparent_hugepage/enabled
on the compute node where the CryoSPARC master slurm job is running?
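For reading the result: the kernel lists the available modes in that file and marks the active one with square brackets, for example `always madvise [never]`. A small sketch for pulling out just the active mode follows; the function name `thp_active_mode` and the optional path argument are hypothetical conveniences.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the active transparent hugepage mode, i.e. the
# value shown in square brackets in the sysfs file.
thp_active_mode() {
  local path="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$path"
}

# Example: run inside the CryoSPARC master Slurm job so that it reports the
# setting of that compute node rather than the login node.
# thp_active_mode
```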