@dirk @maxim Do you submit your CryoSPARC jobs to a cluster? If so, please can you
- post the output of the command
cryosparcm cli "get_scheduler_targets()"
- indicate the name of the relevant cluster scheduler lane (as configured inside CryoSPARC)
Yes, the users submit jobs to a SLURM queuing system with three queues, 2gpu, 4gpu, and 6gpu, with the number of available GPUs per SLURM node in the queue name. The abnormally long jobs were running on the 4gpu and 6gpu queues. The longest 2D classification with v4.6.0 ran on the 4gpu queue for ~120 h with fewer than 1 million particles.
Here is the output of the get_scheduler_targets command:
$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scratch', 'cache_quota_mb': 3400000, 'cache_reserve_mb': 10240, 'custom_var_names': ['slurmnode'], 'custom_vars': {}, 'desc': None, 'hostname': '4gpu', 'lane': '4gpu', 'name': '4gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash -e\n#\n# cryoSPARC script for SLURM submission with sbatch\n#\n# 24-01-2022 Dirk Kostrewa Original file\n# 23-05-2023 Dirk Kostrewa Added cluster submission script variable "slurmnode"\n# 01-03-2024 Dirk Kostrewa cgroups: no CUDA_VISIBLE_DEVICES, no --gres-flags=enforce-binding\n# 06-03-2024 Dirk Kostrewa Double memory allocation in "--mem="\n\n#SBATCH --partition=4gpu\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mail-type=NONE\n#SBATCH --mail-user={{ cryosparc_username }}\n#SBATCH --nodelist={{ slurmnode }}\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm 4gpu', 'tpl_vars': ['job_log_path_abs', 'cluster_job_id', 'project_uid', 'ram_gb', 'cryosparc_username', 'num_gpu', 'num_cpu', 'run_cmd', 'command', 'slurmnode', 'job_uid'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': '/scratch', 'cache_quota_mb': 3400000, 'cache_reserve_mb': 10240, 'custom_var_names': ['slurmnode'], 'custom_vars': {}, 'desc': None, 'hostname': '6gpu', 'lane': '6gpu', 'name': '6gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash -e\n#\n# cryoSPARC script for SLURM submission with sbatch\n#\n# 24-01-2022 Dirk Kostrewa Original file\n# 23-05-2023 Dirk Kostrewa Added cluster submission script variable "slurmnode"\n# 01-03-2024 Dirk Kostrewa cgroups: no CUDA_VISIBLE_DEVICES, no --gres-flags=enforce-binding\n# 06-03-2024 Dirk Kostrewa Double memory allocation in "--mem="\n\n#SBATCH --partition=6gpu\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mail-type=NONE\n#SBATCH --mail-user={{ cryosparc_username }}\n#SBATCH --nodelist={{ slurmnode }}\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm 6gpu', 'tpl_vars': ['job_log_path_abs', 'cluster_job_id', 'project_uid', 'ram_gb', 'cryosparc_username', 'num_gpu', 'num_cpu', 'run_cmd', 'command', 'slurmnode', 'job_uid'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': '/scratch', 'cache_quota_mb': 3400000, 'cache_reserve_mb': 10240, 'custom_var_names': ['slurmnode'], 'custom_vars': {}, 'desc': None, 'hostname': '2gpu', 'lane': '2gpu', 'name': '2gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash -e\n#\n# cryoSPARC script for SLURM submission with sbatch\n#\n# 24-01-2022 Dirk Kostrewa Original file\n# 23-05-2023 Dirk Kostrewa Added cluster submission script variable "slurmnode"\n# 01-03-2024 Dirk Kostrewa cgroups: no CUDA_VISIBLE_DEVICES, no --gres-flags=enforce-binding\n# 06-03-2024 Dirk Kostrewa Double memory allocation in "--mem="\n\n#SBATCH --partition=2gpu\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mail-type=NONE\n#SBATCH --mail-user={{ cryosparc_username }}\n#SBATCH --nodelist={{ slurmnode }}\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm 2gpu', 'tpl_vars': ['job_log_path_abs', 'cluster_job_id', 'project_uid', 'ram_gb', 'cryosparc_username', 'num_gpu', 'num_cpu', 'run_cmd', 'command', 'slurmnode', 'job_uid'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw'}]
@dirk Thanks for sharing the scheduler target information. Please can you also let us know the output of
cat /sys/kernel/mm/transparent_hugepage/enabled
on your cluster worker nodes, and the value of the
#SBATCH --cpus-per-task=
parameter.
We are still looking into this. One other thing which anyone encountering this issue could do to help us is to share your /etc/slurm/slurm.conf
and /etc/slurm/cgroup.conf
files with us (or the corresponding ones, if located elsewhere). They can be DM’d to me or posted in this thread. There may be some dependence on certain details of cluster configuration.
Thanks
Dear wtempel and hsnyder,
I will send our SLURM configuration files in a separate direct message to hsnyder.
Meanwhile, complaints from our users force me to downgrade to v4.5.3. When the issues with the current v4.6.0 have been solved, I will upgrade CryoSPARC again.
Best regards,
Dirk
For troubleshooting job stalls as those described in this topic, we still recommend testing whether the stall is resolved by disabling THP.
It is unclear to us at this time whether the new IO subsystem introduced in v4.6.0 is affected by THP differently than earlier implementations of the IO subsystem.
For stalls observed for GPU-accelerated CryoSPARC jobs on a cluster-type CryoSPARC instance, one may try to increase the number of requested CPUs. We received feedback that this increase resolved the stall for jobs submitted with a modified script template.
For example, one might replace the line
#SBATCH --cpus-per-task={{ num_cpu }}
in the current template with
{% set increased_num_cpu = 8 -%}
#SBATCH --cpus-per-task={{ [1, num_cpu, [increased_num_cpu*num_gpu, increased_num_cpu]|min]|max }}
This recommendation is based on user feedback. Please be aware that we have not yet reproduced stalls that were resolved by increasing the number of requested CPUs beyond {{ num_cpu }}. Our recommendation to increase the number of requested CPUs is subject to change, pending user feedback and our own testing. Please report your observations regarding the
--cpus-per-task=
or equivalent parameter in this thread.
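For anyone unsure what the modified template requests: the Jinja expression above reduces to max(1, num_cpu, min(8 * num_gpu, 8)). A plain-Python sketch of that arithmetic (illustrative only, not CryoSPARC code; `cpus_per_task` is a name made up here):

```python
# Illustrative re-statement of the Jinja expression
#   {{ [1, num_cpu, [increased_num_cpu*num_gpu, increased_num_cpu]|min]|max }}
# with increased_num_cpu = 8, as set in the template snippet above.
def cpus_per_task(num_cpu: int, num_gpu: int, increased_num_cpu: int = 8) -> int:
    """max(1, num_cpu, min(increased_num_cpu * num_gpu, increased_num_cpu))"""
    return max(1, num_cpu, min(increased_num_cpu * num_gpu, increased_num_cpu))

# A job requesting 4 CPUs and 2 GPUs is bumped up to 8 CPUs:
print(cpus_per_task(num_cpu=4, num_gpu=2))   # 8
# A job that already requests more than 8 CPUs is left unchanged:
print(cpus_per_task(num_cpu=12, num_gpu=2))  # 12
```

In effect, for any job with at least one GPU, the template raises the CPU request to at least 8 while never lowering what CryoSPARC asked for.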
Special considerations may apply to GPU-accelerated VMs in the cloud that are used with a cluster workload manager like slurm. A suitable cloud-based VM may have (just) the required number of virtual cores, but custom VM and/or workload manager settings may or may not be required in order for those virtual cores to be “recognized” for the purpose of CPU allocations.
Disabling THP requires a reboot of all GPU servers in our SLURM cluster, and this would affect other software users as well.
Anyway, I had to downgrade cryoSPARC to v4.5.3 in order to restore a working cryoSPARC environment for our more than 30 users.
Best regards,
Dirk
Just out of curiosity/naiveté, why does switching transparent_hugepages from madvise to never require a reboot in your cluster environment? Are the worker systems PXE booting an immutable environment? Thanks in advance.
A reboot is only required if you want to completely get rid of THP, which is what I would have done (see here and jump to “To disable THP at run time”).
Best regards,
Dirk
If you’re currently set to madvise, you can add the following line to your cryosparc_master/config.sh and cryosparc_worker/config.sh, which may help:
export NUMPY_MADVISE_HUGEPAGE=0
This will tell numpy not to request hugepages (which it does by default). This will not work if the system-wide THP setting is always. We are considering making this the default in the future, given the number of users who have had hugepage-related problems.
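If it helps anyone check their nodes, the kernel reports the active THP mode with brackets around the current setting in /sys/kernel/mm/transparent_hugepage/enabled. A small illustrative parser (not CryoSPARC code; `active_thp_mode` is a name invented here):

```python
# Read and parse the system-wide THP mode. The file contents look like
# "always [madvise] never", with brackets marking the active setting.
import re
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def active_thp_mode(text: str) -> str:
    """Parse e.g. 'always [madvise] never' -> 'madvise'."""
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else "unknown"

if THP_PATH.exists():
    # On a real worker node, print the live setting.
    print(active_thp_mode(THP_PATH.read_text()))

print(active_thp_mode("always [madvise] never"))  # madvise
```

The NUMPY_MADVISE_HUGEPAGE=0 workaround described above only takes effect when this reports madvise.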
This is not true. Many users in this thread reported that disabling THP did not fix it, which led to the discovery of another category of stalls related to SLURM. Of all the stall reports related to v4.6, most have been resolved for users who disabled THP. We have also reproduced the THP-related stalls internally; they are very reliably reproducible.
My recommendation is still that the first thing tried by anyone experiencing stalls should be to disable THP, either system-wide or, if that is undesirable and the system-wide setting is madvise, via the environment variable I mentioned above.
This is very useful, thanks @hsnyder!
@hsnyder: Many thanks for this insight and your recommendation! This sounds very helpful! I will give it a try as soon as I update to v4.6.x again! (At the moment I want to calm the situation down first.)
Best regards,
Dirk
Hi everyone,
CryoSPARC v4.6.1, released today, contains a change which we believe will fix the non-transparent-hugepage-related stalls on cluster nodes. We were not able to reproduce the problem ourselves so we cannot be 100% certain, but with the help of forum users we discovered a possible stall scenario and fixed it. We would greatly appreciate it if anyone previously experiencing this issue could update to v4.6.1 and confirm that the problem is resolved.
v4.6.1 also reconfigures Python’s numerical library (numpy) to not request huge pages from the operating system. We have found that this change resolves stalls related to transparent huge pages and it is therefore no longer necessary to turn off THP at the system level (leaving the setting at the default “madvise” should no longer cause problems). In v4.6.1, jobs will also emit a warning if the OS is set to “always” enable THP. If you have already changed your OS configuration to disable THP, it is possible (though not necessary) to revert the OS configuration change after upgrade to v4.6.1.
–Harris
Hi @hsnyder ,
Today I have upgraded from 4.5.3 to 4.6.1.
Our cluster nodes have the following configuration:
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
Job does emit a warning
[CPU: 254.7 MB Avail: 297.94 GB]
Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs.
Is this just a warning and nothing to be worried about, since numpy is no longer requesting huge pages from the operating system? I don’t think I can ask them to change this cluster-wide, so we may have to live with it. It would be great if you could clarify.
Good question, my previous message probably could have been clearer about this. Numpy by default will request THPs from the OS using the madvise system call. That’s what the madvise setting is about. The change we made in 4.6.1 is to prevent numpy from making that request. That change will only have an effect if the system-wide setting is madvise. If it’s always, then the kernel will always try to use THPs, whether an application requests them or not, and likewise never means the OS will never try to use THPs. There’s nothing we can really do about a system that is set to always use THP, which is why we issue the warning. That said, some users don’t experience these problems - it possibly depends on the Linux kernel version. The warning is just to bring to your attention the fact that CryoSPARC itself can’t do anything about the system trying to use THP, and if you experience jobs stalling or becoming egregiously slow, turning that system-wide setting off could be indicated.
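The interaction Harris describes can be summarized in a small decision sketch (illustrative only; `thp_active_for_numpy` is a name made up here, with mode names as reported by the kernel):

```python
# Whether numpy allocations may end up backed by transparent huge pages,
# given the system-wide THP mode and whether numpy opts in via madvise().
def thp_active_for_numpy(system_mode: str, numpy_madvise: bool) -> bool:
    if system_mode == "always":
        return True           # kernel applies THP regardless of the application
    if system_mode == "never":
        return False          # kernel never applies THP
    if system_mode == "madvise":
        return numpy_madvise  # THP only for memory the application opts in
    raise ValueError(f"unknown THP mode: {system_mode}")

# v4.6.1 behaviour: numpy no longer opts in, so 'madvise' systems are safe...
print(thp_active_for_numpy("madvise", numpy_madvise=False))  # False
# ...but an 'always' system still uses THP, hence the warning.
print(thp_active_for_numpy("always", numpy_madvise=False))   # True
```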
Harris
Thanks for the clarification @hsnyder
Would adding the following to config.sh on the worker help?
export NUMPY_MADVISE_HUGEPAGE=0
No, that’s exactly what we do in v4.6.1. We don’t do it via config.sh, but it’s exactly the same mechanism. It works the way I described previously.
Harris
Today I updated CryoSPARC from v4.6.0 to v4.6.1 on my workstation. Now I am running Local Refinement and NU Refinement, and in both cases I got these two warnings, which I have never seen before:
Should I worry about these warnings? Thanks!
Hi @donghuachen,
You don’t need to worry about them, but they do indicate potential problems. It seems you have transparent huge pages set to [always]. This is only a problem if it results in job stalls on your particular system, so it may or may not be something you should change. Also, CryoSPARC disabled io_uring due to lack of kernel support, which suggests you might be using a very old Linux distribution, like CentOS 7? I recommend upgrading for many reasons, but this is just a performance thing, not a correctness problem.
Harris
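On the io_uring point: io_uring was introduced in Linux kernel 5.1, so any older kernel triggers the fallback Harris mentions. A rough illustrative check (assuming the usual `uname -r` release format; `supports_io_uring` is a name invented here):

```python
# Rough check of io_uring availability from a `uname -r` style string.
# io_uring was added in Linux 5.1; older kernels (e.g. CentOS 7's 3.10,
# CentOS Stream 8's 4.18) lack it.
def supports_io_uring(release: str) -> bool:
    """Parse '5.15.0-91-generic' -> (5, 15) and compare against (5, 1)."""
    major, minor = (int(x) for x in release.split("-")[0].split(".")[:2])
    return (major, minor) >= (5, 1)

print(supports_io_uring("5.15.0-91-generic"))     # True
print(supports_io_uring("4.18.0-425.el8"))        # False (CentOS Stream 8)
print(supports_io_uring("3.10.0-1160.el7.x86_64"))  # False (CentOS 7)
```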
Hi @hsnyder ,
Thanks for your reply.
I just checked my linux version and found the following:
cat /etc/centos-release
CentOS Stream release 8