Hello,
We are submitting CryoSPARC jobs to a SLURM queuing system with cgroups enabled.
We request the memory in the submission script template with:
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
For particle extraction jobs we see two kinds of issues:
- If the jobs are run on CPUs only, ram_gb is defined as 0.0 and the job requests 0 memory. This might be the same issue as seen in CryoSPARC Live: issue 4248.
- Especially for larger jobs with many micrographs/particles run on the GPU, the jobs sometimes stall. The jobs still show as running in both CryoSPARC and SLURM, but they no longer make progress. After killing them manually in the CryoSPARC web app, I see the following error in the SLURM stderr file:
slurmstepd: error: *** JOB 480 ON server CANCELLED AT 2020-05-06T09:01:50 ***
slurmstepd: error: Detected 1 oom-kill event(s) in step 480.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
This makes me think that a process in the CryoSPARC worker used slightly more memory than requested and was killed by the OOM handler, which then caused the job to stall.
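To check whether that is really what happened, I suppose the peak memory of the step could be compared against the request through SLURM accounting (assuming sacct/accounting is enabled on the cluster), e.g. for the job above:

sacct -j 480 --format=JobID,State,ReqMem,MaxRSS,Elapsed

If MaxRSS is close to or above ReqMem for the killed step, that would confirm the OOM explanation.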
For now, as a workaround, I changed the submission template to request max(8 GB, min(45 GB, 2*ram_gb)), which seems to help for these jobs.
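Concretely, assuming the cluster submission script is rendered with Jinja2 (version 2.10 or newer, where the min and max list filters are available), the request now looks roughly like the sketch below; the 8000 MB floor and 45000 MB ceiling are just values that happen to work on our nodes:

{# clamp 2*ram_gb (converted to MB) to the range [8000, 45000] MB #}
{%- set requested_mb = (ram_gb * 2 * 1000) | int %}
#SBATCH --mem={{ [8000, [requested_mb, 45000] | min] | max }}MB

With ram_gb = 0.0 this falls back to 8000 MB, which also covers the CPU-only case from the first issue.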
I was wondering if there is a way to get a better estimate of how much memory should be allocated for each job. Related to that, what determines the memory usage of the extraction jobs? Does it depend only on the box size and the number of particles per micrograph, or also on the number of micrographs processed?
Thank you for any input.
Regards,
Andreas