Dear CryoSPARC team,
I noticed that DeepEMhancer jobs launched to SLURM processing lanes would occasionally fail with an out-of-memory error.
Further digging showed that all jobs are sent to GPU 0, causing collisions with other jobs on the same node.
This is because the GPU id is effectively hardcoded in the wrapper:
cryosparc_worker/cryosparc_compute/jobs/deepemhancer/run.py
line 81:
cuda_dev = res_alloc['slots']['GPU'][0]
line 109:
if use_half_maps:
command = [deepem_exec_path, '-i', half_A_abs_path, '-i2', half_B_abs_path, '-o', output_path, '-g', str(cuda_dev)]
else:
command = [deepem_exec_path, '-i', map_abs_path, '-o', output_path, '-g', str(cuda_dev)]
Because CryoSPARC is not in charge of allocating the GPU slot on a cluster lane, cuda_dev appears to default to 0.
The deepemhancer command is therefore always invoked with -g 0 (i.e. it asks for GPU 0) and, depending on the SLURM configuration, can escape the GPU allocation made by SLURM.
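To see whether a node is affected, one quick check from inside an allocation (this assumes GPUs are requested via --gres and that your SLURM populates SLURM_JOB_GPUS; adjust to your setup):

# request one GPU and report what the job can actually see
srun --gres=gpu:1 bash -c 'echo "allocated: $SLURM_JOB_GPUS"; nvidia-smi -L'

If nvidia-smi still lists every GPU on the node (i.e. device access is not constrained via cgroups), then -g 0 always means physical GPU 0, regardless of which device SLURM granted to the job.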
Proposed fix:
In the description of the DeepEMhancer job type, a sample script is provided.
Amending the last line to read:
exec deepemhancer -g $SLURM_JOB_GPUS $@
and changing run.py to:
if use_half_maps:
command = [deepem_exec_path, '-i', half_A_abs_path, '-i2', half_B_abs_path, '-o', output_path]
else:
command = [deepem_exec_path, '-i', map_abs_path, '-o', output_path]
fixed the issue for me.
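For reference, the full wrapper after that change would look roughly like the sketch below; the conda activation path and environment name are placeholders for your own DeepEMhancer install, and only the last line matters for the fix.

#!/bin/bash
# Placeholder activation of the DeepEMhancer environment - adjust to your install
source /path/to/miniconda3/etc/profile.d/conda.sh
conda activate deepemhancer_env
# Pass the GPU id(s) actually granted by SLURM instead of the -g value from run.py;
# "$@" forwards the remaining arguments (-i/-i2/-o) built by the amended run.py,
# quoted so that paths containing spaces survive intact.
exec deepemhancer -g $SLURM_JOB_GPUS "$@"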