DeepEMhancer GPU allocation on SLURM

Dear CryoSPARC team,

I noticed that DeepEMhancer jobs launched to SLURM processing lanes would occasionally fail with an out-of-memory error.
Further digging showed that all such jobs are sent to GPU 0, causing collisions with other jobs on the same node.

This is because the GPU id is effectively hardcoded in the wrapper:

cryosparc_worker/cryosparc_compute/jobs/deepemhancer/run.py

line 81:
cuda_dev = res_alloc['slots']['GPU'][0]

line 109:
if use_half_maps:
    command = [deepem_exec_path, '-i', half_A_abs_path, '-i2', half_B_abs_path, '-o', output_path, '-g', str(cuda_dev)]
else:
    command = [deepem_exec_path, '-i', map_abs_path, '-o', output_path, '-g', str(cuda_dev)]

Because CryoSPARC is not in charge of allocating the GPU slot on a cluster lane, it seems to set cuda_dev to 0.
The deepemhancer command will therefore always pass -g 0 (i.e. ask for GPU 0) and, depending on the SLURM configuration, can escape the GPU allocation made by SLURM.

Proposed fix:

In the description of the DeepEMhancer job type, a sample wrapper script is provided (a sketch of the fully amended script is shown at the end of this post).
Amending its last line to read:

exec deepemhancer -g $SLURM_JOB_GPUS "$@"

and changing run.py to:

if use_half_maps:
    command = [deepem_exec_path, '-i', half_A_abs_path, '-i2', half_B_abs_path, '-o', output_path]
else:
    command = [deepem_exec_path, '-i', map_abs_path, '-o', output_path]

fixed the issue for me.
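
For reference, a minimal sketch of what the fully amended wrapper script could look like, assuming a conda-based DeepEMhancer installation; the conda path and environment name are placeholders, and only the final exec line is taken from the proposal above:

#!/bin/bash
# Hypothetical wrapper for the CryoSPARC DeepEMhancer job type.
# Activate the (placeholder) conda environment providing deepemhancer.
source /opt/miniconda3/etc/profile.d/conda.sh
conda activate deepemhancer_env

# Pass the GPU(s) assigned by SLURM explicitly instead of the -g 0 that
# run.py would otherwise hardcode; all remaining arguments come from CryoSPARC.
exec deepemhancer -g $SLURM_JOB_GPUS "$@"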

Thanks for your suggestion, and for raising this interesting topic.

Have you encountered circumstances where a configuration as described in Slurm, GPU, CGroups, ConstrainDevices - #3 by dchin - Discussion Zone - ask.CI is either ineffective in preventing GPU oversubscription or otherwise undesirable?

Hi @wtempel,

Setting CUDA_VISIBLE_DEVICES does not help, as it is not honored by deepEMhancer.

Adding

ConstrainDevices=yes

to cgroup.conf should work, though.
We are already using cgroups but had not enabled this restriction.
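
For anyone else setting this up, the relevant settings would look roughly like the following (illustrative only; merge with your site's existing configuration):

# cgroup.conf
ConstrainDevices=yes

# slurm.conf (the cgroup task plugin must be active for the constraint to be enforced)
TaskPlugin=task/cgroup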

I cannot test this until patch Monday in a couple of weeks, though, so hang on :slight_smile:

Thanks

Follow-up:

Constraining devices at the cgroup level as suggested by @wtempel works. With ConstrainDevices=yes, each job only sees the GPU(s) SLURM has allocated to it, and CUDA enumerates those devices starting at index 0, so the hardcoded -g 0 now resolves to the job's own GPU. Thanks.
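
A quick way to verify that the constraint is active (illustrative; adjust the GPU request and any partition options to your site):

srun --gres=gpu:1 nvidia-smi -L

With ConstrainDevices=yes this should list only the single GPU allocated to the job, even on a multi-GPU node.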


Just a note for anyone who implements the same change:

if GPU selection was previously done at script level with lines such as:

export CUDA_VISIBLE_DEVICES=$SLURM_JOB_GPUS

that line now needs to be removed, otherwise your script might be unable to find the assigned GPU(s) on the node (see the sketch below).
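
For illustration, the change amounts to something like this (a sketch only; the comments explain why the export becomes harmful):

# Before: manual pinning to the physical GPU index reported by SLURM.
export CUDA_VISIBLE_DEVICES=$SLURM_JOB_GPUS

# After: no export at all. The device cgroup already limits the job to its
# allocated GPU(s), and CUDA enumerates those starting at index 0, so the
# physical index in $SLURM_JOB_GPUS no longer matches and would hide the GPU.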

Thanks for confirming that device constraints resolve the issue in your case, and for the useful reminder.