CUDA_ERROR_NO_DEVICE - but only when AF2 is running!

kevinj · March 29, 2024, 12:17am

I don’t think this is a cryosparc-specific problem, but that’s where it’s showing up for us, so I’d appreciate any guidance I can get here.

We have Cryosparc 4.4.1 installed on our local cluster (4 nodes, 4 GPUs each), using cuda 12.4. We have recently installed Alphafold2 from GitHub, which uses cuda 11.8 in a condo environment. Both programs use the same slurm queue.

When AF2 is running, if Cryosparc tries to submit a job to a GPU on the same node I get the error “number.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_NO_DEVICE] Call to cuInit results in CUDA_ERROR_NO_DEVICE” (see below for full traceback)
But when the AF2 job is finished, CS jobs run without a problem on the same node. If I simply activate the AF2 conda environment on the node, I also have no problem.

I can only guess that this is either a problem with the setup of the AF2 conda environment or with the way we are activating it with slurm. Or perhaps it’s not possible to use different versions of cuda on different GPUs on the same node at the same time. But it may really be something else, and I’m at a loss how to investigate any further. Any help/insight would be much appreciated.

output of nvidia-smi --query-gpu=index,name,driver_version --format=csv
index, name, driver_version
0, NVIDIA RTX A6000, 550.54.14
1, NVIDIA RTX A6000, 550.54.14
2, NVIDIA RTX A6000, 550.54.14
3, NVIDIA RTX A6000, 550.54.14

output of bin/cryosparcw gpulist
Detected 4 CUDA devices.

id pci-bus name

   0                24  NVIDIA RTX A6000
   1                59  NVIDIA RTX A6000
   2               134  NVIDIA RTX A6000
   3               175  NVIDIA RTX A6000

Full error message

[CPU: 418.0 MB]
Traceback (most recent call last) :
File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py”, line 258, in ensure_initialized
self. cuInit (0)
File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py”, line 352, in safe_cuda_api_call
return self._check_cuda_python_error (fname, libfn(*args))
File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver-py”, line 412, in_check_cuda_python_error
raise CudaAPIError (retcode, msg)
numba. cuda. cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_NO_DEVICE] Call to cuInit results in CUDA_ERROR_NO_DEVICE
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “cryosparc_master/cryosparc_compute/run.py”, line 95, in cryosparc_master.cryosparc_compute.run.main File “cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py”, line 276, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun. run_class_2D
File “cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py”, line 547, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun. class2D_engine_run
File “cryosparc_master/cryosparc_compute/gpu/gpucore.py”, line 47, in cryosparc_master. cryosparc_compute.gpu gpucore. initialize File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cud adrv/driver-py”, line 3318, in get_version return driver.get_version()
File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver-py”, line 465, in get_version
version = driver.cuDriverGetVersion()
File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py”, line 296, in getattr
self.ensure_initialized()
File “/home/exx/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver-py”, line 262, in ensure_initialized
raise CudaSupportError(f"Error at driver init: {description}")

numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results In a CUDA_ERROR_NO_DEVICE (100)

rbs_sci · March 29, 2024, 5:57am

Does AlphaFold run inside Docker? How does Docker handle GPU assignments? Does it require passthrough? GPU passthrough will make the GPU unavailable for the host OS if it works like virtual machine passthrough, and may make CUDA think there are no devices…

wtempel · April 2, 2024, 4:35pm

@kevinj What is the output of the command

nvidia-smi --query-gpu=index,name,compute_mode --format=csv

?

kevinj · April 3, 2024, 1:26am

0, NVIDIA RTX A6000, Default
1, NVIDIA RTX A6000, Default
2, NVIDIA RTX A6000, Default
3, NVIDIA RTX A6000, Default

I’m traveling in part of the US with very spotty internet access and may be slow to respond this week

kevinj · April 3, 2024, 1:27am

Hi, no we aren’t running AlphaFold in Docker.

wtempel · April 3, 2024, 9:09pm

@kevinj Please can you post the the output of the command
cryosparcm cli "get_scheduler_targets()" and indicate the name of the CryoSPARC scheduler lane you used for the job(s) that did not run.
Do you ConstrainDevices in slurm?

rbs_sci · April 4, 2024, 12:02am

OK, thanks. I thought Docker was the “approved” way of running local AlphaFold, which is why I asked. From a search, seems since I set up my testing local copy others have come up with non-Docker ways of doing it.

kevinj · April 4, 2024, 4:12am

Output is below. ‘io’ is the particular node that gives the error, so we get the error when running in lane ‘io’ or ‘all’, if the job is assigned to io. I don’t have ConstrainDevices set in slurm.

[{'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'all', 'lane': 'all', 'name': 'all', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -p all\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb)|int }}GB \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --error=/home/exx//Slurmlogs/%j.err\n#SBATCH --output=/home/exx//Slurmlogs/%j.out\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'garcialab', 'tpl_vars': ['worker_bin_path', 'num_cpu', 'ram_gb', 'run_args', 'job_log_path_abs', 'command', 'cluster_job_id', 'num_gpu', 'project_uid', 'job_uid', 'job_creator', 'run_cmd', 'project_dir_abs', 'job_dir_abs', 'cryosparc_username'], 'type': 'cluster', 'worker_bin_path': '/home/exx/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'notio', 'lane': 'notio', 'name': 'notio', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -p notio\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb)|int }}GB \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --error=/home/exx//Slurmlogs/%j.err\n#SBATCH --output=/home/exx//Slurmlogs/%j.out\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'garcialab', 'tpl_vars': ['worker_bin_path', 'num_cpu', 'ram_gb', 'run_args', 'job_log_path_abs', 'command', 'cluster_job_id', 'num_gpu', 'project_uid', 'job_uid', 'job_creator', 'run_cmd', 'project_dir_abs', 'job_dir_abs', 'cryosparc_username'], 'type': 'cluster', 'worker_bin_path': '/home/exx/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'jupiter', 'lane': 'jupiter', 'name': 'jupiter', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -p jupiter\n#SBATCH -n {all{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb)|int }}GB \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --error=/home/exx//Slurmlogs/%j.err\n#SBATCH --output=/home/exx//Slurmlogs/%j.out\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'garcialab', 'tpl_vars': ['worker_bin_path', 'num_cpu', 'ram_gb', 'run_args', 'job_log_path_abs', 'command', 'cluster_job_id', 'num_gpu', 'project_uid', 'job_uid', 'job_creator', 'run_cmd', 'project_dir_abs', 'job_dir_abs', 'cryosparc_username'], 'type': 'cluster', 'worker_bin_path': '/home/exx/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'io', 'lane': 'io', 'name': 'io', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -p io\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb)|int }}GB \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --error=/home/exx//Slurmlogs/%j.err\n#SBATCH --output=/home/exx//Slurmlogs/%j.out\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'garcialab', 'tpl_vars': ['worker_bin_path', 'num_cpu', 'ram_gb', 'run_args', 'job_log_path_abs', 'command', 'cluster_job_id', 'num_gpu', 'project_uid', 'job_uid', 'job_creator', 'run_cmd', 'project_dir_abs', 'job_dir_abs', 'cryosparc_username'], 'type': 'cluster', 'worker_bin_path': '/home/exx/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'europa', 'lane': 'europa', 'name': 'europa', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -p europa\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb)|int }}GB \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --error=/home/exx//Slurmlogs/%j.err\n#SBATCH --output=/home/exx//Slurmlogs/%j.out\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'garcialab', 'tpl_vars': ['worker_bin_path', 'num_cpu', 'ram_gb', 'run_args', 'job_log_path_abs', 'command', 'cluster_job_id', 'num_gpu', 'project_uid', 'job_uid', 'job_creator', 'run_cmd', 'project_dir_abs', 'job_dir_abs', 'cryosparc_username'], 'type': 'cluster', 'worker_bin_path': '/home/exx/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'ganymede', 'lane': 'ganymede', 'name': 'ganymede', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -p ganymede\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb)|int }}GB \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --error=/home/exx//Slurmlogs/%j.err\n#SBATCH --output=/home/exx//Slurmlogs/%j.out\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'garcialab', 'tpl_vars': ['worker_bin_path', 'num_cpu', 'ram_gb', 'run_args', 'job_log_path_abs', 'command', 'cluster_job_id', 'num_gpu', 'project_uid', 'job_uid', 'job_creator', 'run_cmd', 'project_dir_abs', 'job_dir_abs', 'cryosparc_username'], 'type': 'cluster', 'worker_bin_path': '/home/exx/software/cryosparc/cryosparc_worker/bin/cryosparcw'}]

wtempel · April 4, 2024, 3:13pm

Thanks @kevinj for sharing these details.

I can see how setting CUDA_VISIBLE_DEVICES inside a slurm script can fail at brokering GPU resources between multiple cluster jobs on the same host. May I suggest

updating the CryoSPARC cluster script templates to remove any code block like

available_devs=""
for devidx in $(seq 0 15);
do
 if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
 if [[ -z "$available_devs" ]] ; then
 available_devs=$devidx
 else
 available_devs=$available_devs,$devidx
 fi
 fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

instead isolating GPU resources between jobs using slurm cgroup settings. For an example, please see Slurm, GPU, CGroups, ConstrainDevices - #3 by dchin - Discussion Zone - ask.CI.

kevinj · April 10, 2024, 9:50pm

Thanks! I set up cgroups and removed the GPU allocation block from my sbatch script, now everything is playing together nicely.