Configuring cryoSPARC on an HPC

udalwadi · October 22, 2021, 10:36pm

Hi,

I am trying to set up cryoSPARC on a Compute Canada HPC. I have managed to get cryoSPARC master and worker instances to install without errors, and connect to the cluster using the cluster_config.json and cluster_script.sh files. Importing movies works fine, but Patch motion correction jobs don’t seem to work. This is the content from the job.log file of a failed job:

================= CRYOSPARCW ======= 2021-10-21 17:18:16.561103 =========
Project P1 Job J12
Master cdr767.int.cedar.computecanada.ca Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 27221
MAIN PID 27221
motioncorrection.run_patch cryosparc_compute.jobs.jobregister
Traceback (most recent call last):
File “”, line 1, in
File “cryosparc_worker/cryosparc_compute/run.py”, line 172, in cryosparc_compute.run.run
File “/project/6003680/udalwadi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py”, line 1911, in get_gpu_info
} for devid in devs ]
File “/project/6003680/udalwadi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py”, line 1911, in
} for devid in devs ]
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal

Running job on hostname %s cedar
Allocated Resources : {‘fixed’: {‘SSD’: False}, ‘hostname’: ‘cedar’, ‘lane’: ‘cedar’, ‘lane_type’: ‘cedar’, ‘license’: True, ‘licenses_acquired’: 4, ‘slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7]}, ‘target’: {‘cache_path’: ‘/localscratch/udalwadi.*/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘hostname’: ‘cedar’, ‘lane’: ‘cedar’, ‘name’: ‘cedar’, ‘qdel_cmd_tpl’: ‘scancel {{ cluster_job_id }}’, ‘qinfo_cmd_tpl’: ‘sinfo’, ‘qstat_cmd_tpl’: ‘squeue -j {{ cluster_job_id }}’, ‘qsub_cmd_tpl’: ‘sbatch {{ script_path_abs }}’, ‘script_tpl’: ‘#!/usr/bin/env bash\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n\n#SBATCH --account=def-calyip\n\n#SBATCH --output={{ job_dir_abs }}/output.txt\n\n#SBATCH --error={{ job_dir_abs }}/error.txt\n\n{%- if num_gpu == 0 %}\n\n#SBATCH --ntasks={{ num_cpu }}\n\n#SBATCH --cpus-per-task=1\n\n#SBATCH --threads-per-core=1\n\n{%- else %}\n\n#SBATCH --nodes=1 \n#SBATCH --gres=gpu:p100l:4 \n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=24 # There are 24 CPU cores on P100 Cedar GPU nodes\n#SBATCH --mem=0 # Request the full memory of the node\n#SBATCH --time=03:00:00\n#SBATCH --cpus-per-task=1\n\n{%- endif %}\n\nmodule load cuda/11.0\n\nmkdir -p /localscratch/udalwadi.16247496.0/cryosparc_cache\n\navailable_devs=""\nfor devidx in $(seq 1 16);\ndo\n\tif [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n\n\t\tif [[ -z “$available_devs” ]] ; then\n \t\t available_devs=$devidx\n\t\telse\n\t\t available_devs=$available_devs,$devidx\n\t\tfi\n\tfi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘cedar’, ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/udalwadi/project/udalwadi/cryosparc/cryosparc_worker/bin/cryosparcw’}}
Process Process-1:4:
Traceback (most recent call last):
File “/project/6003680/udalwadi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py”, line 297, in _bootstrap
self.run()
File “/project/6003680/udalwadi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py”, line 99, in run
self._target(*self._args, **self._kwargs)
File “/project/6003680/udalwadi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py”, line 176, in process_work_simple
process_setup(proc_idx) # do any setup you want on a per-process basis
File “cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py”, line 83, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.process_setup
File “cryosparc_worker/cryosparc_compute/engine/cuda_core.py”, line 34, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal
**** handle exception rc
set status to failed

It seems the worker instance can’t detect the GPUs properly. Is there something I can do differently to ensure proper communication?

Thanks,
Udit

klemens.noga · October 27, 2021, 11:34am

Dear Udit,

Could you send your cluster script? I’ve seen

CUDA_VISIBLE_DEVICES=$available_devs

and

pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal

in the log of your failed job.

I think that you’re giving cryoSPARC the wrong GPU ID.
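As a rough illustration of how a wrong ID could arise: the script template in the job log builds its device list from `seq 1 16`, while the allocated resources list GPUs `[0, 1, 2, 3]`. The sketch below assumes a node with four GPUs numbered from 0 (as the `--gres=gpu:p100l:4` request suggests); `valid_device_ids` is a hypothetical helper written for this example, not part of cryoSPARC or SLURM, and this is only one possible reading of the error.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cryoSPARC): keep only the device indices
# that exist on a node with `gpu_count` GPUs, numbered from 0.
valid_device_ids() {
    local gpu_count=$1 idx out=""
    shift
    for idx in "$@"; do
        if (( idx >= 0 && idx < gpu_count )); then
            out=${out:+$out,}$idx   # append with a comma unless the list is empty
        fi
    done
    echo "$out"
}

# The template loops over 1..16: on a 4-GPU node only 1,2,3 exist,
# and device 0 is never considered.
valid_device_ids 4 $(seq 1 16)

# A 0-based range covers every GPU on the node.
valid_device_ids 4 $(seq 0 3)
```

If the list exported in CUDA_VISIBLE_DEVICES ends up exposing fewer devices than the job asks for, a request for the last GPU slot would plausibly fail with "invalid device ordinal".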

Usually, on HPC clusters that use SLURM as the scheduler, SLURM sets the CUDA_VISIBLE_DEVICES environment variable for each job. Could you check whether that is also the case on your cluster? To do so, you could submit a short test job to the GPU partition that runs the echo $CUDA_VISIBLE_DEVICES and nvidia-smi commands.
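A minimal sketch of such a test job is below. The account and GPU request are copied from the script template in the job log and would need adjusting for another allocation; the job name, time limit, and the `report_gpu_env` function name are assumptions made for this example.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=gpu_env_check
#SBATCH --account=def-calyip      # account from the original post; replace with your own
#SBATCH --gres=gpu:p100l:4        # same GPU request as the cryoSPARC script template
#SBATCH --ntasks=1
#SBATCH --time=00:05:00           # assumed; a few minutes is enough for this check

# Print what the scheduler exported to the job, then list the GPUs the driver sees.
report_gpu_env() {
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi || echo "nvidia-smi failed"
    else
        echo "nvidia-smi not found on this node"
    fi
}

report_gpu_env
```

Submitting it with sbatch and reading the job's output file should show whether the variable is already set for the job, in which case the submission script may not need to override it.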

udalwadi · November 12, 2021, 9:19pm

Thanks for the reply. I was able to resolve this issue by adding a line for

nvidia-smi

to the submission script.