Configuring CryoSPARC for a SLURM cluster

I’ve been working with my university’s computing center to get CryoSPARC set up and configured on our cluster. I believe that my cluster submission script and configuration files are set up properly; however, when I run the T20S benchmark, the patch motion job is interrupted with the following error:

License is valid.

Launching job on lane reichow-cs target reichow-cs ...

Launching job on cluster reichow-cs


====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J7
#SBATCH --partition=gpu
#SBATCH --account=reichowlab
#SBATCH --output=/home/exacloud/gscratch/reichowlab/P1/J7/job.log
#SBATCH --error=/home/exacloud/gscratch/reichowlab/P1/J7/job.log
#SBATCH --nodes=1
#SBATCH --qos=normal
#SBATCH --mem-per-cpu=8G
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --error=/home/exacloud/gscratch/reichowlab/P1/J7/error.txt
#SBATCH --gres=gpu:4

available_devs=""
for devidx in $(seq 1 16);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

srun /home/exacloud/gscratch/reichowlab/local/cryosparc_worker/bin/cryosparcw run --project P1 --job J7 --master_hostname reichow-cs.ohsu.edu --master_command_core_port 39002 > /home/exacloud/gscratch/reichowlab/P1/J7/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /home/exacloud/gscratch/reichowlab/P1/J7/queue_sub_script.sh

Failed to launch! 1

I’ve gone back and forth a few times with the computing center and corrected an error that I had initially, but this continued problem is stumping me a little. There isn’t an obvious error, and in the end CryoSPARC isn’t writing out the job.log or error.txt files that the sbatch command is configured to produce (and which others have been asked to post for similar problems). Directories and some files (like job.json and queue_sub_script.sh) are written out, though.
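(Since queue_sub_script.sh does get written out, one thing I can try, as a sketch of the debugging I have in mind, is submitting that generated script by hand from a login node to see the raw message sbatch prints; the path below is just the one from my job above:)

# Submit the script CryoSPARC generated and watch sbatch's own output
sbatch /home/exacloud/gscratch/reichowlab/P1/J7/queue_sub_script.sh

# If it does queue, check its state and any reason SLURM gives
squeue -u $USER
scontrol show job <jobid>     # <jobid> as reported by sbatch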

Hi @RussellM,

To narrow this down, it would help to see your cluster_script.sh and cluster_info.json files.
Also, does your SLURM cluster use cgroups? If it does, you should leave out the whole available_devs/CUDA_VISIBLE_DEVICES block, so that the job only runs on the GPUs SLURM has actually allocated.
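If you’re not sure, a quick way to get a hint (assuming you can read the SLURM configuration from a login node; the cgroup.conf path may differ on your cluster) is something like:

# Look for cgroup-based process tracking and task management
scontrol show config | grep -iE 'ProctrackType|TaskPlugin'

# If readable, check whether device (GPU) constraining is turned on
grep -i ConstrainDevices /etc/slurm/cgroup.conf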

Best,
Jesper

Sorry - I should have included those from the get-go:

cluster_info.json:

{
"qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
"worker_bin_path": "/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/bin/cryosparcw",
"title": "reichow-cs",
"cache_path": "/home/exacloud/gscratch/reichowlab/cryosparc_cache",
"qinfo_cmd_tpl": "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'",
"qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
"cache_quota_mb": 1000000,
"send_cmd_tpl": "{{ command }}",
"cache_reserve_mb": 10000,
"name": "reichow-cs"
}

cluster_script.sh:

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu
#SBATCH --account=reichowlab
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --nodes=1
#SBATCH --qos=normal
#SBATCH --mem-per-cpu=8G
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --error={{ job_dir_abs }}/error.txt
#SBATCH --gres=gpu:4

available_devs=""
for devidx in $(seq 1 16);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

srun {{ run_cmd }}

I do believe that my cluster is using cgroups, but I would have to double-check with them on that.

Thank you,
Russell

I think your cluster_info.json looks fine.
But your cluster_script.sh needs a little edit.
Try something like this instead:

cluster_script.sh:

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu
#SBATCH --account=reichowlab
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH -N 1
#SBATCH --qos=normal
#SBATCH --mem={{ (ram_gb)|int }}G
#SBATCH -n {{ num_cpu }}
#SBATCH --error={{ job_dir_abs }}/error.txt
#SBATCH --gres=gpu:{{ num_gpu }}

{{ run_cmd }}

The above uses the CPU, GPU, and RAM values ({{ num_cpu }}, {{ num_gpu }}, {{ ram_gb }}) that CryoSPARC already sets for each job.
With cgroups enabled in SLURM, CUDA_VISIBLE_DEVICES is set automatically at submission time, so each job sees only the GPUs it was allocated. That makes the GPU nodes much more flexible, since only the resources a job actually needs are reserved and several jobs can run on the same node simultaneously.
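Just as an illustration (the numbers below are made up for a hypothetical single-GPU job; the real values are filled in by CryoSPARC per job type), the resource lines would render to something like:

# --mem comes from {{ (ram_gb)|int }}G, -n from {{ num_cpu }}, --gres from {{ num_gpu }}
#SBATCH --mem=16G
#SBATCH -n 6
#SBATCH --gres=gpu:1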

//Jesper

Thank you - I think your changes to my submit script worked, but I’m not sure whether the job has actually been queued. I haven’t gotten any output yet, but the job is listed in CryoSPARC as ‘Launched’ and is in my list of active jobs, though I don’t see it showing up in the queue from the command line yet.
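(For reference, this is roughly how I’m trying to spot the job from the command line; the account name is just ours from the script above:)

# Jobs queued or running under our account
squeue --account=reichowlab

# Recent job records from the accounting database, if accounting is enabled
# (with no start time, sacct defaults to jobs since midnight today)
sacct -X --account=reichowlab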