Unable to submit to SLURM queue

hpourreza · September 19, 2018, 3:27pm

Greetings,

I followed the installation instructions (https://cryosparc.com/docs/reference/install/) and installed cryoSPARC v2.2 on a cluster. Everything seems to be fine, and I can see the name of the cluster under Resource Manager -> Compute Configuration in the cryoSPARC GUI. However, when I create a new job (e.g., T20S Tutorial), the Metadata tab shows the following information:
"job_dir": "J2",
"errors_run": [],
"queued_to_lane": "my_cluster",
"run_on_master_direct": true,
"version": "v2.2.0",

and the job runs on the master node. It does not attempt to submit anything to SLURM. Could you please help me understand what I am doing wrong? I found a similar post (cryoSPARC v2 cluster install), but it does not provide a solution. I tried reinstalling the software and that did not help.

Any help will be greatly appreciated.

hpourreza · October 4, 2018, 5:52pm

I received this reply from @apunjani which I think should resolve my issue. I will try a more complicated job to see if it actually starts a SLURM job.

rnavaza · October 9, 2018, 10:50pm

Hi hpourreza,

I’ve been working on the cryoSPARC SLURM submission script and I wanted to share some insights about it.
I noticed that the CTF calculation doesn’t use GPUs, so for this kind of job the SLURM script shouldn’t mention “--gres=gpu:0”, because it will implicitly tell SLURM to consider every other value on a “per node” basis. Let’s say that you want to do the CTF calculation with 100 CPUs: having the “--gres=gpu:0” directive in the script will then tell SLURM to search for a single node with 100 CPUs…

Thanks to the jinja2 template engine, it’s possible to separate CPU jobs from GPU jobs. Also, if you have a cluster with the same number of GPUs on each node, then you can add a rule that calculates the number of nodes and the number of GPUs per node needed to satisfy “num_gpu”.

Here’s a simplified version of my script. I haven’t launched every possible kind of job, so it might be incomplete:

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=debug
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
{%- if num_gpu == 0 %}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu={{ ((ram_gb*1000)/num_cpu)|int }}M
{%- else %}
#SBATCH --nodes=1
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --ntasks-per-node={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
{%- endif %}
{{ run_cmd }}
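
The node-count rule mentioned above (how many nodes, and how many GPUs on each, are needed to satisfy “num_gpu”) comes down to two ceiling divisions. Here is a minimal sketch of that arithmetic in bash. The names ceil_div and gpu_layout and the GPUs-per-node value are hypothetical (they are not cryoSPARC template variables), and the sketch assumes every node has the same number of GPUs and that at least one GPU is requested. Whether a given cryoSPARC job can actually use GPUs spread over several nodes is not addressed in this thread.

```shell
#!/bin/bash
# Hypothetical helpers: split a GPU request evenly across identical nodes.
# The GPUs-per-node argument is an assumed, site-specific value.

# ceil_div A B -> prints the smallest integer >= A/B (integer arithmetic only)
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# gpu_layout NUM_GPU GPUS_PER_NODE -> prints "<nodes> <gpus per node>"
# Assumes NUM_GPU >= 1; the CPU-only branch of the template never needs this.
gpu_layout() {
    local num_gpu=$1 gpus_per_node=$2
    local nodes per_node
    nodes=$(ceil_div "$num_gpu" "$gpus_per_node")   # nodes needed
    per_node=$(ceil_div "$num_gpu" "$nodes")        # even share per node
    echo "$nodes $per_node"
}

gpu_layout 4 8    # fits on one node with 4 GPUs
gpu_layout 10 8   # needs two nodes, 5 GPUs on each
```

When the request does not divide evenly (for example 9 GPUs on 8-GPU nodes), this rounds up and allocates one GPU more than requested. Inside the template itself, the same arithmetic could be expressed with Jinja2’s integer division operator “//”.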

If, like me, your nodes don’t have 4 CPUs per GPU (the number requested for 3D refinements, 3D classifications, etc.), then you can add a rule that reduces the number of requested CPUs based on the value of “num_cpu/num_gpu”.
In my real-case tests, not once did a job hit 200% CPU when allocated more than 1 CPU per GPU…
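
A rule like this can be as simple as capping the CPU request at a fixed number of CPUs per GPU. The following is only a sketch of that cap in bash: the function name cap_cpus and the limit of 2 CPUs per GPU are hypothetical, and CPU-only jobs (num_gpu equal to 0) are passed through unchanged.

```shell
#!/bin/bash
# Hypothetical helper: limit the CPU request to MAX_CPUS_PER_GPU per GPU.

# cap_cpus NUM_CPU NUM_GPU MAX_CPUS_PER_GPU -> prints the CPU count to request
cap_cpus() {
    local num_cpu=$1 num_gpu=$2 max_per_gpu=$3
    local limit=$(( num_gpu * max_per_gpu ))
    if (( num_gpu > 0 && num_cpu > limit )); then
        echo "$limit"      # GPU job asking for more CPUs than the cap allows
    else
        echo "$num_cpu"    # CPU-only job, or request already within the cap
    fi
}

cap_cpus 16 4 2    # 4 CPUs per GPU requested, capped to 2 per GPU
cap_cpus 4 4 2     # already within the cap, left unchanged
cap_cpus 100 0 2   # CPU-only job, left unchanged
```

The right cap depends on the hardware, so the value 2 here is only a placeholder.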

Cheers,
Rafael.

sunny1226 · April 18, 2019, 3:20am

I also have a problem queuing cryoSPARC v2 jobs through SLURM. My job always requests too many CPUs, and sbatch reports the error “error: Batch job submission failed: Requested node configuration is not available”. On my cluster, queuing RELION jobs through SLURM works fine. Have you run into a problem like this?

stephan · April 18, 2019, 4:30pm

Hi @sunny1226,

Which job were you trying to run? How many GPUs did you request?

sunny1226 · April 22, 2019, 1:15am

Thank you. We are using CUDA 10.0 and trying to run a MotionCor2 job. I requested 10 GPUs, since my cluster has 8 GPUs per workstation, but it reported an error: “Batch job submission failed: Requested node configuration is not available.” When I requested 4 GPUs, it reported the same error. After I removed “#SBATCH --gres:gpu:{{ num_gpu }}” from my cluster_scripts.sh, my job could launch.
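
One way to narrow down an error like this is to compare the request against what a single node offers, since a script that asks for one node cannot be satisfied if it needs more GPUs or CPUs than any node has. Below is a minimal sketch of that comparison in bash. The function name check_request and the node sizes in the example calls are hypothetical; the real per-node values would have to come from the cluster itself (for example from `sinfo -N -o "%N %c %m %G"`, which lists CPUs, memory and generic resources per node). It does not explain every cause of this error, only the capacity case.

```shell
#!/bin/bash
# Hypothetical helper: does a single-node request fit on one node?

# check_request REQ_GPU REQ_CPU NODE_GPU NODE_CPU -> prints "ok" or the reason
check_request() {
    local req_gpu=$1 req_cpu=$2 node_gpu=$3 node_cpu=$4
    if (( req_gpu > node_gpu )); then
        echo "too many GPUs"
    elif (( req_cpu > node_cpu )); then
        echo "too many CPUs"
    else
        echo "ok"
    fi
}

# Assumed node size for illustration only: 8 GPUs and 32 CPUs per node.
check_request 10 20 8 32   # more GPUs than one node has
check_request 4 40 8 32    # GPUs fit, but the CPU request does not
check_request 4 16 8 32    # fits on one node
```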