Hello, thanks for the replies.
@Hossein:
I think that from SLURM’s perspective both commands will give your job num_cpu cores. The -n {{ num_cpu }} form is a bit risky, though, since the tasks can be spread across multiple nodes, whereas your solution limits them to a single node.
Yes, it limits the job submission to a single node, but GRES:GPU in SLURM is also allocated per node, so if you ask for 4 GPUs in cryoSPARC (even without specifying the number of nodes in the script), the scheduler will look for a node with 4 GPUs anyway.
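To make the difference concrete, here is a minimal sketch of the two ways of requesting the cores (the {{ num_cpu }} / {{ num_gpu }} values are whatever cryoSPARC injects for the job):

# Option A: tasks only; SLURM is free to spread them over several nodes
#SBATCH --ntasks={{ num_cpu }}

# Option B: pin the job to one node and take the cores (and GPUs) there
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}   # GRES is granted per node, so that node must have num_gpu GPUs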
Here is my actual configuration:
cluster_info.json:
{
"qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
"worker_bin_path": "/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw",
"title": "debug_cluster",
"cache_path": "/ssd/tmp",
"qinfo_cmd_tpl": "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'",
"qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
"cache_quota_mb": null,
"send_cmd_tpl": "{{ command }}",
"cache_reserve_mb": 10000,
"name": "debug_cluster"
}
cluster_script.sh:
#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=debug
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --nodes=1
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
srun {{ run_cmd }}
CUDA_VISIBLE_DEVICES is set by the srun command, so you don’t need to handle it yourself as in the example script.
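For reference, the two files are registered with the master using the standard cluster connect command, run from the directory that contains them (the path below is just an example):

cd /home/cryosparc_user/cluster_config   # example directory holding cluster_info.json and cluster_script.sh
cryosparcm cluster connect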
@Ali:
The job is a 2D classification with 2 GPUs:
Launching job on lane debug_cluster target debug_cluster ...
License is valid.
Launching job on cluster debug_cluster
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J16
#SBATCH --partition=debug
#SBATCH --output=/home/rnavaza/csparc_PTO/P1/J16/job.log
#SBATCH --error=/home/rnavaza/csparc_PTO/P1/J16/job.log
#SBATCH --nodes=1
#SBATCH --mem=24000M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding
srun /home/cryosparc_user/cryosparc2_worker/bin/cryosparcw run --project P1 --job J16 --master_hostname master.example.org --master_command_core_port 39002 > /home/rnavaza/csparc_PTO/P1/J16/job.log 2>&1
==========================================================================
==========================================================================
-------- Submission command:
sbatch /home/rnavaza/csparc_PTO/P1/J16/queue_sub_script.sh
-------- Cluster Job ID:
203
-------- Queued at 2018-10-05 21:20:35.942541
-------- Job status at 2018-10-05 21:20:35.961971
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
203 debug cryospar cryospar PD 0:00 1 (None)
Project P1 Job J16 Started
Master running v2.3.0, worker running v2.3.0
Running on lane debug_cluster
Resources allocated:
Worker: debug_cluster
CPU : [0, 1]
GPU : [0, 1]
RAM : [0, 1, 2]
SSD : True
--------------------------------------------------------------
Importing job module for job type class_2D...
Job ready to run
***************************************************************
Using random seed of 1555189427
Loading a ParticleStack with 89761 items...
SSD cache : cache successfuly synced in_use
SSD cache : cache successfuly synced, found 22440.37MB of files on SSD.
SSD cache : cache successfuly requested to check 127 files.
SSD cache : cache requires 0.00MB more on the SSD for files to be downloaded.
SSD cache : cache has enough available space.
SSD cache : cache starting transfers to SSD.
SSD cache : complete, all requested files are available on SSD.
Done.
Windowing particles
Done.
Using 300 classes.
Computing 2D class averages:
Volume Size: 128 (voxel size 2.42A)
Zeropadded Volume Size: 256
Data Size: 256 (pixel size 1.21A)
Using Resolution: 6.00A (51.0 radius)
Windowing only corners of 2D classes at each iteration.
Using random seed for initialization of 1735495459
Done in 1.148s.
Start of Iteration 0
I’m not sure how to resolve the GPU binding problem. I can try to set up a “heterogeneous job” in SLURM to work around it. Can you confirm that cryoSPARC needs one MPI process per GPU and “num_cpu / num_gpu” threads per MPI process? Or does it need “num_cpu” MPI processes?
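In case it helps the discussion, here is a rough, untested sketch of what I have in mind for the heterogeneous job (on SLURM 17.11/18.08 the component separator is “#SBATCH packjob”; it was renamed to “#SBATCH hetjob” in 19.05). Note that this would launch one copy of {{ run_cmd }} per component, which may or may not be what cryoSPARC expects, hence my question above:

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=debug
# component 0: one process bound to its own GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH packjob
# component 1: a second process with its own GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
# launch across both components (on newer SLURM: srun --het-group=0,1)
srun --pack-group=0,1 {{ run_cmd }}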