I am running a cryoSPARC job through a cluster submission script, but the output shows a resource request different from what the script asks for. Here is the script:
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /cluster/software/cryosparc/v4.4.1/cryosparc_worker/bin/cryosparcw run --project P41 --job J10 --master_hostname test --master_command_core_port 39002 > /home/data/test_jobs/CS-test-4/J10/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
#### 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 8.0 - the amount of RAM needed in GB
## /home/test_jobs/CS-test-asif-4/J10 - absolute path to the job directory
## /home/test_jobs/CS-test-asif-4 - absolute path to the project dir
## /home/test_jobs/CS-test-asif-4/J10/job.log - absolute path to the log file for the job
## /cluster/software/cryosparc/v4.4.1/cryosparc_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P41 --job J10 --master_hostname test --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P41 - uid of the project
## J10 - uid of the job
## test - name of the user that created the job (may contain spaces)
## test@email.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P41_J10
#SBATCH -c 2
#SBATCH --gres=gpu:1g.10gb:1
#SBATCH -p gpu
#SBATCH --mem=8G
#SBATCH --time 2:00:00
#SBATCH --output=/home/test_jobs/CS-test-4/J10/job.log
#SBATCH --error=/home/test_jobs/CS-test-4/J10/job.log
export CUDA_VISIBLE_DEVICES=`echo $CUDA_VISIBLE_DEVICES| awk -F ',' '{print NF}'`
/cluster/software/cryosparc/v4.4.1/cryosparc_worker/bin/cryosparcw run --project P41 --job J10 --master_hostname test --master_command_core_port 39002 > /home/test_jobs/CS-test-4/J10/job.log 2>&1 | egrep -o '[0-9]+'
==========================================================================
==========================================================================
-------- Submission command:
ssh sbatch /home/test_jobs/CS-test-4/J10/queue_sub_script.sh
-------- Cluster Job ID:
1398668
-------- Queued on cluster at 2024-02-22 14:53:43.290215
-------- Cluster job status at 2024-02-22 14:53:44.220706 (0 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1398668 gpu cryospar R 0:00 1 g01
[CPU: 214.9 MB]
Job J10 Started
[CPU: 214.9 MB]
Master running v4.4.1, worker running v4.4.1
[CPU: 214.9 MB]
Working in directory: /home/test_jobs/CS-test-4/J10
[CPU: 214.9 MB]
Running on lane HB_MIG_test
[CPU: 214.9 MB]
Resources allocated:
[CPU: 214.9 MB]
Worker: HB_MIG_test
[CPU: 214.9 MB]
CPU : [0, 1, 2, 3]
[CPU: 214.9 MB]
GPU : [0]
[CPU: 214.9 MB]
RAM : [0]
[CPU: 214.9 MB]
SSD : False
I requested 2 CPUs, but the output says 4 CPUs. Is it trying to run on 4 CPUs? The GPU shows 0; is this the GPU id on which the job is running?
Should the CUDA_VISIBLE_DEVICES environment variable be the number of GPU cards or the card id?
IIRC, if your SLURM cluster constrains devices via cgroups (i.e. you have ConstrainDevices=yes in your /etc/slurm-llnl/cgroup.conf), then SLURM will:
1 - mask all non-requested CUDA devices
2 - automatically set CUDA_VISIBLE_DEVICES to the appropriate value (0,1,…,ngpus-1)
e.g.
srun -p cryoem -t 1-0 --gres=gpu:1 nvidia-smi #all nodes on cryoem have 8 gpus
Mon Feb 26 09:17:58 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:08:00.0 Off | N/A |
| 27% 30C P8 2W / 250W | 6MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2781 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
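If you are not sure whether your cluster constrains devices, the setting lives in the cgroup configuration file mentioned above. A minimal sketch of the relevant line, assuming the Debian-style path (other distros use e.g. /etc/slurm/cgroup.conf; confirm with your sysadmin):

# /etc/slurm-llnl/cgroup.conf (path varies by distro)
ConstrainDevices=yes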
If you do not constrain devices via cgroups, then you should set
export CUDA_VISIBLE_DEVICES=$SLURM_JOB_GPUS
In this case all GPUs will be visible to your script, and GPU usage will rely on your script following the directives of CUDA_VISIBLE_DEVICES.
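In the submission script above, that would replace the existing export line, which currently sets CUDA_VISIBLE_DEVICES to a GPU count rather than a device list. A minimal sketch, assuming SLURM_JOB_GPUS is populated on your cluster:

# use the device list SLURM allocated; do not overwrite it with a count
if [ -n "${SLURM_JOB_GPUS:-}" ]; then
    export CUDA_VISIBLE_DEVICES="${SLURM_JOB_GPUS}"
fi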
Native cryoSPARC jobs always respect CUDA_VISIBLE_DEVICES.
However, without constraining devices, DeepEMHancer launched by cryoSPARC would (as of the last time I checked) default to GPU 0 regardless and escape cluster resource management. Topaz and the other wrappers shipped with cryoSPARC might also be problematic (I am not sure of their implementation details).
CUDA_VISIBLE_DEVICES can deal with MIG, and each slice can be set as a different device.
So device 0 is probably a slice. Ask your sysadmin.
Is setting CUDA_VISIBLE_DEVICES=0,1,2,3 respected on MIG nodes, or is cryoSPARC only able to see the first MIG slice?
It depends on how SLURM is setup. Ask your sysadmin.
But let’s say you ask for 1 GPU (or 1 slice):
If you have device constraining on, SLURM renames and hides devices.
Your script will only see the assigned slice of the assigned GPU as device id 0 (even if you are physically running on, e.g., slice 3 of GPU 7).
CUDA_VISIBLE_DEVICES=0 will be set automatically for you.
Other GPUs are invisible. Trying to use, e.g., GPU 1 will raise an error, so setting CUDA_VISIBLE_DEVICES manually might cause your script to fail.
If device constraining is not on, then you want to make sure that you are pointing your script to the correct devices, which are stored in $SLURM_JOB_GPUS. Therefore you want to set CUDA_VISIBLE_DEVICES=$SLURM_JOB_GPUS.
And there are a bajillion possible other configurations, so ask your sysadmin.
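A quick way to see what your jobs actually get is to print the variable and the device list from inside an allocation. A minimal sketch, reusing the partition and MIG gres names from the script above (adjust them to whatever your site defines):

# throwaway job that reports what CUDA will see
srun -p gpu --gres=gpu:1g.10gb:1 bash -c \
    'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"; nvidia-smi -L'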
In the SLURM script I request 2 CPUs, but the cryoSPARC output says it needs 4 CPUs.
Yes. But your sbatch script, for whatever reason, asks for
-c 2
which requests 2 CPUs per task, even though cryoSPARC reported that the job needs 4 (the "## 4 - the number of CPUs needed" line in the template header).
In the default cryoSPARC SLURM template, cores are usually requested as: