We're using SLURM as the job manager on our HPC system, and we're seeing some odd behavior that we don't know where to start optimizing, so we were hoping someone here might be able to help. A motioncorr job was submitted with the following SLURM settings:
License is valid.
Launching job on lane bs2 target bs2 ...
Launching job on cluster bs2
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J277 --master_hostname ****** --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J277/job.log 2>&1 - the complete command string to run the job
## 6 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 16.0 - the amount of RAM needed in GB
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J277 - absolute path to the job directory
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2 - absolute path to the project dir
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J277/job.log - absolute path to the log file for the job
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P2 --job J277 --master_hostname ****** --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P2 - uid of the project
## J277 - uid of the job
## Bryan Hansen - name of the user that created the job (may contain spaces)
## hansenbry@niaid.nih.gov - cryosparc username of the user that created the job (usually an email)
##
#### What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P2_J277
#SBATCH --gres=gpu:1
#SBATCH -p gpu
#SBATCH --cpus-per-task=6
#SBATCH --mem=48384MB
available_devs=""
for devidx in $(seq 0 15); do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]]; then
        if [[ -z "$available_devs" ]]; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J277 --master_hostname ***** --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J277/job.log 2>&1
==========================================================================
==========================================================================
-------- Submission command:
sbatch /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J277/queue_sub_script.sh
-------- Cluster Job ID:
264933
-------- Queued on cluster at 2021-08-24 09:06:08.419585
-------- Job status at 2021-08-24 09:06:08.450041
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
264933 gpu cryospar cryo1 PD 0:00 6 (None)
This resulted in 6 nodes being taken, despite only 1 GPU being requested. The GPU nodes also have 16 cores each, so --cpus-per-task=6 should not have spread the job across 6 nodes. The reason we don't know where to start looking is that a CTF job with the following SLURM settings
License is valid.
Launching job on lane bs2 target bs2 ...
Launching job on cluster bs2
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J276 --master_hostname ***** --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J276/job.log 2>&1 - the complete command string to run the job
## 2 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 8.0 - the amount of RAM needed in GB
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J276 - absolute path to the job directory
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2 - absolute path to the project dir
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J276/job.log - absolute path to the log file for the job
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P2 --job J276 --master_hostname ***** --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P2 - uid of the project
## J276 - uid of the job
## Bryan Hansen - name of the user that created the job (may contain spaces)
## hansenbry@niaid.nih.gov - cryosparc username of the user that created the job (usually an email)
##
#### What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P2_J276
#SBATCH --gres=gpu:1
#SBATCH -p gpu
#SBATCH --cpus-per-task=2
#SBATCH --mem=24192MB
available_devs=""
for devidx in $(seq 0 15); do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]]; then
        if [[ -z "$available_devs" ]]; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J276 --master_hostname ***** --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J276/job.log 2>&1
==========================================================================
==========================================================================
-------- Submission command:
sbatch /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J276/queue_sub_script.sh
-------- Cluster Job ID:
264932
only pulled 1 node as expected, and not 2 as the pattern from the motioncorr job would have suggested. We're currently running cryoSPARC v3.2.0. Also, I was asked to remove the hostname from this report, which is why it's masked above.
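
In case it's useful, here is what we were planning to run next to see what SLURM actually recorded for each job (job IDs taken from the output above; the exact field names in the scontrol output can differ between SLURM versions):

scontrol show job 264933 | grep -E 'NumNodes|NumCPUs|NumTasks|TresPerNode'
scontrol show job 264932 | grep -E 'NumNodes|NumCPUs|NumTasks|TresPerNode'

We have also been wondering whether adding something like the lines below to the cluster submission template would pin each job to a single node, but we haven't tested it yet, so please treat it as a sketch rather than a fix:

#SBATCH --nodes=1
#SBATCH --ntasks=1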