cryoSPARC Live and SLURM

#1

We are trying to use cryoSPARC Live in an HPC environment with SLURM as the queuing system. Right now we are getting the following errors:

Unable to start session: {u'message': u"OtherError: Command '['sbatch', '/gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J4/queue_sub_script.sh']' returned non-zero exit status 1", u'code': 500, u'data': None, u'name': u'OtherError'}

License is valid.
Launching job on lane bs2 target bs2 ...
Launching job on cluster bs2
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname ai-rmlcryoprd1.niaid.nih.gov --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/job.log 2>&1 - the complete command string to run the job
## 0 - the number of CPUs needed
## 1 - the number of GPUs needed.
##     Note: the code will use this many GPUs starting from dev id 0
##           the cluster scheduler or this script have the responsibility
##           of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##           using the correct cluster-allocated GPUs.
## 0.0 - the amount of RAM needed in GB
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2 - absolute path to the job directory
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2 - absolute path to the project dir
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/job.log - absolute path to the log file for the job
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P2 --job J2 --master_hostname ai-rmlcryoprd1.niaid.nih.gov --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P2 - uid of the project
## J2 - uid of the job
## Bryan Hansen - name of the user that created the job (may contain spaces)
## hansenbry@niaid.nih.gov - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_P2_J2
#SBATCH -n 0
#SBATCH --gres=gpu:1
#SBATCH -p gpu
#SBATCH --mem=0MB
#SBATCH --constraint=v100

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

/gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname ai-rmlcryoprd1.niaid.nih.gov --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/job.log 2>&1
==========================================================================
==========================================================================
-------- Submission command: sbatch /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/queue_sub_script.sh
Failed to launch! 1

We looked into the queue_sub_script.sh and saw this:

#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname ai-rmlcryoprd1.niaid.nih.gov --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/job.log 2>&1             - the complete command string to run the job
## 0            - the number of CPUs needed
## 1            - the number of GPUs needed. 
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## 0.0             - the amount of RAM needed in GB
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2        - absolute path to the job directory
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2    - absolute path to the project dir
## /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/job.log   - absolute path to the log file for the job
## /gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## --project P2 --job J2 --master_hostname ai-rmlcryoprd1.niaid.nih.gov --master_command_core_port 39002           - arguments to be passed to cryosparcw run
## P2        - uid of the project
## J2            - uid of the job
## Bryan Hansen        - name of the user that created the job (may contain spaces)
## hansenbry@niaid.nih.gov - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
 
#SBATCH --job-name cryosparc_P2_J2
#SBATCH -n 0
#SBATCH --gres=gpu:1
#SBATCH -p gpu
#SBATCH --mem=0MB             
#SBATCH --constraint=v100
 
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
 
/gs1/RTS/EM/Software/CryoSPARCv2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname ai-rmlcryoprd1.niaid.nih.gov --master_command_core_port 39002 > /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/job.log 2>&1 

Our question is: where in the GUI or configuration for cryoSPARC Live do we set the CPU, GPU, and memory values? The GPU value is always 1 no matter what is selected in the UI, and we think the values of 0 for CPU and memory are the source of the original error.

Thanks for any tips/input that can help us.


#2

Hey @hansenbry,

First, could you let us know what job type you were trying to run? It’s definitely strange that CPU and MEM were set to 0; those values are set by the job type, and they’re usually tied to the number of GPUs you request. Also, this is a new instance, right? Was the SLURM integration working before?


#3

Hi @sarulthasan
This was a cryoSPARC Live job. The system was working fine on the standard cryoSPARC side during our initial testing, but we saw this behavior as soon as we started trying the Live side of the program.


#4

Hi @hansenbry,

Thank you for clarifying. You’re correct, this is a bug: it turns out those values aren’t properly set for the RTP worker, so they come through as 0. I will create a fix for this bug and get it out in the next release. In the meantime, if you’d like, you can hard-code default or “minimum” values for the CPU and MEM variables. For example, you can set the CPU variable such that CPU = max(2, num_cpus).
You can also isolate this modified cluster submission script so that it’s only used for cryoSPARC Live by connecting it to cryoSPARC again under a different name. This will put it in a different lane, which you can explicitly select on a session’s configuration page.
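To sketch what that workaround could look like (a non-authoritative example: the lane name `bs2-live` is a placeholder, and the guarded directives assume Jinja's list `max` filter is available in your cryoSPARC version):

```shell
# On the master node: dump the example SLURM template files,
# then edit and register them as a separate lane just for Live.
cryosparcm cluster example slurm    # writes cluster_info.json + cluster_script.sh

# In cluster_info.json, change "name" (e.g. to "bs2-live") so the modified
# script lands in its own lane instead of replacing the existing "bs2" lane.

# In cluster_script.sh, guard the zeroed values with minimums, e.g.:
#   {% set cpus = [num_cpu, 2]|max %}
#   #SBATCH -n {{ cpus }}

# Register the new lane from the two files in the current directory:
cryosparcm cluster connect
```

With the extra lane in place, regular jobs can keep using the original `bs2` lane while Live sessions select the guarded one.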
Please let me know if you have any questions.


#5

@sarulthasan,

Thanks so much! Glad we weren’t losing our minds :smiley: . We’ll keep an eye out for the update to Live for this fix.


#6

Hey @hansenbry,

To deal with the case where the num_cpu and ram_gb variables are 0, you can modify your cluster submission script to include the following:

#SBATCH --cpus-per-task={%if num_cpu == 0%}1{%else%}{{ num_cpu }}{%endif%}

##{%- if ram_gb == 0.0 -%}{% set ram_gb_new = 4.0 %}{%- else -%}{% set ram_gb_new = ram_gb %}{%- endif -%}
#SBATCH --mem={{ (ram_gb_new*1024)|int }}MB
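(The `##` prefix keeps the `{% set %}` line inert in the rendered script, while Jinja still evaluates it when the template is rendered.) One way to sanity-check the result, as a sketch using the job path from the logs above, is to ask SLURM to validate the generated script without actually submitting it:

```shell
# --test-only makes sbatch parse the #SBATCH directives and report
# whether the request could be scheduled, without queuing a job.
sbatch --test-only /gs1/RTS/EM/Processing/marcotrigianoj2-2020/P2/J2/queue_sub_script.sh
```

If the rendered directives still request 0 CPUs or 0 MB of memory, this should surface the same failure as the original `non-zero exit status 1` without launching anything.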