Running cryoSPARC2 in SGE cluster mode (gridengine)

#1

Dear All,

My cryoSPARC 2.5.0 installation won't save the gridengine templates (`cryosparcm cluster example gridengine`) for a cluster install (the pbs and slurm options work fine). Without a gridengine template, fine-tuning cryoSPARC for SGE queueing is challenging (I had to improvise!). With the configuration I came up with (see below), jobs run only on the master node (I did run `cryosparcm cluster connect`), bypassing the SGE queue completely, even though I added a GPU node with 8 K80 devices and it shows up in the lane list. What am I missing to make cryoSPARC run its calculations on the GPU node through the SGE/UGE gridengine scheduler? Please let me know.

» more cluster_info.json
{
    "name" : "sgecluster",
    "worker_bin_path" : "/nethome/appbuild/eb_files/cryoSPARC/cryosparc2_worker/bin/cryosparcw",
    "cache_path" : "/hpcdata/scratch",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "qsub -l gpu {{ script_path_abs }}",
    "qstat_cmd_tpl" : "qstat -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "qdel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "qstat -f",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}
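For anyone unfamiliar with how these fields are used: cryoSPARC fills in the `{{ }}` placeholders before executing each command. A minimal illustration of that substitution (not cryoSPARC's actual code, which uses a real template engine; the path value here is made up):

```python
import re

def render(template, variables):
    """Replace each {{ name }} placeholder with its value from `variables`."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables[m.group(1)]),
                  template)

# Example: the qsub template from cluster_info.json with a made-up script path.
qsub_cmd_tpl = "qsub -l gpu {{ script_path_abs }}"
print(render(qsub_cmd_tpl, {"script_path_abs": "/tmp/P3_J4/queue_sub_script.sh"}))
# qsub -l gpu /tmp/P3_J4/queue_sub_script.sh
```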
» more cluster_script.sh
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SGE by Sergei
## Available variables:
## {{ run_cmd }}            - the complete command string to run the job
## {{ num_cpu }}            - the number of CPUs needed
## {{ num_gpu }}            - the number of GPUs needed.
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## {{ ram_gb }}             - the amount of RAM needed in GB
## {{ job_dir_abs }}        - absolute path to the job directory
## {{ project_dir_abs }}    - absolute path to the project dir
## {{ job_log_path_abs }}   - absolute path to the log file for the job
## {{ worker_bin_path }}    - absolute path to the cryosparc worker command
## {{ run_args }}           - arguments to be passed to cryosparcw run
## {{ project_uid }}        - uid of the project
## {{ job_uid }}            - uid of the job
## {{ job_creator }}        - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SGE script:

#$ -N cryosparc_{{ project_uid }}_{{ job_uid }}
#$ -V
#$ -pe threaded {{ num_cpu }}
#$ -cwd
#$ -l m_mem_free={{ ram_gb }}G
#$ -o {{ job_log_path_abs }}
#$ -e {{ job_log_path_abs }}
#$ -S /bin/bash

available_devs=""
for devidx in $(seq 0 7);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

{{ run_cmd }}
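For reference, the device-selection loop above can be mirrored in Python. This sketch takes pre-collected `nvidia-smi --query-compute-apps=pid` output per device rather than querying real GPUs, and numbers devices from 0 as nvidia-smi does:

```python
def free_gpus(per_device_pids, num_devices=8):
    """Return a CUDA_VISIBLE_DEVICES string listing devices with no compute apps.

    per_device_pids maps device index -> the csv,noheader pid output for that
    device (empty string means the GPU is idle). Sample data only; no real
    GPUs are queried here.
    """
    idle = [str(i) for i in range(num_devices)
            if not per_device_pids.get(i, "").strip()]
    return ",".join(idle)

# Devices 0 and 3 have running compute apps; the rest are idle:
sample = {0: "12345\n", 3: "23456\n"}
print(free_gpus(sample))  # 1,2,4,5,6,7
```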

Sergei


#2

Hi @ponomarevsy,

Thanks for posting - do you get any errors or information in the job streamlog when you actually try to launch a job via the SGE cluster? You can also check
`cryosparcm log command_core`
right after you try to launch a job to see if there is a traceback of what is going wrong.

Reading through, I can’t spot any obvious issues in your setup or template submission script.
Could it be that for some reason the qsub etc. commands are not in the PATH for the cryosparc user?
You can use
`cryosparcm icli`
and within the cryosparc shell,
`cli.verify_cluster('lanename')`
which will try to execute the `qinfo_cmd_tpl` to check that the cluster is connected.
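For intuition, here is a rough stand-in for what a check like `verify_cluster` presumably does under the hood: render the `qinfo_cmd_tpl` and run it, returning its output. This is assumed behavior for illustration only, not cryoSPARC source; the example uses `echo` as a harmless substitute for `qstat`:

```python
import shlex
import subprocess

def verify_cluster(qinfo_cmd_tpl, variables=None):
    """Render the qinfo command template and run it, returning its stdout.

    Illustrative stand-in for cli.verify_cluster(), not its actual source.
    """
    cmd = qinfo_cmd_tpl
    for name, value in (variables or {}).items():
        cmd = cmd.replace("{{ %s }}" % name, str(value))
    result = subprocess.run(shlex.split(cmd),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Using `echo` as a harmless stand-in for `qstat -f`:
print(verify_cluster("echo cluster-ok"))
```

If the real command is missing from the cryosparc user's PATH, `subprocess` here would raise an error, which matches the PATH theory above.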


#3

Thanks for your reply, @apunjani. The cli command (in my case: `In [2]: cli.verify_cluster('sgecluster')`) seems to work, as I can see the result of the qstat command, showing users in the queue. Last time I checked, cryoSPARC was running on the submit node in OpenMP mode (on 4 CPUs, 400% load in top) and not using the GPU node at all, or at least I could not see it doing so in the nvidia-smi output… I will run some tests again today and let you know what is happening.

Hmm… it seems to be using our submit node as the worker; it does not attempt to connect to a GPU node:

GUI output:

License is valid.

Running job on master node

Project P3 Job J4 Started

Master running v2.5.0, worker running v2.5.0

Running on lane sgecluster

Resources allocated: 

  Worker:  ai-submit2

--------------------------------------------------------------

Importing job module for job type import_micrographs...

Job ready to run

***************************************************************

Importing movies from /test/username/cryoSPARC2/Projects/empiar_10025_subset/*.tif

Importing 20 files

Import paths were unique at level -1

Importing 20 files

Reading headers of each input file...

Traceback (most recent call last):
  File "cryosparc2_master/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
  File "cryosparc2_compute/jobs/imports/run.py", line 481, in run_import_movies_or_micrographs
    assert shape[0] == 1, "Data file %s has more than 1 frame - import as movie instead" % (abs_path)
AssertionError: Data file /test/usrname/cryoSPARC2/Projects/empiar_10025_subset/14sep05c_00024sq_00003hl_00002es.frames.tif has more than 1 frame - import as movie instead

$ cryosparcm log command_core
COMMAND CORE STARTED ===  2019-05-09 11:40:22.635998  ==========================
*** BG WORKER START
*** LINK WORKER START
accel_kv number
Setting parameter J4.accel_kv with value 300 of type <type 'int'>
blob_paths path
Setting parameter J4.blob_paths with value /test/username/cryoSPARC2/Projects/empiar_10025_subset/*.tif of type <type 'str'>
total_dose_e_per_A2 number
Setting parameter J4.total_dose_e_per_A2 with value 53 of type <type 'int'>
psize_A number
Setting parameter J4.psize_A with value 0.6575 of type <type 'float'>
cs_mm number
Setting parameter J4.cs_mm with value 2.7 of type <type 'float'>
---------- Scheduler running ---------------
Lane  sgecluster cluster : Jobs Queued (nonpaused, inputs ready):  [u'J4']
Now trying to schedule J4
  Need slots :  {}
  Need fixed :  {}
  Need licen :  False
  Master direct :  True
---- Running project UID P3 job UID J4
failed to connect link
---------- Scheduler done ------------------
Changed job P3.J4 status started
Changed job P3.J4 status running
Changed job P3.J4 status failed
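Two lines in that log are worth noting: `Need slots :  {}` and `Master direct :  True` indicate the scheduler decided this job needs no compute slots and can run directly on the master, bypassing the cluster lane entirely. A minimal sketch of that kind of dispatch decision (purely illustrative, not cryoSPARC's scheduler code):

```python
def dispatch(need_slots, master_direct):
    """Illustrative scheduling decision: a job that needs no slots and is
    flagged master-direct runs on the master node; otherwise it is submitted
    to the cluster lane."""
    if master_direct and not need_slots:
        return "master"
    return "cluster"

print(dispatch(need_slots={}, master_direct=True))            # master
print(dispatch(need_slots={"GPU": 1}, master_direct=False))   # cluster
```

This would explain why an import job shows up on the submit node even with the cluster lane selected, independently of the SGE configuration.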

#4

Failing to write these templates is still an issue in 2.8.0 as well:

$ cryosparcm cluster example gridengine
Writing example cluster_info.json and cluster_script.sh to current dir
Unknown cluster type. Supported templates are:
  pbs
  slurm
  gridengine
Any cluster scheduler is supported, but you may have to write your own custom submission script.
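Until that's fixed, one workaround is to write the two files by hand. A sketch that drops a minimal `cluster_info.json` and `cluster_script.sh` into the current directory; every path and resource flag below is a placeholder to edit for your own site:

```python
import json
import pathlib

# Placeholder values -- edit the paths and resource flags for your own site.
cluster_info = {
    "name": "sgecluster",
    "worker_bin_path": "/path/to/cryosparc2_worker/bin/cryosparcw",
    "cache_path": "/path/to/scratch",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "qsub -l gpu {{ script_path_abs }}",
    "qstat_cmd_tpl": "qstat -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "qdel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "qstat -f",
}

cluster_script = """#!/usr/bin/env bash
#$ -N cryosparc_{{ project_uid }}_{{ job_uid }}
#$ -V
#$ -cwd
#$ -pe threaded {{ num_cpu }}
#$ -o {{ job_log_path_abs }}
#$ -e {{ job_log_path_abs }}

{{ run_cmd }}
"""

pathlib.Path("cluster_info.json").write_text(json.dumps(cluster_info, indent=4))
pathlib.Path("cluster_script.sh").write_text(cluster_script)
```

After editing the two files, register them with `cryosparcm cluster connect` from that directory, as in the original post.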