Installation onto SLURM cluster: no GPUs, no cores, no memory shown, "failed to launch" error

Hi all,
I am attempting to install cryoSPARC on my local cluster, which uses the SLURM scheduler. I have successfully launched the UI and am able to upload images. However, when I try to do anything involving GPUs, the jobs fail. When I check the instance information, I see: Cores ? Memory ? GPUs ?

My method for install was the following:

In my personal folder on the cluster, I downloaded all the relevant files into a cryosparc folder, then:

export license
curl the .gz files
tar -xf the .gz files
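Concretely, the download and unpack steps were roughly the following (license ID redacted; the download URLs follow the pattern in the cryoSPARC guide and may differ by version):

export LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # redacted

# download the master and worker packages (URL pattern from the cryoSPARC guide)
curl -L https://get.cryosparc.com/download/master-latest/$LICENSE_ID -o cryosparc_master.tar.gz
curl -L https://get.cryosparc.com/download/worker-latest/$LICENSE_ID -o cryosparc_worker.tar.gz

# unpack both archives
tar -xf cryosparc_master.tar.gz
tar -xf cryosparc_worker.tar.gz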

In cryosparc_master, I ran ./install.sh with a hostname that does launch the UI. However, I am unsure this is the correct hostname. I have seen three variations of my hostname and have tried them all, but only one seems to launch the UI. Any insight into which hostname to use? Is it the FQDN?

Upon install, I answered 1, 1 in response to the yes/no questions. Then I ran ./bin/cryosparcm start
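For reference, the master install was along these lines (I answered the interactive prompts, but the equivalent flags would look roughly like this; the hostname is a placeholder, and it is the value I am unsure about):

cd ~/cryosparc/cryosparc_master
export LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # redacted

# hostname is a placeholder; database path as reported further below
./install.sh --license $LICENSE_ID \
    --hostname <master-hostname> \
    --dbpath ~/cryosparc/cryosparc_database \
    --port 39000

./bin/cryosparcm start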

In order to install the worker package, I had to salloc -p gpu, then ssh into my GPU node. I do not have to enter a password, so I did not set up any passwordless SSH.

Once SSH'd into my GPU node, I navigated to cryosparc_worker and had to export the license again. Then I ran ./install.sh --license $LICENSE_ID --cudapath /usr/local/cuda-11.2

After this, I was able to run ./bin/cryosparcw gpulist and could see all of the GPUs.
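Putting the worker-side steps together, they were roughly:

salloc -p gpu                      # get an allocation on the gpu partition
ssh <gpu-node>                     # hop onto the allocated GPU node (no password required)

cd ~/cryosparc/cryosparc_worker
export LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # redacted
./install.sh --license $LICENSE_ID --cudapath /usr/local/cuda-11.2
./bin/cryosparcw gpulist           # lists all of the GPUs on the node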

I then went back to my master node and created a user. Here, I also updated my cluster_script.sh and cluster_info.json files.

cluster_script.sh

#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }}            - the complete command string to run the job
## {{ num_cpu }}            - the number of CPUs needed
## {{ num_gpu }}            - the number of GPUs needed. 
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## {{ ram_gb }}             - the amount of RAM needed in GB
## {{ job_dir_abs }}        - absolute path to the job directory
## {{ project_dir_abs }}    - absolute path to the project dir
## {{ job_log_path_abs }}   - absolute path to the log file for the job
## {{ worker_bin_path }}    - absolute path to the cryosparc worker command
## {{ run_args }}           - arguments to be passed to cryosparcw run
## {{ project_uid }}        - uid of the project
## {{ job_uid }}            - uid of the job
## {{ job_creator }}        - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*1000)|int }}MB             
#SBATCH -o {{ job_dir_abs }}
#SBATCH -e {{ job_dir_abs }}

{{ run_cmd }}

cluster_info.json

{
    "name" : "soroban",
    "worker_bin_path" : "/home/renee.arias/cryosparc/cryosparc_worker/bin/cryosparcw", 
    "cache_path" : "",
    "send_cmd_tpl" : "ssh loginnode {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo"
}

When I try to perform patch motion correction, I get a "failed to launch" message every time. No error or output files are generated.

Any ideas?

@renee One requirement is that the cryoSPARC master hostname can be resolved appropriately by the cryoSPARC worker(s).
Does the output of the command
host <your-chosen-cryosparc-master-hostname>
(executed on the worker(s))
point to an IP address uniquely associated with the cryoSPARC master?
Can you
ping <your-chosen-cryosparc-master-hostname>
from the worker(s)?
Also, I did not see any mention of cryosparcm cluster connect.

Apologies, I did run cryosparcm cluster connect from the master with the previously mentioned cluster_script.sh and cluster_info.json files. It seemed to execute successfully.
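For completeness, the connect step looks like this (the directory path is a placeholder; the sanity-check command at the end is optional):

# run from the directory that contains cluster_info.json and cluster_script.sh
cd <directory-containing-cluster_info.json-and-cluster_script.sh>
cryosparcm cluster connect

# optional sanity check: list the scheduler targets, which should include the "soroban" lane
cryosparcm cli "get_scheduler_targets()"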

I just logged into the worker and tried host my-master-hostname and got:
host: command not found

Then I tried ping my-master-hostname and got:
ping: socket: Operation not permitted

I also tried to ping the other possible hostnames that I have tried, and get the same error.

I don’t know if this info helps, but when I do a cryosparcm status from the worker, I get:

unix:///tmp/cryosparc-supervisor-f4c04734f78e8db15d0aa4bd2e96235c.sock refused connection

Interesting. Only on a (defunct) cryoSPARC master would I expect this type of message from cryosparcm status. Did this worker serve as a cryoSPARC master in the past? cryosparcm status is intended to be run on the master only.

@renee Has successful SLURM job submission and execution been confirmed on this cluster for non-cryoSPARC jobs?

A more informative way of confirming the required connectivity from the worker to the master would be:
telnet <your-chosen-cryosparc-master-hostname> <mongo-port>
<mongo-port> can be obtained by adding 1 to CRYOSPARC_BASE_PORT (found in cryosparc_master/config.sh).
(The host and telnet commands might not yet have been installed on the worker.)
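If telnet is available on the worker, the check would look roughly like this, assuming the default base port of 39000 (so the database listens on 39001):

# on the master: confirm the base port
grep CRYOSPARC_BASE_PORT /path/to/cryosparc_master/config.sh
# export CRYOSPARC_BASE_PORT=39000

# on the worker: test the database port (base port + 1)
telnet <your-chosen-cryosparc-master-hostname> 39001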

I have never successfully installed cryoSPARC on this cluster. When I run the same command from the master node, I get a normal status.

(base) [renee.lastname@server ~]$ cryosparcm status

CryoSPARC System master node installed at
/home/renee.lastname/cryosparc/cryosparc_master
Current cryoSPARC version: v3.3.1

CryoSPARC process status:

app RUNNING pid 114900, uptime 3 days, 2:01:43
app_dev STOPPED Not started
command_core RUNNING pid 114582, uptime 3 days, 2:01:54
command_rtp RUNNING pid 114726, uptime 3 days, 2:01:50
command_vis RUNNING pid 114710, uptime 3 days, 2:01:52
database RUNNING pid 114448, uptime 3 days, 2:01:57
liveapp STOPPED Not started
liveapp_dev STOPPED Not started
webapp RUNNING pid 114867, uptime 3 days, 2:01:44
webapp_dev STOPPED Not started


License is valid

global config variables:

export CRYOSPARC_LICENSE_ID="license"
export CRYOSPARC_MASTER_HOSTNAME="fqdn.server.com"
export CRYOSPARC_DB_PATH="/home/renee.lastname/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_FORCE_HOSTNAME=true

Unfortunately I do not have root, so I cannot install telnet to perform this operation. When I run hostname --nis, I get:

hostname: Local domain name not set.

Would this impede my ability to communicate between master and worker nodes?

@renee As you do not have full control over the cluster, it may be best to contact the IT staff that supports the cluster. They may help you ensure that

  1. worker nodes can resolve the address and access the necessary network ports of the cryoSPARC master node.
  2. the cryoSPARC-specific cluster configuration scripts are appropriately configured for your particular cluster (a rough SLURM template sketch follows below).
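For point 2, and only as a rough sketch (your IT staff will know the correct partition name, GPU request syntax and resource limits for your site), a SLURM submission template for cryoSPARC often looks closer to the following, with an explicit GPU request and the job output/error redirected to files rather than to the job directory itself:

#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM (sketch only)

#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/slurm-%j.out
#SBATCH -e {{ job_dir_abs }}/slurm-%j.err

{{ run_cmd }}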