Hi all,
I am attempting to install cryoSPARC on my local cluster, which uses a SLURM scheduler. I have successfully launched the UI and am able to upload images. However, when I try to do anything involving GPUs, the jobs fail. When I check the instance information, I see: Cores ? Memory ? GPUs ?
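(If it helps with diagnosis, I believe the same target information can also be dumped on the command line with the call below; I have only seen it used on this forum, so treat the exact name as an assumption on my part.)
cryosparcm cli "get_scheduler_targets()"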
My install method was the following:
In my personal folder on the cluster, I downloaded all the relevant files into a cryosparc folder, then:
exported the license ID
curl'd the .tar.gz files
tar -xf'd the .tar.gz files
in cryosparc_master, ran ./install.sh with a hostname that does launch the UI. However, I am unsure this is the correct hostname: I have seen three variations of my hostname and have tried them all, but only one seems to launch the UI. Any insight into which hostname to use? Should it be the FQDN?
During the install, I answered 1, 1 in response to the yes/no questions. Then I ran ./bin/cryosparcm start. (The whole master-side sequence, roughly, is below.)
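For reference, this is approximately what I ran on the master side, reconstructed from memory; the download URLs are the ones I took from the install guide, and the hostname is a placeholder rather than my real value:
export LICENSE_ID="<my license id>"
curl -L https://get.cryosparc.com/download/master-latest/$LICENSE_ID -o cryosparc_master.tar.gz
curl -L https://get.cryosparc.com/download/worker-latest/$LICENSE_ID -o cryosparc_worker.tar.gz
tar -xf cryosparc_master.tar.gz
tar -xf cryosparc_worker.tar.gz
cd cryosparc_master
./install.sh --license $LICENSE_ID --hostname <hostname>   # answered the remaining prompts interactively
./bin/cryosparcm start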
In order to install the worker package, I had to salloc -p gpu and then ssh into my GPU node. I do not have to enter a password, so I did not set up any passwordless SSH.
Once SSH'd into my GPU node, I navigated to cryosparc_worker and had to export the license again. Then I ran ./install.sh --license $LICENSE_ID --cudapath /usr/local/cuda-11.2
After this, I was able to run ./bin/cryosparcw gpulist and could see all of the GPUs.
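Roughly, the worker-side steps looked like this (the node name is a placeholder, not my real one):
salloc -p gpu                  # get an allocation on the GPU partition
ssh <gpu-node>                 # hop onto the allocated GPU node
cd /home/<user>/cryosparc/cryosparc_worker
export LICENSE_ID="<my license id>"
./install.sh --license $LICENSE_ID --cudapath /usr/local/cuda-11.2
./bin/cryosparcw gpulist       # listed all of the GPUs as expected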
I then went back to my master node and created a user (roughly as sketched below). I also updated my cluster_script.sh and cluster_info.json files, which follow.
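The user was created with cryosparcm createuser, something like the following (details redacted; the exact flag names are from the docs as I remember them, so they may differ by version):
./bin/cryosparcm createuser --email "<email>" --password "<password>" --username "<username>" --firstname "<first>" --lastname "<last>"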
cluster_script.sh
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}
#SBATCH -e {{ job_dir_abs }}
{{ run_cmd }}
cluster_info.json
{
    "name" : "soroban",
    "worker_bin_path" : "/home/renee.arias/cryosparc/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "",
    "send_cmd_tpl" : "ssh loginnode {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo"
}
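As far as I understand from the guide, these two files only take effect once they are loaded into the master, which I believe is done with something like the following, run from the directory containing them (treat the exact invocation as my assumption):
./bin/cryosparcm cluster connect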
When I try to run Patch Motion Correction, I get a "failed to launch" message every time, and no error or output files are generated.
Any ideas?