Install on shared GPU cluster (h4h, Toronto, Canada)

I’m working with Zhibin from the Bioinformatics and HPC Core at UHN (Toronto, Canada) to see if we can install CryoSPARC on h4h.

  • info on UHN’s h4h SLURM cluster: !AinqViGDqMQDiFsDGm9YthNkfpvD
  • h4h has one GPU node with 3 GPUs for general use
  • cannot set up HTTP access
  • no SSH to cluster compute nodes
  • cannot have a web server running all the time on a cluster node, because users would not be able to access it
  • software is not allowed to schedule jobs unless it goes through the cluster’s scheduler
  • cluster IT staff cannot allow a webapp/server to run permanently on the cluster, and they will not open the port to the internet; cluster users have to submit jobs to the cluster themselves
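For context, these constraints fit CryoSPARC’s cluster integration model: instead of running workers directly, the master renders a submission-script template and hands the job to the cluster scheduler. The sketch below shows what an h4h SLURM template (`cluster_script.sh`) might look like; the partition name is a placeholder, and the `{{ ... }}` variables are the ones CryoSPARC substitutes at submission time (per its cluster integration docs).

```shell
#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu                  # placeholder: h4h's actual GPU partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_dir_abs }}/slurm.out
#SBATCH --error={{ job_dir_abs }}/slurm.err

# CryoSPARC fills in the full worker command here
{{ run_cmd }}
```

The template is registered as a “lane” with `cryosparcm cluster connect`, run from a directory containing this script alongside a `cluster_info.json` that defines the lane name and the `sbatch`/`squeue`/`scancel` command templates.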

Hey @Geoffrey,

Submitting to a SLURM cluster is fully supported. The only understandable point of confusion here is where to run the master node, since you aren’t able to run processes on the login node.

Technically, you should be able to run the master node on your own laptop (and submit jobs to the cluster from there); the main requirement is that the storage layer is shared between the master and worker nodes. One option is to mount the group directory (/cluster/projects) onto your laptop, but I doubt the cluster admins would allow that since it’s not even mounted on the login node. Another option would be to set up some sort of “edge node” on the same network as the cluster: this would just be another VM that can submit to the cluster, but again shares the storage layer. Either way, it’s probably worth talking to the cluster admins in this case.
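The shared-storage requirement above can be sanity-checked before committing to a layout. This is a rough sketch, not a CryoSPARC tool: write a marker file from the master and confirm the worker sees it at the identical absolute path. `SHARED` is a placeholder for the real shared project directory (e.g. somewhere under /cluster/projects); here it defaults to a temporary directory so the script runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch: verify master and worker see the same file at the same absolute path.
# SHARED is a placeholder for the real shared project directory.
SHARED="${SHARED:-/tmp/shared-demo}"

mkdir -p "$SHARED"

# Step 1 (run on the master): drop a marker file.
marker="$SHARED/.cryosparc_storage_check"
echo "written $(date -u) on $(hostname)" > "$marker"

# Step 2 (run on a worker/compute node): the file must exist at the SAME path.
if [ -f "$marker" ]; then
    echo "shared storage OK: $marker"
else
    echo "shared storage MISSING: $marker" >&2
    exit 1
fi
```

In practice step 2 would run inside a submitted cluster job rather than in the same shell; if the path differs between the two hosts, project directories and caches will break.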

We got a CryoSPARC instance working. It can be accessed remotely once UHN’s VPN is enabled.

I’m doing some tests now. Right now I can’t use the SSD… if our cluster even has one, that is. Forgive my ignorance, but do clusters usually have SSDs?

If the user submits a job without selecting the h4h lane (see screenshot below), the job appears as queued in the workspace but does not appear in the resource manager, and the job output is completely blank. It’s an easy mistake to make.

No lane selected: the job stays queued but never runs, and does not appear in the resource manager.

Versus selecting the Lane

Hey @Geoffrey,

Thanks for reporting the Queue modal quirk; we found the bug and we’re working on a fix!
Also, yes, cluster worker nodes sometimes have local SSDs mounted on them. You should check with the cluster admins to see if one is available. You can dynamically create the SSD path in your cluster configuration by adding export CRYOSPARC_SSD_PATH="<function that creates path to ssd>".
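A concrete sketch of that idea, assuming the node-local SSD is mounted at some base path (`/tmp` is used here purely as a stand-in) and that jobs run under SLURM, which sets `SLURM_JOB_ID` inside each job:

```shell
#!/usr/bin/env bash
# Sketch: create a per-job cache directory on node-local scratch and point
# CryoSPARC at it. SSD_BASE is a placeholder; on a real cluster it would be
# wherever the node-local SSD is mounted.
SSD_BASE="${SSD_BASE:-/tmp}"
JOB_ID="${SLURM_JOB_ID:-manual}"   # SLURM sets SLURM_JOB_ID inside a job

export CRYOSPARC_SSD_PATH="$SSD_BASE/cryosparc_cache_$JOB_ID"
mkdir -p "$CRYOSPARC_SSD_PATH"
echo "SSD cache path: $CRYOSPARC_SSD_PATH"
```

Keeping the cache per-job avoids collisions when multiple jobs land on the same node; the directory would typically also be cleaned up when the job ends.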

It turns out we don’t have an SSD on the cluster. Maybe we can get one…

After some initial discussions it seems there is some reluctance to add an SSD. I’m not sure if it’s a policy issue (money, maintenance, priority) or a feasibility issue.

Would clusters typically be able to have an SSD added? Is it difficult to do?

Here’s an info sheet on the cluster: !AinqViGDqMQDiFsDGm9YthNkfpvD