Queue message : GPU not available

Hello,

I am new to the program and am trying to run the T20S tutorial. I have three machines with 1 GPU each. I got to the motion correction step and set it to use 3 GPUs (1 on each machine). Unfortunately, the job stays queued permanently:

---------- Scheduler running --------------- 
Jobs Queued:  [('P1', 'J13')]
Licenses currently active : 0
Now trying to schedule J13
  Need slots :  {'CPU': 18, 'GPU': 3, 'RAM': 6}
  Need fixed :  {'SSD': False}
  Master direct :  False
    Queue status : waiting_resources
    Queue message : GPU not available

Since all workers are the same, I have attached the details for one of them from Resource Manager/Instance Information:

Target 3: gr3 node

Cores 16
Memory 64 GB 
GPUs 1
Worker bin path /usr/local/cryoSPARC/cryosparc_worker/bin/cryosparcw
Hostname gri13
Name gr13
Cache path /data-em/cryos/temp
Cache quota (MB)
SSH String cryos@gr13
Cache Reserve (MB) 10000

Anyway, I am confused about why this failed.

Dear @ketiwsim,
Patch Motion Correction and several other jobs in cryoSPARC can be parallelized across GPUs within a single node/workstation but not across multiple nodes. If you clear and re-queue the job requesting 1 GPU, it should run. Thanks!
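
To illustrate why the request above can never be placed, here is a minimal sketch of a "must fit on a single target" check. This is not cryoSPARC's actual scheduler code. The worker names, the RAM slot count per worker, and the resource numbers for the 1-GPU request are assumptions made for the example; only the 3-GPU request is taken from the scheduler log.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    slots: dict  # resources available on this one machine


def targets_that_fit(need, targets):
    """Return the names of targets that can satisfy the whole request alone."""
    return [
        t.name
        for t in targets
        if all(t.slots.get(kind, 0) >= amount for kind, amount in need.items())
    ]


# Three identical single-GPU workers (names and RAM slot count are hypothetical).
workers = [
    Target(name, {"CPU": 16, "GPU": 1, "RAM": 8})
    for name in ("worker1", "worker2", "worker3")
]

three_gpu_request = {"CPU": 18, "GPU": 3, "RAM": 6}  # from the scheduler log
one_gpu_request = {"CPU": 6, "GPU": 1, "RAM": 2}     # assumed: one third of the above

print(targets_that_fit(three_gpu_request, workers))  # no single machine qualifies
print(targets_that_fit(one_gpu_request, workers))    # every machine qualifies
```

Resources are never summed across machines in this sketch, so the 3-GPU request matches no target even though three GPUs exist in total.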

Dear Spunjani,

Thank you for your explanation.
Could you specify which jobs can use multiple GPUs/CPUs across all workers?
2D Classification also does not seem to accept more than one GPU.
Best regards,
Ketiw

Hi @ketiwsim, any cryoSPARC job that includes the “Number of GPUs to parallelize” parameter can use multiple GPUs on the same machine. This includes the following jobs:

  • All motion correction jobs
  • Patch CTF
  • 2D Classification
  • Topaz Train and Cross Validation
  • Deep Picker Train and Inference

However, as @spunjani mentioned, jobs can only use multiple CPUs and GPUs installed on the same machine. A job cannot be distributed across multiple machines.

Hi @frasser,

I also have a similar situation with the same response.
Does cryoSPARC support submitting jobs only through the workload manager (sbatch for SLURM or qsub for PBS) from the login node of an HPC facility? The submission commands (qsub or sbatch) on the login node are the only way to reach the GPU nodes within the facility. The GPU nodes are not directly accessible from the login node… (meaning that if I install the cryoSPARC master on the login node, it cannot communicate with the GPU nodes directly unless I wrap cryoSPARC in a qsub script and submit it as a job…)

Does this mean I may need to use the interactive mode of qsub and then run cryosparcw on the login node?

Thank you!

Hi @qitsweauca, cryoSPARC has extensive support for a cluster installation, whether it’s SLURM, PBS or something else. Once you’ve installed the master node, you can connect it to a cluster by following these instructions.

When setting up a cluster, you don’t need to know ahead of time what GPUs are available. When a job requires multiple GPUs, the number of required GPUs is made available as a variable in the cluster submission script template, for use with qsub or whichever submission command your cluster uses. So you should not need qsub’s interactive mode once you’ve set this up.
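
As a rough illustration of how a per-job GPU count can end up in a submission script, here is a minimal sketch that fills `{{ name }}` placeholders in a PBS-style template. The variable names (`num_gpu`, `num_cpu`, `ram_gb`, `project_uid`, `job_uid`, `run_cmd`) follow cryoSPARC's cluster script template as I understand it, so check them against the documentation for your version. The `#PBS -l select=...` line and the `run_cmd` value are assumptions that depend on your site, and the tiny `render` helper is a stand-in written for this example, not cryoSPARC's own template engine.

```python
import re

# Hypothetical PBS template; the resource line syntax varies between sites.
TEMPLATE = """#!/bin/bash
#PBS -N cryosparc_{{ project_uid }}_{{ job_uid }}
#PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ ram_gb }}gb
{{ run_cmd }}
"""


def render(template, values):
    """Replace each {{ name }} placeholder; raises KeyError for unknown names."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


script = render(
    TEMPLATE,
    {
        "project_uid": "P1",
        "job_uid": "J13",
        "num_cpu": 6,
        "num_gpu": 1,
        "ram_gb": 16,
        "run_cmd": "echo 'placeholder for the command supplied by the master'",
    },
)
print(script)
```

The point is that the template is written once, and the GPU count is filled in per job at submission time, so nothing about the GPU nodes needs to be known in advance.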

Hope that answers your questions, let me know if there’s anything that I can further clarify or elaborate on.

Hi @nfrasser,

Although it doesn’t need to know what GPUs are available, it seems that it would still require direct access to the cluster nodes in order to install cryosparc_worker on them. Am I correct?

My environment is a shared facility that only allows access to a login node and job submission via PBS, so it is quite restricted.

Are there any ways to package the master-worker application as a stand-alone application/shell binaries to be submitted for job compute?

I assume cryoSPARC is based on the Flask framework; are its RESTful APIs available to be used directly?

Thank you.

The cryosparc_worker package does need to be installed on the cluster machines, but you can also do this on the shared file system that gets mounted on all machines (as per cryoSPARC’s requirements). The only constraint with this is that the worker needs to be installed on a machine with GPUs so that cryoSPARC’s compute kernels get compiled against the correct version of CUDA and Nvidia drivers.

Since you don’t have direct access to the cluster environment, I’d suggest writing an installation script based on our instructions. The script should select the shared bulk storage as the installation directory, e.g., your /home directory on the cluster or wherever else results are saved. Then submit that script to the cluster. Would this solution work for you?

Unfortunately we don’t provide a way to package a cryoSPARC installation as standalone binaries, though we have seen other facilities accomplish this with Singularity or Docker.

Yes, cryoSPARC’s master package runs a Flask- and JSON-RPC- based API, but this cannot be used to do the actual execution of GPU-based jobs. Instead, the Flask server runs a scheduler which either submits the jobs to the cluster or (in a traditional master/worker setup) remotes into the worker machine to execute commands.

The worker does not provide a Flask server and functions as a client of the master server. What precisely did you have in mind with regard to using this system as a workaround?

But then, what if I instead install the ‘worker’ application on the login node? That node has the CUDA package for compiling software and can submit qsub scripts to the GPU cluster, although it has no GPU devices itself. This should also work if I install the ‘master’ application on a separate remote cloud instance and have it SSH directly to the login node, so that the PBS job scripts are passed on to the internal GPU cluster.

Sorry, when I mentioned that the GPU nodes (cluster) are not directly available to the login node, I just meant that I don’t even know the hostnames of the GPU nodes. They do sit on the same local private network, and can only be used by submitting jobs through PBS scripts.

Now my master application is running on a cloud instance, and my cluster registers successfully on that instance, given the SSH information of the login node; the worker application is installed on the login node too. But I encounter an issue: when I run motion correction, it does not find the script file. The SSH command I have is: ssh username@###.###.###.### qsub /path/to/cryosparc_projects/P1/J1/queue_sub_script.sh

In the job log, it is: Failed to launch! 1

Does the cloud instance where the master application sits still need a specific hostname, or can it be an IP address? My login node does have a specific hostname (an actual domain name). Would this be the reason for not finding the file? (I have set up passwordless SSH on the master application machine.)
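
One way to narrow this down, sketched under assumptions: the launch command looks like a send template wrapped around a qsub template, so the script path written on the master must point to the same file on the login node (cryoSPARC expects a shared file system, as mentioned earlier in the thread). The template strings, the `render` helper, and the host `username@login.example.org` below are hypothetical stand-ins for illustration, not values taken from a real configuration.

```python
import re
import shlex

# Hypothetical templates modeled on the command shown above.
SEND_CMD_TPL = "ssh username@login.example.org {{ command }}"
QSUB_CMD_TPL = "qsub {{ script_path_abs }}"


def render(template, values):
    """Replace each {{ name }} placeholder with its value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


def launch_command(script_path_abs):
    """Compose the full command the master would run to submit a job."""
    qsub = render(QSUB_CMD_TPL, {"script_path_abs": script_path_abs})
    return render(SEND_CMD_TPL, {"command": qsub})


def remote_file_check(host, path):
    """Build an argument list that tests whether `path` exists on `host`."""
    return ["ssh", host, "test -f " + shlex.quote(path)]


path = "/path/to/cryosparc_projects/P1/J1/queue_sub_script.sh"
print(launch_command(path))
print(remote_file_check("username@login.example.org", path))
```

Passing the list from `remote_file_check` to `subprocess.run` on the master and looking at the return code would show whether the login node can see the script at that path. A non-zero code would point to a project directory that is not shared, which is a separate matter from whether a hostname or an IP address is used.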

About the API: I simply thought that if the two-piece web application setup doesn’t work, it might be possible to make it standalone, with those APIs working independently and wrapped into PBS scripts for job submission. This is kind of like going backward… from the internet era.

Thanks,