Queue message : GPU not available

ketiwsim · March 16, 2021, 4:42pm

Hello,

I am new to the program and am trying to run the T20S tutorial. I have three machines with 1 GPU each. I got to the motion correction step and set the job to use 3 GPUs (1 on each machine). Unfortunately, the job stays queued permanently:

---------- Scheduler running --------------- 
Jobs Queued:  [('P1', 'J13')]
Licenses currently active : 0
Now trying to schedule J13
  Need slots :  {'CPU': 18, 'GPU': 3, 'RAM': 6}
  Need fixed :  {'SSD': False}
  Master direct :  False
    Queue status : waiting_resources
    Queue message : GPU not available

Since all workers are the same, I have attached the details for one of them from Resource Manager/Instance Information:

Target 3: gr3 node

Cores 16
Memory 64 GB 
GPUs 1
Worker bin path /usr/local/cryoSPARC/cryosparc_worker/bin/cryosparcw
Hostname gri13
Name gr13
Cache path /data-em/cryos/temp
Cache quota (MB)
SSH String cryos@gr13
Cache Reserve (MB) 10000

I am confused about why this fails.

spunjani · March 16, 2021, 6:25pm

Dear @ketiwsim,
Patch Motion Correction and several other jobs in cryoSPARC can be parallelized across GPUs within a single node/workstation but not across multiple nodes. If you clear and re-queue the job requesting 1 GPU, it should run. Thanks!

ketiwsim · March 17, 2021, 4:09pm

Dear Spunjani,

Thank you for your explanation.
Could you specify which programs will use multiple GPUs/CPUs of all workers?
2D classification also seems to not accept more than one GPU.
Best regards,
Ketiw

nfrasser · March 19, 2021, 7:43pm

Hi @ketiwsim, any cryoSPARC job that includes the “Number of GPUs to parallelize” parameter can use multiple GPUs on the same machine. This includes the following jobs:

All motion correction jobs
Patch CTF
2D Classification
Topaz Train and Cross Validation
Deep Picker Train and Inference

However, as @spunjani mentioned, jobs can only use multiple CPUs and GPUs installed on the same machine. A job cannot be distributed across multiple machines.
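
To illustrate the constraint, here is a minimal Python sketch. It is not cryoSPARC's actual scheduler code: the `Target` class, the `targets_that_fit` function, and the worker names are hypothetical, and the 1-GPU request of 6 CPUs is an assumption (one third of the 18 CPUs requested for 3 GPUs in the log above). The point is only that a request is matched against each machine on its own, never against the sum of all machines.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical description of one worker machine."""
    name: str
    cpus: int
    gpus: int


def targets_that_fit(need_cpu, need_gpu, targets):
    """Return the names of targets that can satisfy the whole request alone."""
    return [
        t.name
        for t in targets
        if t.cpus >= need_cpu and t.gpus >= need_gpu
    ]


# Three identical workers as described in the thread: 16 cores and 1 GPU each.
# The names are placeholders.
workers = [Target(f"worker{i}", cpus=16, gpus=1) for i in (1, 2, 3)]

# Request from the scheduler log: 18 CPU slots and 3 GPU slots.
# No single machine has 3 GPUs, so nothing fits and the job stays queued.
print(targets_that_fit(18, 3, workers))

# Assumed request for 1 GPU (6 CPUs): every worker can run it.
print(targets_that_fit(6, 1, workers))
```

In this sketch the first call returns an empty list even though the three machines together have 3 GPUs, which matches the "GPU not available" message.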

qitsweauca · June 4, 2021, 6:14am

Hi @nfrasser,

I have a similar situation and get the same message.
Does cryoSPARC support submitting jobs only through the workload manager (sbatch for SLURM or qsub for PBS) from the login node of an HPC facility? In our facility, the submission commands (qsub or sbatch) on the login node are the only way to reach the GPU nodes; the GPU nodes are not directly accessible from the login node. In other words, if I install the cryoSPARC master on the login node, it cannot communicate with the GPU nodes directly unless I wrap cryoSPARC in a qsub script and submit it as a job.

Does this mean I may need to use the interactive mode of qsub and then run cryosparcw on the login node?

Thank you!

nfrasser · June 4, 2021, 3:08pm

Hi @qitsweauca, cryoSPARC has extensive support for cluster installations, whether the scheduler is SLURM, PBS or something else. Once you’ve installed the master node, you can connect it to a cluster by following these instructions.

When setting up a cluster, you don’t need to know ahead of time what GPUs are available. When a job requires multiple GPUs, the number of required GPUs will be available in the cluster submission script template for use with the qsub command or otherwise. So you should not need qsub’s interactive mode once you’ve set this up.
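
As a rough illustration of how a per-job GPU count can end up in a submission script, here is a minimal Python sketch using the standard library's `string.Template`. It is not cryoSPARC's template engine or syntax: the placeholder names (`job_name`, `num_cpu`, `num_gpu`, `run_cmd`), the `render_submission_script` helper, and the PBS resource line are assumptions made for the example, so check the cluster documentation for the real variable names.

```python
from string import Template

# Hypothetical submission script template. The placeholder names and the
# PBS "select" line are illustrative only; adapt them to your scheduler.
SCRIPT_TEMPLATE = Template("""#!/bin/bash
#PBS -N $job_name
#PBS -l select=1:ncpus=$num_cpu:ngpus=$num_gpu
$run_cmd
""")


def render_submission_script(job_name, num_cpu, num_gpu, run_cmd):
    """Fill the per-job resource counts into the script template."""
    return SCRIPT_TEMPLATE.substitute(
        job_name=job_name,
        num_cpu=num_cpu,
        num_gpu=num_gpu,
        run_cmd=run_cmd,
    )


# Example values: a job that asks for 6 CPUs and 1 GPU on a single node.
script = render_submission_script(
    job_name="cryosparc_P1_J13",
    num_cpu=6,
    num_gpu=1,
    run_cmd="echo 'worker command goes here'",
)
print(script)
```

The rendered text is what would be handed to qsub, so the scheduler picks a node with enough GPUs and no interactive session is involved.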

Hope that answers your questions, let me know if there’s anything that I can further clarify or elaborate on.

qitsweauca · June 8, 2021, 4:48am

Hi @nfrasser,

Although it doesn’t need to know which GPUs are available, it seems that it would still require direct access to the cluster nodes in order to install cryosparc_worker on them. Am I correct?

My environment is a shared facility that only allows access to a login node and job submission via PBS, so it is quite restricted.

Are there any ways to package the master-worker application as a stand-alone application/shell binaries to be submitted for job compute?

I assume cryoSPARC is based on the Flask framework. If it exposes RESTful APIs, can they be used directly somehow?

Thank you.

nfrasser · June 9, 2021, 8:30pm

The cryosparc_worker package does need to be installed on the cluster machines, but you can also do this on the shared file system that gets mounted on all machines (as per cryoSPARC’s requirements). The only constraint with this is that the worker needs to be installed on a machine with GPUs so that cryoSPARC’s compute kernels get compiled against the correct version of CUDA and Nvidia drivers.

Since you don’t have direct access to the cluster environment, I’d suggest writing an installation script based on our instructions. The script should select the shared bulk storage as the installation directory, e.g., your /home directory in the cluster or anywhere else results are saved. Then submit that script to the cluster. Would this solution work for you?

Unfortunately we don’t provide a way to package a cryoSPARC installation as standalone binaries, though we have seen other facilities accomplish this with Singularity or Docker.

Yes, cryoSPARC’s master package runs a Flask- and JSON-RPC-based API, but this cannot be used to do the actual execution of GPU-based jobs. Instead, the Flask server runs a scheduler which either submits the jobs to the cluster or (in a traditional master/worker setup) connects to the worker machine over SSH to execute commands.

The worker does not provide a Flask server and functions as a client of the master server. What precisely did you have in mind in regards to using this system as a workaround?