Hi, is there a way to queue jobs on specific GPUs without having them all run at the same time?
On our workstation we have different users, running both RELION and cryoSPARC, and depending on the workload we book the use of specific GPUs.
Unfortunately the “Queue directly to GPU” option overrides the scheduler, so it is not possible to queue multiple successive jobs (for example overnight).
Is there a way to specify the GPU to use and also schedule the jobs according to the resources available?
As a workaround (albeit an inelegant one), have you considered setting up separate worker lanes for each GPU?
You can get around the issue of hostname duplication with ssh_config host aliases. There are oddities in how cache-locks are (not) honoured under such conditions, e.g. when jobs sent to separate lanes happen to cache the same input data to ssdpath. However, if this is a common occurrence in your workflow, you can specify a unique ssdpath for each worker lane as well.
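For example (a minimal sketch; the alias and host names here are purely illustrative), a pair of host aliases in cryosparcuser's ~/.ssh/config on the master could look like:

```
# two ssh aliases that point at the same physical workstation,
# so each can be connected as a separate CryoSPARC worker target
Host workerhost-gpu0
    HostName workerhost.mydomain.com
    User cryosparcuser

Host workerhost-gpu1
    HostName workerhost.mydomain.com
    User cryosparcuser
```

Each alias can then be given to cryosparcw connect as a distinct --worker name, which gets around the hostname-duplication check.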
Alternatively, if you only ever have one cryoSPARC user sharing the workstation with RELION jobs, you can have a single worker lane for which the gpu configuration can be updated when required.
More than a year has passed, and I would like to ask whether there is any easy solution in place for standalone workstations, as discussed above… I understand that for clusters, lane configuration works well.
I do not think so. The challenging part would be to configure the cluster manager and job template(s) to fit your needs. You could then use commands like cryosparcm cluster connect with your existing CryoSPARC installation to update the lane and target configurations that are stored in the CryoSPARC database.
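As a rough illustration only (the SLURM-style templates, lane name and paths below are assumptions for a hypothetical single-node setup, and a matching cluster_script.sh job template is also needed in the same directory), registering such a cluster lane could look like:

```
# run on the CryoSPARC master, from a directory that also contains cluster_script.sh
cat > cluster_info.json <<'EOF'
{
  "name": "workstation-slurm",
  "worker_bin_path": "/path/to/cryosparc_worker/bin/cryosparcw",
  "cache_path": "/path/to/scratch",
  "send_cmd_tpl": "{{ command }}",
  "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
  "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
  "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
  "qinfo_cmd_tpl": "sinfo"
}
EOF
cryosparcm cluster connect
```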
I would also appreciate a fix for this - it is strange that by default “run on specific GPU” overrides the scheduler.
This might sometimes be useful - e.g. using a single GPU with lots of VRAM to run multiple jobs on the same GPU - but more often, I would like to avoid running on a particular GPU on my workstation because I am using it for something else - alphafold calculations, relion, whatever. I know I could use a cluster resource manager, but that is a lot of overhead for occasional use cases on a standalone workstation.
Would it be possible to update the default behavior to “queue on a specific GPU”, rather than “run on a specific GPU”? Perhaps with an “override scheduler” checkbox for those who want it?
As it is accessed from the “Queue Job” menu, this would also make more sense to new users, I think.
@olibclarke @rbs_sci We have noted the feature request. At the moment, even on a single (multi-GPU) workstation, we are not aware of a method, other than an external workload manager with proper resource isolation, that would reliably queue a mix of CryoSPARC and non-CryoSPARC workloads.
Thanks @wtempel - but I think the request is rather simpler - not asking for a smart, comprehensive queuing system.
It is just to be able to queue to a specific GPU (as opposed to running the job regardless of whether other CS jobs are already running on the same GPU).
This would allow for avoiding a specific GPU (if I know I am running something on there outside CS), as well as targeting a specific GPU (so e.g. if I have a mix of GPUs I can submit a big box refinement to the GPU with the most VRAM).
I’m not asking (and I don’t think Oli is either) for a system-wide scheduler - just for the CryoSPARC scheduler to have the logic not to run a job on a GPU that is already running a GPU job (one assigned manually to that GPU by the user, from CryoSPARC).
I see I am not the only one a bit upset with this old default behavior. Glad to see this topic revived!
Thanks @olibclarke and @rbs_sci for outlining the need so well, and thank you @wtempel for following up on the topic. I hope that what seems to me a simple need/request - altering the default behavior - is not too difficult to implement.
This may be possible, with the caveat that the solution proposed below will not be aware of:

- non-CryoSPARC workloads
- CryoSPARC workloads from another CryoSPARC instance, if the computer serves as a worker for multiple CryoSPARC instances.
Suppose a hypothetical scenario where a two-GPU worker was originally connected and linked to the default scheduler lane by cryosparcuser, with a command run on the relevant worker node, say worker1.mydomain.com:
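A sketch of what that command may have looked like (the cryosparc_worker install path is a placeholder):

```
/path/to/cryosparc_worker/bin/cryosparcw connect \
    --worker worker1.mydomain.com \
    --master master.mydomain.com \
    --port 61000 \
    --ssdpath /disks/scratch1
```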
where:

- master.mydomain.com corresponded to the value of the CRYOSPARC_MASTER_HOSTNAME variable defined inside cryosparc_master/config.sh
- 61000 corresponded to the value of the CRYOSPARC_BASE_PORT variable inside cryosparc_master/config.sh
- /disks/scratch1 is the dedicated CryoSPARC scratch device on the worker node
This would have created a scheduler target with "hostname": "worker1.mydomain.com" and "lane": "default".
Then one could instead run these commands on worker1.mydomain.com:
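A sketch of such commands, assuming ssh host aliases worker1-gpu0 and worker1-gpu1 have been set up (along the lines of the earlier ssh_config suggestion) to point at worker1.mydomain.com; the lane names and scratch sub-directories are likewise illustrative:

```
# connect GPU 0 as its own scheduler target, in its own lane
/path/to/cryosparc_worker/bin/cryosparcw connect \
    --worker worker1-gpu0 \
    --master master.mydomain.com \
    --port 61000 \
    --ssdpath /disks/scratch1/gpu0 \
    --gpus 0 \
    --newlane --lane worker1-gpu0

# connect GPU 1 as its own scheduler target, in its own lane
/path/to/cryosparc_worker/bin/cryosparcw connect \
    --worker worker1-gpu1 \
    --master master.mydomain.com \
    --port 61000 \
    --ssdpath /disks/scratch1/gpu1 \
    --gpus 1 \
    --newlane --lane worker1-gpu1

# then, on the master, remove the original combined target
cryosparcm cli "remove_scheduler_target_node('worker1.mydomain.com')"
```

Separate ssdpath sub-directories are used here to sidestep the cache-lock caveat mentioned earlier; a shared path may also work if the per-GPU targets never cache the same inputs at the same time.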
One should also remove any scheduler lane that may have ended up empty after removal of a scheduler node, using the remove_scheduler_lane() cli function.
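For example, if the default lane ends up empty in the scenario above:

```
cryosparcm cli "remove_scheduler_lane('default')"
```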
When connecting gpu-specific targets:

- refer to the output of the command cryosparcw gpulist for the appropriate gpu indices (not nvidia-smi)
Right - thank you for the workaround - but to be clear the request is specific to how the GUI works.
“Queue on specific GPU” implies queuing, which is not currently what happens - the job is just submitted, regardless of what other cryoSPARC jobs are already running.
The normal cryosparc scheduler is aware of other jobs and does do this, so it is not completely clear to me why the “queue on specific GPU” scheduler does not have this capacity.
Just to confirm, the current intended behaviour of the “Run on specific GPU” tab in the Queue dialog is to run the job immediately on the specific GPU (i.e. to override any checks by the scheduler other than that inputs are ready).
The reason that this is the current behaviour is not simply because it is set as the UI’s default; currently, the CryoSPARC scheduler internally does not have the ability to “Queue on a specific GPU”. The scheduler’s internal logic is based on lanes, and we have not yet implemented a mode for it that can schedule at a finer granularity (e.g. a single GPU).
This is similar to the fact that if you had a lane with multiple nodes, there is no support in the scheduler currently for queueing to just one of the nodes - a job has to be queued to the lane, or else skip resource checks in the scheduler altogether.
This is why we can’t yet change the UI behaviour. We’re definitely aware of and tracking this feature request though, and having the feedback from you all is very helpful!
Is there a way of spoofing multiple lanes on a single node?
So, for example, master/worker1/worker2 on a single system, where (pseudo-)worker1 could be assigned, say, 50% of the cores, 50% of the RAM and all of the 24GB GPUs (e.g. 8 total), while (pseudo-)worker2 could be assigned 50% of the cores, 50% of the RAM and all of the 48GB GPUs (e.g. 2 total). Because of the autodetection during install, I suspect that might be trickier in practice than it sounds, though?