Manually assigning a GPU causes immediate job start even when the GPU is already in use by CryoSPARC

Hi CryoSPARC team,

I wanted to run a local resolution estimation job on a specific GPU because it has more VRAM than the others, but that GPU was already busy with another job. The box size is quite large, and I was worried that Fourier padding might exceed the VRAM of the other GPUs. Instead of waiting for the running job to complete, as I expected it would, the local resolution job immediately started running alongside it!

Funnily enough, this hasn’t caused either job to crash (although both are running more slowly than anticipated)… perhaps because it’s a 48 GB GPU and neither job is using enough VRAM to starve the other?
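
(As an aside, per-GPU memory headroom can be checked before manually assigning a job; below is a minimal sketch, assuming nvidia-smi is available on the worker node — the query fields used are standard nvidia-smi ones.)

```python
# List free VRAM per GPU by parsing nvidia-smi's CSV output.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, name, used, total = (field.strip() for field in line.split(","))
    print(f"GPU {idx} ({name}): {int(total) - int(used)} MiB free of {total} MiB total")
```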

Is this behaviour expected (immediate start on manual assignment)?

Thanks,
R

Yes, this is expected.

I see, thanks.

I just ran into the same thing on our instance. Unfortunately, we had jobs crash because they ran out of memory.

Is there a way to disable this behavior?

To me it seems counterintuitive to skip the queue when selecting a specific GPU.

We took note of your concern. Unfortunately, it is not currently possible to block Run on Specific GPU on instances whose GPU scheduler lanes are node-type (as opposed to cluster-type). You may want to ask users of your CryoSPARC instance not to use Run on Specific GPU unless overriding the scheduler is intended and appropriate.

Thanks, I guess I’ll have to change the queues then to offer an alternative way of selecting a specific compute node.

Out of curiosity, why was the GPU selection implemented to skip the resource availability check? I can see a use for both features, but I wouldn’t expect them to be in the same option.

… can be implemented by creating scheduler lanes that each have a single, specific target node. Selection of a specific node is not the intended use case of the Run on Specific GPU option.
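
For example, on an instance with node-type lanes, each worker can be connected (or re-connected) into its own lane, so users can target a specific node by queuing to that lane instead of overriding the scheduler. Below is a rough sketch that only prints the commands to run on each worker — the hostnames, worker install path, and base port are placeholders, and the --lane/--newlane flags should be confirmed against cryosparcw connect --help on your version:

```python
# Print one "cryosparcw connect" command per worker, placing each worker in a
# lane named after itself. Run the printed command on the corresponding node.
WORKERS = ["gpu-node-01", "gpu-node-02"]   # placeholder worker hostnames
MASTER = "cryosparc-master"                # placeholder master hostname
PORT = 39000                               # placeholder CryoSPARC base port

for worker in WORKERS:
    print(
        f"# run on {worker}:\n"
        f"/path/to/cryosparc_worker/bin/cryosparcw connect "
        f"--worker {worker} --master {MASTER} --port {PORT} "
        f"--lane {worker} --newlane"
    )
```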

Yes, I know. That is how I configured it now.

Still, I think an option to skip the queue should be reserved for Admin users, or for users who are also allowed to set job priorities. Also, since it is not really intuitive that selecting a specific GPU skips the resource checks, it is easy to misuse unintentionally.
That’s why I am asking for the reason behind this implementation.

Agreed with this. At least a “warning, are you sure you know what you are doing” type dialog might be useful here.
