Queueing jobs on specific GPUs

Last I looked into it, it was possible to spoof multiple occurrences of the same workstation/node by way of unique hostname aliases in sshd_config. Each alias can be assigned non-overlapping GPUs at connection time to avoid the most obvious conflict, but there didn't seem to be a way to apply similar CPU/RAM accounting: the cryosparcw script auto-detected everything onboard without an obvious avenue for user control. Thinking it through, however, if the ratio of resources happens to be appropriately provisioned for most use cases, then outside of RBMC this may not be a problem in practice?
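
If memory serves, the registration looked roughly like this; the alias names, master hostname, port and GPU split below are placeholders, and I'm quoting the cryosparcw flags from memory:

```
# Register the same physical box twice, once per hostname alias,
# each advertising a non-overlapping set of GPUs.
# (node01-a / node01-b, cryosparc-master and port 39000 are placeholders.)
cryosparcw connect --worker node01-a --master cryosparc-master --port 39000 \
                   --gpus 0,1 --nossd

cryosparcw connect --worker node01-b --master cryosparc-master --port 39000 \
                   --gpus 2,3 --nossd
```

CPU and RAM would still be auto-detected for both entries, so the over-provisioning caveat above still applies.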

Worth mentioning that I never got as far as testing cache-handling—cryoSPARC will treat both instances as unique resources, which could pose a problem under certain conditions.

All of this was experimental and unsanctioned, of course.

Cheers,
Yang


Thanks, @leetleyang :smiley:

I’d say “I’ll give it a go too and report back” but I’ve just done a big update run on our main processing servers and I’m not about to take one down again given the queue of things to run right now.

Cache collisions shouldn't be a huge issue since each worker can be assigned a different directory (if using the same mount point) or even given its own SSD. Or just use --nossd on an all-SSD system…
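
Something along these lines, for example (the paths are made up, and --update assumes the two aliases are already registered as workers):

```
# Point each spoofed worker at its own cache directory on the shared SSD mount:
cryosparcw connect --worker node01-a --master cryosparc-master --port 39000 \
                   --update --ssdpath /scratch/cryosparc_cache_a

cryosparcw connect --worker node01-b --master cryosparc-master --port 39000 \
                   --update --ssdpath /scratch/cryosparc_cache_b

# ...or skip SSD caching entirely on an all-flash system:
# cryosparcw connect --worker node01-a --master cryosparc-master --port 39000 \
#                    --update --nossd
```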

Might be possible to do it just via /etc/hosts as well, rather than anything exotic with sshd*.
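
i.e. something as simple as this on the master (IP and alias names are placeholders):

```
# Add extra names for the workstation's existing IP so the master resolves
# both aliases to the same box:
echo "10.0.0.21  node01 node01-a node01-b" | sudo tee -a /etc/hosts
```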

Hm. I have another system sitting on my desk which needs its big update as well; I could temporarily play musical chairs with another system's GPUs to experiment… OK, I think I'll try that next chance I get.

* edit: Using dummy NICs if necessary should prevent any weirdness. OK, definitely going to try this.
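
For the record, the dummy-NIC idea would be something like this (interface name and address are arbitrary, and the dummy kernel module may need loading first):

```
# Create a dummy interface with its own address, so a second alias can
# resolve to a distinct IP on the same machine:
sudo modprobe dummy
sudo ip link add cs-dummy0 type dummy
sudo ip addr add 192.168.250.2/32 dev cs-dummy0
sudo ip link set cs-dummy0 up
```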
