Hi all,
I’m trying to figure out the best way to share resources on a single machine with 4 GPUs without most of the jobs being interrupted by “out of memory” errors.
Some users on our server use Relion for data processing (via AnyDesk), and some use cryoSPARC (via its web interface). Recently we found that running cryoSPARC jobs on the default lane (all 4 GPUs) can interfere with Relion users, who also run jobs on the same GPUs. This usually results in cryoSPARC jobs being interrupted with a CUDA “out of memory” error.
For the time being, we’re using the following workaround: Relion users use GPUs 2 and 3, and cryoSPARC users use GPUs 1 and 2. But this has a downside: when multiple users specify a GPU with “Run on specific GPU” in cryoSPARC, multiple jobs can end up on the same GPU (which doesn’t happen when submitting jobs to a lane). This leads to the same problem – multiple cryoSPARC processes interfere with each other until one of them crashes with an “out of memory” error.
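For reference, the pinning described above can be done roughly like this (a sketch only – the hostnames, port, and paths are placeholders, and the `cryosparcw connect --update --gpus` invocation assumes a recent cryoSPARC version with the standalone worker CLI):

```shell
# Relion side: start each Relion session with only GPUs 2 and 3 visible,
# so its CUDA processes cannot allocate memory on the cryoSPARC GPUs.
export CUDA_VISIBLE_DEVICES=2,3
relion &

# cryoSPARC side: re-register the worker so the default lane only
# schedules onto GPUs 1 and 2 (run from the cryosparc_worker directory).
bin/cryosparcw connect \
    --worker $(hostname) \
    --master $(hostname) \
    --port 39000 \
    --update \
    --gpus 1,2
```

Note that `CUDA_VISIBLE_DEVICES` only hides devices from the process that inherits it; it doesn’t reserve memory, so any job launched without it can still land on those GPUs.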
What’s the best way to manage this situation? I didn’t want to restrict cryoSPARC to GPUs 1 and 2 only, since we use it roughly 70% of the time, but I don’t know how to set everything up properly.