A correct way to share resources between RELION and cryoSPARC on a single machine

Hi all,

I’m trying to figure out the best way to share resources on a single machine with 4 GPUs without a good part of our jobs being killed by “out of memory” errors.

Some users on our server run RELION for data processing (via AnyDesk), others use cryoSPARC (via the web interface). Recently we found that running cryoSPARC jobs on the default lane (all 4 GPUs) can interfere with RELION users who run jobs on the same GPUs; this usually ends with the cryoSPARC job being killed by a CUDA “out of memory” error.

For the time being we’re using the following workaround: RELION users stick to GPUs 2 and 3, and cryoSPARC users stick to GPUs 0 and 1. But this has a downside: when multiple users pick a GPU with “Run on specific GPU” in cryoSPARC, several jobs can end up on the same GPU (which doesn’t happen when submitting to a lane). That leads to the same problem: the cryoSPARC processes fight over memory until one of them crashes with an “out of memory” error.
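For illustration, the RELION side of this split can be enforced by hiding the other GPUs from the session before starting the GUI (a minimal sketch; the GPU indices match the split above, and anything started from that shell only sees the listed devices):

```bash
# Hide the cryoSPARC GPUs (0 and 1) from this session; inside the session
# the remaining devices are renumbered as 0 and 1.
export CUDA_VISIBLE_DEVICES=2,3
relion &

# Check which processes are currently sitting on which GPU
nvidia-smi
```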

What’s the best way to manage this situation? I didn’t want to restrict cryoSPARC to GPUs 0 and 1 only, since we use it roughly 70% of the time, but I don’t know how to set everything up properly.


Not sure it makes sense on a single machine, but a SLURM setup would solve the problem.

Thanks, it would, but that seems like a lot of overhead in this situation.
I’m aware of cryoSPARC lanes, but I’m not sure whether it’s possible to create two lanes that share GPUs (e.g. lane_without_relion with GPUs 0 and 1, and lane_with_all_GPUs with GPUs 0,1,2,3)?
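For reference, what I had in mind is something along these lines, based on my reading of the cryosparcw connect options (paths are placeholders, and I haven’t verified whether a single worker hostname can actually be registered in two lanes at once):

```bash
# Hypothetical: register this node's GPUs 0-1 under a separate, RELION-free lane.
# --newlane/--lane/--gpus are cryosparcw connect options; whether one worker
# can appear in two lanes simultaneously still needs to be confirmed.
/path/to/cryosparc_worker/bin/cryosparcw connect \
    --worker $(hostname) --master $(hostname) --port 39000 \
    --ssdpath /scratch/cryosparc_cache \
    --newlane --lane lane_without_relion \
    --gpus 0,1
```

A simpler (but more restrictive) fallback would be `cryosparcw connect --update --gpus 0,1`, which limits the existing worker entry to those two GPUs but then never lets cryoSPARC touch GPUs 2 and 3.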

I think a cluster workload manager can make sense even on a single host, particularly with 4 heavily used GPUs, provided all workloads are submitted through that manager.
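To make that concrete, a minimal single-node SLURM sketch could declare the 4 GPUs as generic resources and let the scheduler hand them out (hostname, CPU count, memory and device paths below are placeholders):

```
# /etc/slurm/gres.conf -- one entry per GPU device file
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

# slurm.conf (excerpt) -- a single node and a single partition
GresTypes=gpu
NodeName=cryoem-box Gres=gpu:4 CPUs=32 RealMemory=256000 State=UNKNOWN
PartitionName=gpu Nodes=cryoem-box Default=YES MaxTime=INFINITE State=UP

# cgroup.conf -- jobs only see the GPUs they were allocated
ConstrainDevices=yes
```

RELION can then submit through its queue-script mechanism with something like `sbatch --gres=gpu:2`, and cryoSPARC can target the same scheduler via a cluster lane (`cryosparcm cluster connect` with a cluster_info.json / cluster_script.sh that requests the job’s GPU count), so jobs wait for free GPUs instead of crashing into each other.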