CryoSPARC Live: persistent issues on a Slurm cluster

Hello,

We are facing multiple issues when running CryoSPARC Live on a Slurm cluster. The errors appear when we omit Slurm's --exclusive flag, they affect only a subset of the imported micrographs rather than all of them, and they vary between GPUs within the same node. How can we go about fixing this?
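
For context, this is the kind of check we could run inside a shared-node allocation to see what the worker process is actually given. It is only a minimal diagnostic sketch: it assumes pycuda is importable from the cryosparc_worker environment, and that Slurm exports the usual CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS variables for GPU gres (this may differ on our cluster).

import os
import pycuda.driver as cuda

# Which devices does Slurm claim to have handed to this job?
for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS", "SLURM_STEP_GPUS"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")

# Which devices can the CUDA driver actually see and initialize from here?
cuda.init()
print(f"CUDA device count: {cuda.Device.count()}")
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print(f"  device {i}: {dev.name()}, {dev.total_memory() // 2**20} MiB total")

Without --exclusive we would expect this to report only the GPUs Slurm allocated to the job; if all of the node's GPUs show up, the worker may be colliding with devices reserved by another job. The two tracebacks follow.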

Issue #1:
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 371, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 434, in cryosparc_compute.jobs.rtp_workers.run.process_movie
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 580, in cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 585, in cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 255, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 352, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 355, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
File "/admin/software/cryosparc/hite/CS-4.1.2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/gpuarray.py", line 210, in __init__
self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.Error: cuMemAlloc failed: unknown error

Issue #2:
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 371, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 434, in cryosparc_compute.jobs.rtp_workers.run.process_movie
File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 549, in cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 34, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.Error: cuDevicePrimaryCtxRetain failed: unknown error
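
Both failures happen inside pycuda calls (cuDevicePrimaryCtxRetain in Issue #2, cuMemAlloc via the gpuarray allocator in Issue #1), so a per-device probe like the sketch below might help us narrow down whether a specific GPU on the shared node is the culprit. Again, this is only a sketch that assumes pycuda from the cryosparc_worker environment; the 256 MiB allocation size is arbitrary.

import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    try:
        ctx = dev.retain_primary_context()  # the call that fails in Issue #2
        ctx.push()
        buf = cuda.mem_alloc(256 * 2**20)   # allocation path that fails in Issue #1
        buf.free()
        ctx.pop()
        ctx.detach()
        print(f"device {i} ({dev.name()}): OK")
    except cuda.Error as err:
        print(f"device {i} ({dev.name()}): FAILED -> {err}")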

Is there any Slurm configuration that would, independently of CryoSPARC, ensure that jobs sharing a node cannot encroach on another job's reserved GPU device(s)?
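
For reference, the mechanism we suspect is relevant is cgroup device constraint, which should restrict each job to the GPU devices it was actually allocated. The sketch below is based on the Slurm documentation rather than our current production configuration, and the node/device names are placeholders, so please correct us if this is the wrong direction:

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# cgroup.conf
ConstrainDevices=yes

# gres.conf (per-node GPU definitions; AutoDetect=nvml is an alternative if Slurm was built with NVML)
NodeName=gpu-node[01-04] Name=gpu File=/dev/nvidia[0-3]

With ConstrainDevices=yes, a job that requests, say, --gres=gpu:2 should only be able to open its two allocated /dev/nvidia* devices, so a neighbouring job on the same node could not retain a context or allocate memory on them even when --exclusive is omitted.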