Patch Motion Correction - RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY

This problem still persists in v4.5.

Movies: superres K3, 8184x11520 px, 80 frames.
System:
Linux GPU-4X-2080Ti 5.15.0-71-generic #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA-SMI 530.30.02
Driver Version: 530.30.02
CUDA Version: 12.1
GPUs: 4xRTX 2080Ti 11GB, RAM: 256GB
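
For scale, a back-of-envelope check (my own arithmetic, not CryoSPARC's actual allocation pattern) shows that a single movie of this size, held as float32, would not fit on one of these cards even in principle:

# 8184 x 11520 px x 80 frames x 4 bytes (float32) per movie
echo $(( 8184 * 11520 * 80 * 4 ))   # 30169497600 bytes, i.e. ~28 GiB vs 11 GB VRAM

The job presumably streams and crops rather than holding the whole stack on the GPU at once, so this is only an upper bound, but it shows how little headroom these movies leave.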

Using F-crop = 1/2, 1/8, or 1/16 and/or a different number of knots doesn't help.
Setting NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0 also doesn't change anything.
The same task used to run fine in v4.3.
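
In case the mechanism matters: a minimal sketch of how such a variable is typically applied, assuming the standard approach of exporting it in cryosparc_worker/config.sh and restarting the worker:

# Assumed mechanism: append to cryosparc_worker/config.sh, then restart the worker
export NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0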

Here is the full output:

Traceback (most recent call last):
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
    return allocator()
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 59, in exec
    return self.process(item)
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 210, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 213, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 242, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 219, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 628, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 390, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 270, in empty
    return device_array(shape, dtype, stream=stream)
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 226, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 21, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
    return self.memory_manager.memalloc(bytesize)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
    ptr = self._attempt_allocation(allocator)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
    return allocator()
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
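
In case it helps others debugging the same failure, per-GPU memory can be watched while the job runs with a standard nvidia-smi query, to rule out memory held over from earlier jobs:

# Snapshot per-GPU memory use once per second while the job runs
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1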

Was the job

  1. submitted to an external workload manager (like SLURM),
  2. submitted to the CryoSPARC built-in cluster manager, or
  3. launched directly on the GPU(s)?

Do non-CryoSPARC applications or jobs from another CryoSPARC instance also use the GPUs on this host?
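
One way to check, using a standard nvidia-smi query on the worker host:

# List every process currently holding GPU memory on this host
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv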

Is any of these failed attempts an exact clone

  • same worker
  • same data
  • same parameters

of a successful CryoSPARC v4.3 job?

If not, could you please confirm that a job with the same worker, data, and parameters does not fail after downgrading (prerequisites, downgrade instructions) your instance to v4.3.1:

cryosparcm update --version=v4.3.1
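
To confirm the downgrade took effect, the reported version can be checked afterwards, for example:

cryosparcm status | grep -i version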

Until the issue is resolved, you may want to preserve the failed jobs for comparison (that is, neither delete nor re-run them).