Patch Motion Correction - RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY

This problem still persists in v4.5.

Movies: superres K3, 8184x11520 px, 80 frames.
System:
Linux GPU-4X-2080Ti 5.15.0-71-generic #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA-SMI 530.30.02
Driver Version: 530.30.02
CUDA Version: 12.1
GPUs: 4xRTX 2080Ti 11GB, RAM: 256GB

Using F-crop = 1/2, 1/8, or 1/16 and/or a different number of knots doesn’t help.
Adding NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0 also doesn’t change anything.
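For anyone trying the same thing: NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT is a Numba environment variable, so it only takes effect if it reaches the worker process environment. A minimal sketch of one way to do that (the `config.sh` path here is a local stand-in; CryoSPARC workers source `cryosparc_worker/config.sh`, so adjust to your install):

```shell
# Sketch: Numba reads NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT from the
# environment of the worker process. CryoSPARC workers source
# cryosparc_worker/config.sh, so the export can go there.
# CONFIG below is a stand-in file, not a real install path.
CONFIG=./config.sh
echo 'export NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0' >> "$CONFIG"
# Confirm the line landed in the file:
grep 'NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT' "$CONFIG"
```

Setting the value to 0 disables Numba's deferred deallocation, so freed GPU buffers are released immediately rather than batched.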
The same task used to run fine in v4.3.

Here is the full output:

Traceback (most recent call last):
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
    return allocator()
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 59, in exec
    return self.process(item)
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 210, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 213, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 242, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 219, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 628, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 390, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 270, in empty
    return device_array(shape, dtype, stream=stream)
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 226, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)
  File "/home/eugene/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 21, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
    return self.memory_manager.memalloc(bytesize)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
    ptr = self._attempt_allocation(allocator)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
    return allocator()
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/eugene/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

Was the job

  1. submitted to an external workload manager (like SLURM),
  2. submitted to the CryoSPARC built-in cluster manager, or
  3. launched directly on GPU(s)?

Do non-CryoSPARC applications or jobs from another CryoSPARC instance also use the GPUs on this host?

Is any of these failed attempts an exact clone of a successful CryoSPARC v4.3 job, that is, with the

  • same worker
  • same data
  • same parameters?

If not, can you please confirm that a job with the same worker, data, and parameters does not fail after downgrading your instance (see prerequisites and downgrade instructions) to v4.3.1:

cryosparcm update --version=v4.3.1

Until the issue is resolved, you may want to preserve the failed jobs for comparison (that is, neither delete nor re-run them).

We are experiencing this error when running Patch Motion Correction as a multi-GPU job. In CryoSPARC Live (v4.5.3), everything functions correctly, even with multiple preprocessing workers. However, once the job is started from the workspace, every multi-GPU job fails with CUDA_ERROR_OUT_OF_MEMORY. Currently I am running a single-GPU job, and the error has not occurred. Additionally, the low-memory option does not seem to have any effect. It would be helpful if this issue could be resolved; before upgrading to CryoSPARC 4.5, everything worked as expected.

@dzyla What is the output of the command

nvidia-smi --query-gpu=index,name --format=csv

on the affected worker(s)?
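As an aside, the CSV output of that query is easy to check programmatically if you manage many workers. A small sketch, not part of CryoSPARC: the `low_vram_gpus` helper and the sample values below are illustrative (memory.total was added to the query to make the check meaningful):

```python
import csv, io

# Sample CSV as produced by:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv
# (values are illustrative, matching an 11 GB RTX 2080 Ti)
sample = """index, name, memory.total [MiB]
0, NVIDIA GeForce RTX 2080 Ti, 11264 MiB
1, NVIDIA GeForce RTX 2080 Ti, 11264 MiB
"""

def low_vram_gpus(csv_text, min_mib=12288):
    """Return indices of GPUs whose total VRAM is below min_mib."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    flagged = []
    for row in rows[1:]:                            # skip the header row
        idx = int(row[0])
        total = int(row[2].strip().split()[0])      # "11264 MiB" -> 11264
        if total < min_mib:
            flagged.append(idx)
    return flagged

print(low_vram_gpus(sample))  # -> [0, 1]
```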

The result is:

workstation 1:

index, name
0, NVIDIA GeForce RTX 3070
1, NVIDIA GeForce RTX 3070
2, NVIDIA GeForce RTX 3070
3, NVIDIA GeForce RTX 3070

workstation 2:

index, name
0, NVIDIA GeForce RTX 2080 Ti
1, NVIDIA GeForce RTX 2080 Ti
2, NVIDIA GeForce RTX 2080 Ti
3, NVIDIA GeForce RTX 2080 Ti

Both worked well previously, and we have never had issues with this error.

@dzyla We expect a modest increase in VRAM usage after an upgrade to CryoSPARC v4.4+. On GPUs whose VRAM is below, or barely at, the (by now fairly dated) minimum recommendation of 11 GB, certain job types may fail due to insufficient VRAM. We are considering an increase in the minimum VRAM recommendation for recent versions of CryoSPARC.

Was the change in version 4.4 and later so significant that the motion correction feature, which previously worked perfectly, is now showing errors? Is the live version still using the old algorithm? I have not encountered any issues with GPU memory in the live version. I would appreciate the addition of a legacy Patch Motion Correction to avoid the need for hardware upgrades.

Hi, I am having the same error. I read the thread and changed some of the settings for the job: I turned on low-memory mode, turned on saving results in 16-bit floating point, turned off outputting denoiser training data, selected an Output F-crop factor of 1/2, and set Z = 5, Y = 5, X = 7 for the override knots.
Still got the error.
CryoSPARC version: 4.6.2
Error msg:

[CPU:  282.8 MB  Avail: 242.05 GB]
Error occurred while processing J1/imported/006894924300175400923_25jun09a_00004hl_00003ex.tif
Traceback (most recent call last):
  File "/scratch/users/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 59, in exec
    return self.process(item)
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 213, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 216, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 245, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 222, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 710, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 207, in cryosparc_master.cryosparc_compute.gpu.gpucore.transfer_ndarray_to_cudaarray
  File "/scratch/users/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/scratch/users/cryosparc/cryosparc_worker/cryosparc_compute/gpu/driver.py", line 169, in create_array
    handle = allocator()
  File "/scratch/users/cryosparc/cryosparc_worker/cryosparc_compute/gpu/driver.py", line 155, in <lambda>
    allocator = lambda: cuda_check_error(cuda.cuArrayCreate(desc), "Could not allocate GPU array")
  File "/scratch/users/cryosparc/cryosparc_worker/cryosparc_compute/gpu/driver.py", line 284, in cuda_check_error
    raise RuntimeError(f"{msg}: {err.name}")
RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY
Marking J1/imported/006894924300175400923_25jun09a_00004hl_00003ex.tif as incomplete and continuing...
**nvidia-smi output:**
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 31%   35C    P8     1W / 250W |      6MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:5E:00.0 Off |                  N/A |
| 30%   34C    P8     2W / 250W |      6MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 29%   30C    P8    12W / 250W |      6MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:AF:00.0  On |                  N/A |
| 31%   37C    P8    19W / 250W |    229MiB / 11264MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3150      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      3150      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      3150      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      3150      G   /usr/lib/xorg/Xorg                 68MiB |
|    3   N/A  N/A    122305      G   /usr/lib/firefox/firefox          158MiB |

Hi @Nishat1, how large are your micrographs?

Micrographs are in the range of 300-600 MB each (some are 300 MB, some are 600 MB), and there are 28k raw movies.

The dataset is 11 TB in total.

@Nishat1, sorry I meant what’s the resolution in pixels, and how many frames? I should have been more specific, I apologize.

The physical pixel size is 0.825 Å; the raw data is in super-resolution, so I used 0.4125 Å when importing the raw frames, and set Output F-crop to 1/2 while running Patch Motion Correction.
63 frames per exposure.

@Nishat1 are these K3 super-resolution movies? 8184x11520? I tried a similar setup using K3 super-res movies, with low-memory mode and F-crop 1/2 on, and I never saw more than 8 GiB of VRAM in use. Is it possible that another process was using your GPU at the same time?
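For scale, a rough back-of-envelope on the frame sizes discussed in this thread. This assumes float32 buffers and says nothing about how patch motion actually allocates memory internally; it only shows why F-crop shrinks the per-frame footprint so sharply:

```python
# Rough memory footprint of one K3 super-res frame (8184 x 11520 px),
# with and without F-crop 1/2. Assumes float32 (4 bytes/px); the real
# allocation pattern inside patch motion differs.
w, h, bpp = 8184, 11520, 4
frame = w * h * bpp / 2**30                   # GiB per full-res frame
cropped = (w // 2) * (h // 2) * bpp / 2**30   # GiB after F-crop 1/2
print(f"full-res: {frame:.2f} GiB, F-crop 1/2: {cropped:.2f} GiB")
```

So a full-resolution float32 frame is about 0.35 GiB, and F-crop 1/2 cuts the pixel count (and hence any per-frame buffer) by 4x.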


Yes, these are K3 super-resolution movies, and yes, 8184x11520. I checked for other processes; nothing else is running. However, I switched to another workstation with 16 GB of VRAM, and it seems to be working for now.
