Motion correction partial error

Hi,

I'm getting a motion correction error, not on all micrographs but on roughly 30% of the dataset.

Here's the error:

[CPU: 320.4 MB Avail: 177.55 GB]
Error occurred while processing J2590/imported/004347276147038339169_OhyA-1.5-substrate-stearicacid-FAD_90-85_0009.tif
Traceback (most recent call last):
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 855, in _attempt_allocation
    return allocator()
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1058, in allocator
    return driver.cuMemAlloc(size)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/pipeline.py", line 61, in exec
    return self.process(item)
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 192, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 195, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 224, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 201, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 397, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 390, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/cryosparc_compute/gpu/gpuarray.py", line 270, in empty
    return device_array(shape, dtype, stream=stream)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/cryosparc_compute/gpu/gpuarray.py", line 226, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/cryosparc_compute/gpu/gpuarray.py", line 21, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1376, in memalloc
    return self.memory_manager.memalloc(bytesize)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1060, in memalloc
    ptr = self._attempt_allocation(allocator)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 867, in _attempt_allocation
    return allocator()
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1058, in allocator
    return driver.cuMemAlloc(size)
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/cm/shared/apps/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

Marking J2590/imported/004347276147038339169_OhyA-1.5-substrate-stearicacid-FAD_90-85_0009.tif as incomplete and continuing…

Any help will be appreciated.
Thanks

Kevin

On what type of GPU(s) did the job run? What is the output of nvidia-smi on the worker?

The output of nvidia-smi looks like this:

[kevin@headnode ~]$ nvidia-smi
Mon Feb 26 18:48:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti      On | 00000000:04:00.0 Off |                  N/A |
| 27%   27C    P8               21W / 250W|      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |

What is unusual to me is that when I split my 4100 movies into 5 subsets, the motion correction job tends to give far fewer errors:

~1200 incomplete movies with a single motion correction job over the whole dataset
~20 incomplete movies when I use 5 subsets of ~800 movies each

It's not necessarily the same movies that are affected each time; it looks random. I have enough disk space as well.
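
If it helps narrow this down, I could log per-GPU memory usage on the node while the job runs, to see whether memory on the working GPUs is gradually being eaten up over the course of a long run (which would fit the pattern of long jobs failing more often than the 5 smaller subsets). Here is a minimal sketch, assuming the nvidia-ml-py (pynvml) package is available in some Python environment on the node; the script name and the 30 s interval are just placeholders:

# watch_gpu_mem.py -- print per-GPU memory usage every 30 seconds
# (assumes `pip install nvidia-ml-py`; reports the same numbers nvidia-smi shows)
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: {mem.used / 2**20:.0f} MiB used / {mem.total / 2**20:.0f} MiB total")
        time.sleep(30)
finally:
    pynvml.nvmlShutdown()
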

Is this a node with a single physical GPU?
Are other GPU tasks (Linux GUI, other CryoSPARC or non-CryoSPARC GPU calculations) running on the node at the same time?
Does defining

export NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0

inside cryosparc_worker/config.sh
(see Patch Motion Correction - RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY - #32 by nfrasser) mitigate the problem?
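
As a quick sanity check that the worker's Python environment actually picks up the setting, something like the snippet below, run with the Python interpreter from the cryosparc_worker environment after sourcing cryosparc_worker/config.sh, should print 0. If I'm reading the Numba source correctly, that environment variable maps to the numba.config.CUDA_DEALLOCS_COUNT attribute (treat that attribute name as an assumption; the environment-variable line is the authoritative part).

# check_deallocs.py -- confirm Numba sees the pending-deallocation limit
import os
from numba import config

print("NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT =",
      os.environ.get("NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT"))
# assumption: this is the config attribute the env var is read into
print("numba.config.CUDA_DEALLOCS_COUNT =", config.CUDA_DEALLOCS_COUNT)
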

It's a node with 8 RTX 2080 Ti GPUs and 55 CPUs.

No other tasks are running on the node.

I’ll definitely try this and get back to you. Thanks