Patch Motion Correction - RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY

Hi @hsnyder & @mmclean, we are having the same issue on one of our systems - on a 2080Ti, where we previously had no issues, Patch Motion always fails on super-res K3 data, even with Low Memory mode set. The Nvidia driver version is 525.60.13 according to nvidia-smi.
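
For reference, the driver version and GPU details can be queried in one step (a quick check, assuming a standard nvidia-smi installation):

# Report driver version, GPU model, and total VRAM per device
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv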

This is a new job, not cloned, CS v4.4.0. Here is the error message:

Error occurred while processing J509/imported/003325998359943907799_23dec20b_2_00004gr_00035sq_v03_00002hln_00003enn.frames.tif
Traceback (most recent call last):
  File "/home/user/software/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/pipeline.py", line 61, in exec
    return self.process(item)
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 192, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 195, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 224, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 201, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 710, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 188, in cryosparc_master.cryosparc_compute.gpu.gpucore.transfer_ndarray_to_cudaarray
  File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/home/user/software/cryosparc/cryosparc2_worker/cryosparc_compute/gpu/driver.py", line 151, in create_array
    handle = allocator()
  File "/home/user/software/cryosparc/cryosparc2_worker/cryosparc_compute/gpu/driver.py", line 137, in <lambda>
    allocator = lambda: cuda_check_error(cuda.cuArrayCreate(desc), "Could not allocate GPU array")
  File "/home/user/software/cryosparc/cryosparc2_worker/cryosparc_compute/gpu/driver.py", line 265, in cuda_check_error
    raise RuntimeError(f"{msg}: {err.name}")
RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY

Marking J509/imported/003325998359943907799_23dec20b_2_00004gr_00035sq_v03_00002hln_00003enn.frames.tif as incomplete and continuing...

Hi @olibclarke, we’re aware of this issue and are looking into it. Sorry for the inconvenience in the meantime.


Thanks Harris - is there a patch in the meantime to revert Patch Motion to the previous version? Or any suggestions for reducing memory requirements beyond just using Low Memory mode?


Unfortunately not… The change that caused this had nothing to do with Patch Motion specifically; it came from the changes involved in shipping our own CUDA version. The workaround would be to downgrade CryoSPARC versions.
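
For anyone taking that route, a minimal sketch of the downgrade (assuming the --version flag accepted by cryosparcm update; v4.3.1 is the release reported to work later in this thread):

# Update the master installation to a specific release (a downgrade here);
# standalone worker installs are typically updated as part of the same process
cryosparcm update --version=v4.3.1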


Just to add: while for others the issue was resolved by switching to Full Frame Motion Correction, we are still getting a similar error with Full Frame.

The error is:

numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

Thank you for looking into this! In the meantime we will downgrade as suggested.

Welcome to the forum @AlexHouser.
On which GPU model did you observe this error?
nvidia-smi --query-gpu=name --format=csv

NVIDIA GeForce RTX 2080

After talking to our neighboring lab, we learned they have been having the same issue but have been working around it by Fourier cropping to 1/2. This worked for us too!

Overriding the number of knots (to X=6, Y=4) seemed to do the trick for us with super-res K3 data on a 2080Ti (also using F-crop=1/2).

Just using F-crop=1/2 with Patch Motion didn’t do the trick on its own, but reducing the number of knots as well worked.

Our neighboring lab also recommended overriding the knots (Z=5, Y=5, X=7) as well as cropping to 1/2, but we found that with our data only the cropping was necessary. Both our lab and their lab are using super-res K3 data.


Maybe it has to do with the number of movie frames? For us (50-frame super-res K3 movies on 2080Ti cards) it only works with X=6, Y=4, low memory mode, and F-crop=1/2. Any more knots, switching off low memory mode, or altering F-crop, and it crashes. Glad to have a workaround!

EDIT: I spoke too soon - it ran OK for 15 mics and then started failing again :( Back to tweaking params.

Hi @olibclarke, I have a potential workaround that may address this. For background, v4.4 includes a new GPU memory management system (using the numba Python library) that does not immediately free memory when it’s no longer required. Instead, it frees in batches or when memory is low.

Your Patch Motion job appears to fail during a special allocation step that is unaware of this memory management system. So there may be some GPU memory that could be freed to make this work.

We should have a fix for this in a future version of CryoSPARC, but in the meantime you could try disabling batched memory deallocation by adding the following line to cryosparc_worker/config.sh:

export NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0
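
For example, to append it from the shell and confirm it took effect (the install path here is taken from the traceback above; adjust to your setup):

# Append the override to the worker config
echo 'export NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0' >> /home/user/software/cryosparc/cryosparc2_worker/config.sh
# Verify it is present; jobs launched after this point will pick it up
grep NUMBA_CUDA /home/user/software/cryosparc/cryosparc2_worker/config.sh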

Let me know if you get a chance to try this and it works for you.

@AlexHouser I am not sure whether the same fix will apply to you; based on the traceback text, your error appears to come from a different allocation path, one that is correctly managed by the new system. We are still investigating other memory usage changes in v4.4.


Great, I will give this a go, thanks!!

I am using CS v4.4.1. I tried adding the above-mentioned line to cryosparc_worker/config.sh, but the problem was not solved; the same error shows up as mentioned by @olibclarke.


Welcome to the forum @Suyog.

What is the output of the command
nvidia-smi
on the CryoSPARC worker?

Thank you for your reply. After reading this thread, I realized that I am using a GPU with 8 GB of VRAM. I ran the same Patch Motion Correction job with F-crop = 1/4, and it worked for me. I also noticed that both of my GPUs were fully in use. Some suggestions above indicate that Patch Motion Correction works fine in CS v4.3.1 or below. Is it better for me to roll back to CS v4.3.1 for my hardware configuration, or should I keep using F-crop = 1/4? I have attached the nvidia-smi output below:
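
As a plain-text alternative to a screenshot, per-GPU memory usage can also be queried directly (same nvidia-smi query style as above):

# Report per-GPU index, model, and memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv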

These are tough choices. For a potential alternative, have you already tried the NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0 workaround suggested above?

Yes, I have tried that. Here is the screenshot of the file:

Adding that line to cryosparc_worker/config.sh did not fix it for us either. Interestingly, for our latest dataset, we are no longer able to work around it with Fourier cropping, low memory mode, and overriding knots during Patch Motion Correction. We ended up having to roll back to v4.3.1.


Thank you for the suggestion. We also downgraded to v4.3.1, and Patch Motion Correction is working fine. Thank you @AlexHouser @nfrasser @wtempel.