We have an issue with CryoSPARC v4.4.1 that we did not have with previous versions. We are running non-uniform (NU) refinement with a box size of 648, which a 2080 Ti with 11 GB of VRAM should be able to handle. The following settings are enabled: minimize over per-particle scale, optimize per-particle defocus, and optimize per-group CTF params.
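For context, here is our own back-of-envelope estimate of what a single 648³ volume costs in VRAM (this is only illustrative arithmetic; CryoSPARC's actual allocation pattern is internal and will hold several such buffers at once):

```python
def volume_bytes(box: int, bytes_per_element: int) -> int:
    """Bytes needed to store one box^3 cubic volume at the given element size."""
    return box ** 3 * bytes_per_element

box = 648
real32 = volume_bytes(box, 4)      # float32 real-space volume
complex64 = volume_bytes(box, 8)   # complex64 (Fourier-space) volume

print(f"float32   {box}^3 volume: {real32 / 1e9:.2f} GB")    # ~1.09 GB
print(f"complex64 {box}^3 volume: {complex64 / 1e9:.2f} GB") # ~2.18 GB
```

So even a handful of working copies (half-maps, FFT workspaces, projection buffers) would approach the 11 GB limit, which is why we expected it to fit only with some headroom, not comfortably.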
On a server with 768 GB RAM and 4x 2080 Ti GPUs, we get the following error:
Traceback (most recent call last):
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 855, in _attempt_allocation
    return allocator()
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1058, in allocator
    return driver.cuMemAlloc(size)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2702, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2868, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1148, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.project_model
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 390, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 270, in empty
    return device_array(shape, dtype, stream=stream)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 226, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 21, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1376, in memalloc
    return self.memory_manager.memalloc(bytesize)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1060, in memalloc
    ptr = self._attempt_allocation(allocator)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 867, in _attempt_allocation
    return allocator()
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 1058, in allocator
    return driver.cuMemAlloc(size)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Running the same job with low-memory mode enabled crashes with the same error.
I can run it on a different machine with a Quadro M6000 (24 GB), but only with low-memory mode enabled.
Curiously, we also tried our most powerful server, which has 1 TB RAM and 8x A100 80 GB GPUs. There too, the job only runs with low-memory mode enabled.
What has changed from previous versions of CryoSPARC? With earlier versions, we could run such a job on our 2080 Ti GPUs with low-memory mode disabled.
Any advice will be greatly appreciated. Thank you.