Hi All,
I am getting a CUDA_ERROR_OUT_OF_MEMORY error in 2D Classification after upgrading to v5 (currently on v5.0.2). The error always occurs with 400 classes, whether the job runs on 1 or 2 GPUs (GeForce RTX 4090, 24 GB of VRAM each). The same jobs run fine with 300 or 250 classes. On v4.7.1 I could run two instances of 400-class 2D Classification on a single 4090; now I have to run the same job on GPUs with 48 GB of VRAM (RTX A6000). Below is the error message I received:
Traceback (most recent call last):
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
return allocator()
^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
return driver.cuMemAlloc(size)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cli/run.py", line 105, in cli.run.run_job
File "cli/run.py", line 210, in cli.run.run_job_function
File "compute/jobs/class2D/run.py", line 295, in compute.jobs.class2D.run.run_class_2D
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/compute/alignment.py", line 652, in greedy_align_2D_noqueue
align_res = align_pairs(
^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/compute/alignment.py", line 440, in align_pairs
NET.ensure_allocated("denom", (N_H, N_KK, N_S, N_R), n.float32)
File "compute/gpu/gpucore.py", line 399, in compute.gpu.gpucore.EngineBaseThread.ensure_allocated
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/compute/gpu/gpuarray.py", line 377, in empty
return device_array(shape, dtype, stream=stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/compute/gpu/gpuarray.py", line 333, in device_array
arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/compute/gpu/gpuarray.py", line 122, in __init__
super().__init__(shape, strides, dtype, stream, gpu_data)
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
gpu_data = devices.get_context().memalloc(self.alloc_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
return self.memory_manager.memalloc(bytesize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
ptr = self._attempt_allocation(allocator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
return allocator()
^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
return driver.cuMemAlloc(size)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcuser/Applications/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
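For context on the failing call: the traceback shows the allocation that fails is the "denom" buffer of shape (N_H, N_KK, N_S, N_R) in float32, so the requested size is simply the product of those dimensions times 4 bytes. The real dimension values are internal to CryoSPARC and are not printed in the log; the values below are placeholders purely to illustrate how the buffer scales with the dimensions:

```python
import numpy as np

# Hypothetical placeholder dimensions -- the actual N_H, N_KK, N_S, N_R
# depend on box size, class count, and search settings inside CryoSPARC
# and are NOT known from the traceback.
N_H, N_KK, N_S, N_R = 2, 400, 64, 128

# Size of the float32 "denom" buffer that cuMemAlloc fails to allocate:
n_bytes = N_H * N_KK * N_S * N_R * np.dtype(np.float32).itemsize
print(f"denom buffer: {n_bytes / 1024**3:.3f} GiB")
```

Since the size is a plain product, any dimension that grows with the class count would grow this buffer proportionally, which may be why 300 classes fits and 400 does not.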
I have been monitoring VRAM usage: it sits around 7-8 GB during the initial iterations, then jumps to ~15 GB as soon as the job starts its first full iteration, which is immediately followed by the CUDA_ERROR_OUT_OF_MEMORY error. I never saw the full 24 GB of VRAM fill up before the job crashed.
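In case anyone wants to reproduce the measurement, here is a minimal sketch of the kind of polling I did, assuming nvidia-smi is on PATH (the query flags are standard nvidia-smi options; the helper names are my own):

```python
import subprocess
import time

def parse_used_mib(csv_line: str) -> int:
    # With --format=csv,noheader,nounits, nvidia-smi prints one bare
    # integer (MiB used) per GPU line, e.g. "15023".
    return int(csv_line.strip())

def poll_vram(interval_s: float = 1.0) -> None:
    """Print per-GPU VRAM usage once per interval (requires nvidia-smi)."""
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        used = [parse_used_mib(ln) for ln in out.splitlines() if ln.strip()]
        print(time.strftime("%H:%M:%S"), [f"{m} MiB" for m in used])
        time.sleep(interval_s)
```

Running this alongside the job shows the jump at the first full iteration clearly.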
I have downgraded to v4.7.1 for now because of this issue, but I really hope it can be fixed in v5 soon.
Thanks.