CUDA memory error during 2D classification

Hi all,
I keep getting the same error message when I run 2D classification on ~3M particles. After about 5-10 iterations, I get the following traceback error:

Traceback (most recent call last):
  File "/usr/local/CryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1726, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 130, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1096, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 500, in cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 319, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory

It generally crashes once CPU memory usage reaches about 8 GB. My workstation is GPU-optimized with 4x NVIDIA GeForce RTX 2080. nvidia-smi shows driver version 460.39 and CUDA version 11.2, although I specified --cudapath as /usr/local/cuda-10.0. It seems that the 2D classification job is using CPU memory rather than GPU memory. Is there a workaround or fix for this issue?
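(Side note for anyone comparing version numbers like this: the CUDA version printed by nvidia-smi is the level supported by the installed driver, not the toolkit cryoSPARC was configured with. A rough way to check which toolkit the worker is actually pointed at, assuming the paths mentioned above, is something like:

    # Sketch only -- paths taken from the post above; adjust to your own install.
    /usr/local/cuda-10.0/bin/nvcc --version
    grep CRYOSPARC_CUDA_PATH /usr/local/CryoSPARC/cryosparc_worker/config.sh
)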

I am running the most recent version of CryoSPARC.

Thanks in advance,
Karl

@kherbine, thanks for posting. Which OS are you running?

@spunjani Hi, I am running CentOS Linux release 7.9.2009 (Core).

Hi @kherbine,

Thanks for reporting. I’m going to direct message you instructions to update to a version of cryoSPARC that potentially fixes this issue for CentOS machines.

Hi, I have the same issue when running 2D classification and ab-initio reconstruction on CentOS. I am running the latest version of cryoSPARC. Could you please help me with this?

@Cwuz, are you currently running v3.2?

Yes. Current cryoSPARC version: v3.2.0

Thanks @Cwuz. CryoSPARC v3.2 contains an option to work around the bug in CUDA on CentOS 7 that causes cuMemHostAlloc failed errors in multiple job types. To engage this, please add export CRYOSPARC_NO_PAGELOCK=true to the cryosparc_worker/config.sh file.
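For anyone following along, a minimal sketch of that change (the worker path below is an assumption; adjust it to your own installation). New jobs should pick up the variable when they launch, since config.sh is sourced by the worker:

    # Append the workaround to the worker config (path is an example, edit to match your install):
    echo 'export CRYOSPARC_NO_PAGELOCK=true' >> /usr/local/CryoSPARC/cryosparc_worker/config.sh
    # Confirm the line is present:
    grep CRYOSPARC_NO_PAGELOCK /usr/local/CryoSPARC/cryosparc_worker/config.sh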

Hi @spunjani, I am seeing the same problems on CentOS 7. I added this parameter to config.sh, but I still see intermittent GPU memory failures: a job will fail on one run, and when I restart it, it runs fine. Is there anything else one has to do for this workaround to take effect?

Cheers
Oli

Hi @olibclarke,
Do you see the exact same error message, cuMemHostAlloc failed? That error indicates that CPU memory allocation is failing, whereas cuMemAlloc failed indicates a GPU memory issue (which the workaround does not affect).
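As a rough way to see which memory pool is actually running out while a job is in flight (generic Linux tools, not part of cryoSPARC), you can watch host and GPU memory side by side:

    # Run each in its own terminal while the job is processing.
    watch -n 5 free -h        # host (CPU) RAM -- relevant to cuMemHostAlloc failures
    watch -n 5 nvidia-smi     # GPU memory     -- relevant to cuMemAlloc / cufftAllocFailed failures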

Once you set the export CRYOSPARC_NO_PAGELOCK=true line in config.sh, you should start to see the following in the joblog (i.e. if you run cryosparcm joblog PX JX):

HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
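For example (PX and JX are placeholders for your project and job IDs; joblog streams the log, so exit with Ctrl-C once you have seen the line):

    cryosparcm joblog PX JX
    # look for the "HOST ALLOCATION FUNCTION" marker line shown above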

Thanks Ali - unfortunately I have overwritten the job in question, but I will double-check the exact error; that explanation is helpful.

Hi @apunjani, just got the error again (running a 5-class hetero refine job at 300 px box size on a 3090 card):

  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1791, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1027, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 106, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "cryosparc_worker/cryosparc_compute/engine/gfourier.py", line 32, in cryosparc_compute.engine.gfourier.fft2_on_gpu_inplace
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftAllocFailed

Hi @apunjani, here is the relevant joblog - it does have that line, but is still failing. I guess we may have a hardware issue.

FSC Loose Mask...      0.143 at 62.500 radwn. 0.5 at 62.500 radwn. Took 0.380s.
========= sending heartbeat
[ ... repeated heartbeat lines omitted ... ]
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
[ ... repeated heartbeat lines omitted ... ]
**custom thread exception hook caught something
**** handle exception rc
set status to failed
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

@spunjani just as additional support information: I was also having the same issue of "pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory" on a CentOS 7 machine with 4x RTX 2080 Ti GPUs and cryoSPARC v3.2.0. Your advice (adding CRYOSPARC_NO_PAGELOCK=true to the cryosparc_worker/config.sh file) solved the problem. The 2D classification ran smoothly.
Thanks
Ariel

Thanks @atalavera for the additional info!

@olibclarke in your case the error is indeed cufftAllocFailed, meaning that it was GPU memory allocation that failed (and therefore the PAGELOCK workaround would not have any effect). I wouldn't say there is any evidence of a hardware issue yet (usually that would show up as the GPU missing from nvidia-smi output). You've probably checked, but is there any chance something else was running on the GPU at the same time?
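One quick check, if you haven't already, is to glance at the process table at the bottom of the nvidia-smi output while the job is running, to see whether any other process is holding memory on that card:

    # Every process with memory allocated on each GPU is listed in the table at the bottom.
    nvidia-smi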

Hi all,
I encountered similar "out of memory" errors in both the motion correction and 2D classification steps. My system has 64 CPU cores (252 GB RAM) and 4x RTX 3090 (24 GB each). For motion correction, I was able to rescue the movies in the incomplete blob (~5% of all movies) by setting up another run. For 2D classification, I am still unable to solve the problem. The error does not appear if the dataset is small. I made GPU0 invisible and tested with 1, 2, or 3 GPUs, with the same error each time. Please share if you have had any success. Thanks.

Here is the error message.
[CPU: 6.89 GB] Traceback (most recent call last):
  File "/home/cryosparc_user/cryosparc3.3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1811, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1090, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 306, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 333, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory

Hi @olibclarke and @apunjani, I am wondering if this issue has been solved? I have similar issues: CentOS and a 3090 card (running a 4-class hetero refine job, box size 384). NU-refine usually runs fine, but hetero refine gives the error message below.

[CPU: 3.92 GB] Traceback (most recent call last):
  File "/spshared/apps/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1844, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1090, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 306, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 353, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
  File "/spshared/apps/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory

Please can you post the joblog (cryosparcm joblog <project_uid> <job_uid>)?

Also: were there any GPU tasks, other than jobs from that cryoSPARC instance, running at the time?