Hi all,
I keep getting the same error message when I run 2D classification on ~3M particles. After about 5-10 iterations, I get the following traceback:
Traceback (most recent call last):
File "/usr/local/CryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1726, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 130, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1096, in cryosparc_compute.engine.engine.process.work
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 500, in cryosparc_compute.engine.engine.EngineThread.cull_candidates
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 319, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
It generally crashes once CPU memory reaches 8 GB. My workstation is GPU-optimized with 4x NVIDIA GeForce RTX 2080 cards. nvidia-smi shows driver version 460.39 and CUDA version 11.2, although I specified --cudapath as /usr/local/cuda-10.0. It seems that the 2D classification job is using CPU memory rather than GPU memory. Is there a workaround or fix for this issue?
I am running the most recent version of CryoSPARC.
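In case it's useful, this is roughly how I checked the versions on my end (paths are from my install, and the cryosparcw command is what I believe is used to re-point the worker at a different toolkit, so treat this as a sketch):

nvidia-smi | head -n 4   # driver 460.39; the "CUDA Version: 11.2" shown here is the driver API, not the toolkit
cat /usr/local/cuda-10.0/version.txt   # the toolkit release that --cudapath currently points at
# /usr/local/CryoSPARC/cryosparc_worker/bin/cryosparcw newcuda <path-to-cuda>   # re-point the worker if needed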
Thanks for reporting. I’m going to direct message you instructions to update to a version of cryoSPARC that potentially fixes this issue for CentOS machines.
Hi, I have the same issue when running 2D classification and ab initio reconstruction on CentOS. I have the latest version of cryoSPARC. Could you please help me with that?
Thanks @Cwuz. CryoSPARC v3.2 contains an option to work around the bug in CUDA on CentOS 7 that causes cuMemHostAlloc failed errors in multiple job types. To enable it, please add export CRYOSPARC_NO_PAGELOCK=true to the cryosparc_worker/config.sh file.
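For example, something along these lines (the install path below is taken from the traceback above and may differ on your system):

echo 'export CRYOSPARC_NO_PAGELOCK=true' >> /usr/local/CryoSPARC/cryosparc_worker/config.sh
# then clear and re-queue the job so the worker picks up the new environment variable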
Hi @spunjani, I am seeing the same problem on CentOS 7. I added this line to config.sh, but I still see random GPU memory failures - it will happen on one run, and then I restart the job and it will run ok. Is there anything else one has to do for this workaround to take effect?
Hi @olibclarke,
Do you see the exact same error message, cuMemHostAlloc failed?
This error indicates that it is CPU memory allocation that is failing, whereas cuMemAlloc failed indicates a GPU memory issue (which the workaround does not affect).
Once you set the export CRYOSPARC_NO_PAGELOCK=true line in config.sh, you should start to see the following in the job log (i.e., if you run cryosparcm joblog PX JX):
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
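For example (where PX and JX are your project and job numbers):

cryosparcm joblog PX JX | grep "HOST ALLOCATION FUNCTION"
# expected output: HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)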
Hi @apunjani, just got the error again (running a 5-class hetero refine job at a 300 px box size on a 3090 card):
File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1791, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1027, in cryosparc_compute.engine.engine.process.work
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 106, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
File "cryosparc_worker/cryosparc_compute/engine/gfourier.py", line 32, in cryosparc_compute.engine.gfourier.fft2_on_gpu_inplace
File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
onembed, ostride, odist, self.fft_type, self.batch)
File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftAllocFailed
@spunjani just as supporting information: I was also having the same issue of "pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory" on a CentOS 7 machine with 4x 2080 Ti GPUs and CryoSPARC 3.2.0. Your advice (adding export CRYOSPARC_NO_PAGELOCK=true to the cryosparc_worker/config.sh file) solved the problem, and the 2D classification ran smoothly.
Thanks
Ariel
@olibclarke in your case the error is indeed cufftAllocFailed, meaning that it was GPU memory allocation that failed (and therefore the PAGELOCK workaround would not have any effect). I wouldn't say there is any evidence of a hardware issue yet (usually that would show up as the GPU missing from nvidia-smi output). You've probably checked, but is there any chance something else was running on the GPU at the same time?
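A quick way to check is to look at what is holding GPU memory while the job runs, e.g.:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv   # every process currently holding GPU memory
watch -n 2 nvidia-smi   # or watch total usage while the hetero refine is running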
Hi all,
I encountered similar "out of memory" errors in both the motion correction step and the 2D classification step. My system has 64 CPU cores (with 252 GB RAM) and 4x RTX 3090 (24 GB each). For motion correction, I was able to rescue the movies in the incomplete blob (~5% of all movies) by setting up another run. For 2D classification, I am still not able to solve the problem. The error does not show up when the dataset is small. I made GPU 0 invisible and tested with 1, 2, or 3 GPUs, with the same error. Please share if you have had some success. Thanks.
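(For anyone wanting to try the same thing: hiding a GPU from CUDA applications is, roughly, a matter of restricting which devices CUDA can see, e.g. via an environment variable; exactly where this is set for the CryoSPARC worker may differ in your setup.)

export CUDA_VISIBLE_DEVICES=1,2,3   # the CUDA runtime will then only see GPUs 1-3, hiding GPU 0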
Here is the error message.
[CPU: 6.89 GB] Traceback (most recent call last):
File "/home/cryosparc_user/cryosparc3.3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1811, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1090, in cryosparc_compute.engine.engine.process.work
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 306, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 333, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
Hi @olibclarke and @apunjani, I am wondering if this issue has been solved? I have similar issues on CentOS with a 3090 card (running a 4-class hetero refine job, box size 384). NU-refine is usually fine, but hetero refine gives this error message:
[CPU: 3.92 GB] Traceback (most recent call last):
File "/spshared/apps/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1844, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1090, in cryosparc_compute.engine.engine.process.work
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 306, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 353, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
File "/spshared/apps/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory