Hi all,
I keep running into an error like 'skcuda.cufft.cufftAllocFailed' in many kinds of jobs. It goes away when I restart my workstation, but it comes back a day or so later. Are there any suggestions? My GPUs are a 3090 and also an RTX 8000. Thanks very much for any suggestions.
kai
Dear @wonderful, I’m going to direct message you instructions to update to a version of cryoSPARC that potentially fixes this issue for CentOS machines.
@spunjani, I got the same error when I was running 3D variability (v3.1.0, GeForce RTX 2080 Ti, CentOS Linux release 7.7.1908 (Core)) last night. Not sure how to fix this issue completely. Thanks.
Start iteration 0 of 20
[CPU: 6.53 GB] batch 3567 of 3567
[CPU: 7.38 GB] Done. Solving…
[CPU: 8.34 GB] diagnostic: min-ev 2.88105224609375
[CPU: 10.25 GB] diagnostic: num bad voxels 0
[CPU: 8.17 GB] batch 2529 of 3567
[CPU: 6.91 GB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/var3D/run.py", line 526, in cryosparc_compute.jobs.var3D.run.run
File "cryosparc_worker/cryosparc_compute/jobs/var3D/run.py", line 312, in cryosparc_compute.jobs.var3D.run.run.E_step
File "cryosparc_worker/cryosparc_compute/engine/newengine.py", line 312, in cryosparc_compute.engine.newengine.EngineThread.load_models_rspace
File "cryosparc_worker/cryosparc_compute/engine/newgfourier.py", line 153, in cryosparc_compute.engine.newgfourier.rfft3_on_gpu_inplace
File "cryosparc_worker/cryosparc_compute/engine/newgfourier.py", line 72, in cryosparc_compute.engine.newgfourier.get_plan_R2C_3D
File "/data/donghua/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/fft.py", line 127, in __init__
onembed, ostride, odist, self.fft_type, self.batch)
File "/data/donghua/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
cufftCheckStatus(status)
File "/data/donghua/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
raise e
skcuda.cufft.cufftAllocFailed
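(For context, the last frames of the traceback are skcuda building a 3D real-to-complex cuFFT plan on the GPU. The snippet below is only a sketch of that call, not cryoSPARC code, and N is a made-up box size; the point is that cufftAllocFailed is raised when cuFFT cannot allocate the GPU workspace the plan needs.)

import numpy as np
import pycuda.autoinit  # creates a CUDA context on the default GPU
from skcuda import fft as cu_fft

N = 512  # hypothetical box size, not taken from this job
# Plan() calls down into cufftMakePlanMany, which has to allocate GPU
# workspace for the transform; if the card is short on free memory at that
# moment, skcuda raises skcuda.cufft.cufftAllocFailed, as in the traceback above.
plan = cu_fft.Plan((N, N, N), np.float32, np.complex64)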
Hi, can you please also send me the instructions? I’m having the same issues, also CentOS.
Hi @mmiotto,
If you’re on the latest version of cryoSPARC, the fix is included.
If you encounter cuMemHostAlloc failed errors, there is an additional step: add export CRYOSPARC_NO_PAGELOCK=true to the cryosparc_worker/config.sh file, then re-run the job.
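To be clear about what the flag does (this is only a sketch of the behaviour, not the actual cryoSPARC source): with CRYOSPARC_NO_PAGELOCK=true the worker allocates host-side buffers with plain numpy instead of page-locked (pinned) CUDA memory, which is what the "HOST ALLOCATION FUNCTION: using n.empty" lines in a job log indicate.

import os
import numpy as np

def host_alloc_fn():
    # Sketch only: with CRYOSPARC_NO_PAGELOCK=true, fall back to ordinary
    # pageable numpy buffers ("using n.empty" in the job log).
    if os.environ.get('CRYOSPARC_NO_PAGELOCK') == 'true':
        return np.empty
    # Otherwise (assumed default) use pinned host memory via pycuda, which is
    # where cuMemHostAlloc failed errors can come from when pinned memory runs low.
    import pycuda.driver as cuda
    return cuda.pagelocked_empty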
Thank you Stephan. Actually, I get cryosparc_compute.skcuda_internal.cufft.cufftAllocFailed during Extract From Micrographs. I will try it and check.
Hi @stephan - we are still seeing this on our CentOS system with the latest version. Heterogeneous refinements are commonly failing with a cryosparc_compute.skcuda_internal.cufft.cufftAllocFailed error, even though when I check using nvidia-smi they don't seem anywhere close to exceeding the capabilities of the cards (RTX 3090s). Also, sometimes a hetero refine job will run to completion, and sometimes the same job will fail midway through with this error (and I have checked for other processes, so it isn't that). Thoughts?
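(The memory check is nothing fancy - just watching usage while the job runs, with something along the lines of:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5
and usage stays well below the limits of the cards.)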
Cheers
Oli
EDIT: here is the job log in case it is useful:
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
**custom thread exception hook caught something
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.
Hi @olibclarke,
We are more or less at a loss on this one, unfortunately. Are there any other patterns you can detect? Anything to do with box size, number of classes, etc.?
Sorry we can’t be more helpful!