Running cryoSPARC v3.0.1 on RHEL 7.9 with CUDA 11.2 and four NVIDIA GeForce GTX 1080 graphics cards.
After launching a patch motion correction job, it seems to run smoothly and successfully aligns a number of frames before crashing with the following error:
File "/home/bio21em1/cryosparc3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1722, in run_with_except_hook run_old(*args, **kw)
File "/home/bio21em1/cryosparc3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs)
File "/home/bio21em1/cryosparc3/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 164, in thread_work work = processor.process(item)
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 190, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 193, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 195, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 251, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 423, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 410, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction.get_framedata
File "cryosparc_worker/cryosparc_compute/engine/newgfourier.py", line 107, in cryosparc_compute.engine.newgfourier.do_fft_plan_inplace
File "/home/bio21em1/cryosparc3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 346, in cufftExecR2C cufftCheckStatus(status)
File "/home/bio21em1/cryosparc3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus raise e skcuda.cufft.cufftInvalidPlan
This crash seems to happen randomly (i.e., on a different image within the dataset each time) and doesn't seem to be related to GPU memory or the number of GPUs I include in the calculation. The cryoSPARC job log shows:
```
Traceback (most recent call last):
  File "/home/bio21em1/cryosparc3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/bio21em1/cryosparc3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bio21em1/cryosparc3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
BrokenPipeError: [Errno 32] Broken pipe
```
The broken pipe looks more like a consequence of the job failing and then becoming uncontactable than the root cause. Full-frame motion correction runs fine on the same GPUs.
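For what it's worth, before launching the job I can also sanity-check each card with the short pycuda snippet below (this is just my own quick check, not anything from cryoSPARC):

```python
# Quick per-GPU sanity check: report free/total memory on every visible card
# to rule out a device left in a bad state by a previous crashed run.
import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    ctx = dev.make_context()            # make this device current
    free, total = cuda.mem_get_info()   # bytes free / total on this device
    print(f"GPU {i} ({dev.name()}): "
          f"{free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")
    ctx.pop()                           # release the context
```

If one card consistently reported much less free memory than the others, that would suggest a stuck process still holding GPU memory from an earlier failed run.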