cufftAllocFailed during Patch motion correction (multi)

I am trying patch motion correction on the movies from the cryoSPARC tutorial. When I run the job, I get the error below. The only other topic I have seen on this issue involved other users running jobs on the same node, which is not the case for us (I am the only user currently on this machine).

We have 4 RTX 2080s with CUDA 10.1, driver version 418.87.00. I do not think the issue is with CUDA, since patch CTF works just fine, but I am happy to be proven wrong.

Job will process this many movies:  20

parent process is 18714

Calling CUDA init from 18778

Calling CUDA init from 18779

-- 1.0: processing J1/imported/14sep05c_00024sq_00003hl_00002es.frames.tif
        loading /troll/scratch/houser/cryosparc/P26/J1/imported/14sep05c_00024sq_00003hl_00002es.frames.tif
        Loading raw movie data from J1/imported/14sep05c_00024sq_00003hl_00002es.frames.tif ...
        Done in 2.70s
        Loading gain data from J1/imported/norm-amibox05-0.mrc ...
        Done in 0.07s
        Processing ...

-- 0.0: processing J1/imported/14sep05c_00024sq_00003hl_00005es.frames.tif
        loading /troll/scratch/houser/cryosparc/P26/J1/imported/14sep05c_00024sq_00003hl_00005es.frames.tif
        Loading raw movie data from J1/imported/14sep05c_00024sq_00003hl_00005es.frames.tif ...
        Done in 2.61s
        Loading gain data from J1/imported/norm-amibox05-0.mrc ...
        Done in 0.05s
        Processing ...

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 53, in stage_target
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 146, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 149, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 150, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 53, in stage_target
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 146, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 149, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 150, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed

Sorry, just found the troubleshooting guidelines.

cryoSPARC version info:

$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/cryosparc/cryosparc2/cryosparc2_master
Current cryoSPARC version: v2.11.0
----------------------------------------------------------------------------

cryosparcm process status:

app                              STOPPED   Not started
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 13125, uptime 4:10:40        
command_proxy                    RUNNING   pid 13166, uptime 4:10:37        
command_rtp                      STOPPED   Not started
command_vis                      RUNNING   pid 3356, uptime 0:00:04
database                         RUNNING   pid 13039, uptime 4:10:42        
watchdog_dev                     STOPPED   Not started
webapp                           RUNNING   pid 13181, uptime 4:10:36        
webapp_dev                       STOPPED   Not started

----------------------------------------------------------------------------

global config variables:

export CRYOSPARC_LICENSE_ID="xxxx"
export CRYOSPARC_MASTER_HOSTNAME="troll.ohsu.edu"
export CRYOSPARC_DB_PATH="/home/cryosparc/cryosparc2/cryosparc2_database"   
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false

CUDA version info:
export CRYOSPARC_CUDA_PATH=/usr/local/cuda-10.1

OS: CentOS 7.7.1908
This cryoSPARC installation is running as a standalone

I’m on the latest version of cryoSPARC and have restarted both cryoSPARC and the machine. I have also updated the drivers, etc. As I said, the only other report of this issue that I found was resolved by making sure no other users were running jobs, which is already the case here.

Hi @posertinlab,
Thanks for reporting. Can you try running the same job while monitoring
watch -n 1 nvidia-smi
which should report the GPU memory usage before the job fails? The error cufftAllocFailed means that the GPU has run out of memory: your RTX 2080 GPUs have only 8 GB each, and the tutorial data is K2 super-resolution, which may require more.
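
If it is hard to catch the peak by eye, a small polling script can log per-GPU memory usage to the terminal. This is just a quick sketch (not part of cryoSPARC) that assumes nvidia-smi is on your PATH:

import subprocess
import time

# Poll nvidia-smi every 0.2 s and print per-GPU memory usage (used / total, in MiB).
while True:
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    line = out.decode("utf-8").strip().replace("\n", " | ")
    print("%s  %s" % (time.strftime("%H:%M:%S"), line))
    time.sleep(0.2)

Running that in a second terminal while the job runs should show whether usage spikes right before the cufftAllocFailed.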

Hi @apunjani, thanks for the quick response.

I tried watch -n 1 and didn’t see the memory usage go above 57%. I assumed I had missed the peak, so I tried watch -n 0.2, and when the job failed only 51% of the memory was in use. Is it worth trying to find a smaller movie (all of our experimental data is larger than the tutorial data), or is this definitely a memory issue?

I think another user actually confirmed just today that if you change the “Output F-crop factor” in patch motion to 3/4 or 1/2, the memory footprint is reduced and you can process the data successfully.
This will Fourier-crop (i.e. downsample, without aliasing) the micrographs as they are being motion corrected. If you are collecting in super-resolution mode, your raw pixel size is typically very small and you will want to F-crop by 1/2 anyway.
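
For anyone unfamiliar with the term, Fourier cropping keeps only the low-frequency part of the image’s Fourier transform, which yields a smaller image (larger pixel size) without aliasing. A rough numpy sketch of the idea, not the actual cryoSPARC implementation:

import numpy as np

def fourier_crop(img, factor):
    # Downsample a 2D image by `factor` (e.g. 0.5) by keeping only the
    # central (low-frequency) block of its Fourier transform.
    ny, nx = img.shape
    new_ny, new_nx = int(ny * factor), int(nx * factor)
    F = np.fft.fftshift(np.fft.fft2(img))
    cy, cx = ny // 2, nx // 2
    crop = F[cy - new_ny // 2 : cy - new_ny // 2 + new_ny,
             cx - new_nx // 2 : cx - new_nx // 2 + new_nx]
    # scale so the mean intensity of the output matches the input
    return np.fft.ifft2(np.fft.ifftshift(crop)).real * factor ** 2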

I tried decreasing the “Output F-crop factor” all the way down to 1/4 and I still get the error. I also tried 3/4 and 1/2. The error from the 1/4 run is copied below.

Importing job module for job type patch_motion_correction_multi...

Job ready to run

***************************************************************

Job will process this many movies:  1

parent process is 15629

Calling CUDA init from 15686

-- 0.0: processing J25/imported/14sep05c_c_00003gr_00014sq_00011hl_00003es.frames.tif
        loading /troll/scratch/houser/cryosparc/P26/J25/imported/14sep05c_c_00003gr_00014sq_00011hl_00003es.frames.tif
        Loading raw movie data from J25/imported/14sep05c_c_00003gr_00014sq_00011hl_00003es.frames.tif ...
        Done in 2.26s
        Loading gain data from J25/imported/norm-amibox05-0.mrc ...
        Done in 0.05s
        Processing ...
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 53, in stage_target
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 146, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 149, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 150, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed

Hi @posertinlab,

Can you confirm the detector used to capture these movies? At this point, you might be limited by the GPU you are using. Do you have access to GPUs with more memory (1080Ti, Titan X, V100, etc)?

Hi @apunjani

I had a similar issue where my patch motion correction jobs would crash. I have only one GTX 1080 GPU with 8 GB of memory. I tried many different combinations of the parameters mentioned in other tickets, and the job always fails.
However, when I run a full-frame motion correction job with all default parameters, it completes successfully.

Is the memory handling different between these two motion correction methods?

Hi @prash,

Yes, patch-based motion correction has different (larger) memory requirements than full-frame motion correction. Full-frame motion only computes a single rigid motion trajectory across the whole micrograph, whereas patch motion estimates the rigid trajectory as well as a local anisotropic motion field representing the deformation of the sample in the ice.
On an 8 GB card it is currently not possible to run patch motion on K2 super-resolution or K3 data, though a 4k x 4k Falcon movie (with fewer than ~50 frames) should fit. We generally recommend cards with 11 GB or more.
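
As a rough back-of-envelope (my own assumptions about the data sizes, not an exact model of the allocator), the dominant term is the raw frame stack held in single precision, before counting FFT plans, scratch buffers, and the gain reference:

def frame_stack_gb(nx, ny, n_frames, bytes_per_pixel=4):
    # Size of the raw frame stack alone, assuming float32 frames on the GPU.
    return nx * ny * n_frames * bytes_per_pixel / 1024.0 ** 3

print(frame_stack_gb(7676, 7420, 40))   # K2 super-res, ~40 frames: ~8.5 GB
print(frame_stack_gb(4096, 4096, 50))   # 4k x 4k Falcon, 50 frames: ~3.1 GB

So a K2 super-resolution stack alone is already close to 8 GB, while a 4k x 4k movie leaves plenty of headroom on the same card.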

Hi @apunjani, I am running a local motion correction job on my workstation with 4 GTX 1080 Ti GPUs. It was running fine but failed with “cufftAllocFailed” after ~2300 movies. As far as I can tell, the only likely cause is running out of memory. Is there any way to continue the job from where it failed rather than clearing it and starting all over? Thanks.

Hi @Ricky, the failed job should have movies and movies_incomplete outputs.

In the Job Details panel for the failed local motion correction job, select “Mark Job as Complete”. Create a new local motion correction job with the movies_incomplete output and the same particles used in the previous job as inputs, and run it to completion.

Then, for the next job that requires those motion-corrected particles, provide the particles from both jobs as input.

Let me know how that goes,

Nick

Hi @Nick, after “Mark Job as Complete”, the failed job shows 0 counts in its particles, movies and movies_incomplete outputs, even though it had processed 2324 of 3905 movies.

When I created a new job with the “particles” from the original input of the failed job and the “movies_incomplete” output as inputs, it failed again with the following message.

Thanks,

[CPU: 182.7 MB] Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_local.py", line 542, in cryosparc2_compute.jobs.motioncorrection.run_local.run_local_motion_correction
AttributeError: 'NoneType' object has no attribute 'get_items'

Hi @Ricky, my mistake: Local Motion Correction unfortunately does not currently have the ability to save partial results the way some other jobs do. You won’t be able to recover the already-processed data.

Instead I suggest using the “Exposure Sets Tool” job to break up your movies into multiple batches and run Local Motion Correction individually on each one (you should be able to use the same particles as input for each new Local Motion Correction job). That way if one of the batches fails you won’t have lost as much work.
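
The Exposure Sets Tool does the splitting in the interface, but conceptually it is just dividing the exposure list into fixed-size batches, something like this sketch (a hypothetical helper, not a cryoSPARC API):

def split_into_batches(exposures, batch_size=500):
    # Split a list of exposures into consecutive batches of at most batch_size.
    return [exposures[i:i + batch_size]
            for i in range(0, len(exposures), batch_size)]

# e.g. 3905 movies -> 8 batches, each run as its own Local Motion Correction job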

Let me know if you need more help with this.