cufftAllocFailed during Patch motion correction (multi)


#1

I am trying patch motion correction on the movies from the cryoSPARC tutorial. When I run the job, I get the error below. The only other topic I have seen on this issue involved other users running jobs on the same node, which is not the case for us (I am currently the only user on this machine).

We have 4 RTX 2080s with CUDA 10.1, driver version 418.87.00. I do not think the issue is with CUDA, since patch CTF works just fine, but I am happy to be proven wrong.

Job will process this many movies:  20

parent process is 18714

Calling CUDA init from 18778

Calling CUDA init from 18779

-- 1.0: processing J1/imported/14sep05c_00024sq_00003hl_00002es.frames.tif
        loading /troll/scratch/houser/cryosparc/P26/J1/imported/14sep05c_00024sq_00003hl_00002es.frames.tif
        Loading raw movie data from J1/imported/14sep05c_00024sq_00003hl_00002es.frames.tif ...
        Done in 2.70s
        Loading gain data from J1/imported/norm-amibox05-0.mrc ...
        Done in 0.07s
        Processing ...

-- 0.0: processing J1/imported/14sep05c_00024sq_00003hl_00005es.frames.tif
        loading /troll/scratch/houser/cryosparc/P26/J1/imported/14sep05c_00024sq_00003hl_00005es.frames.tif
        Loading raw movie data from J1/imported/14sep05c_00024sq_00003hl_00005es.frames.tif ...
        Done in 2.61s
        Loading gain data from J1/imported/norm-amibox05-0.mrc ...
        Done in 0.05s
        Processing ...

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 53, in stage_target
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 146, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 149, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 150, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 53, in stage_target
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 146, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 149, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 150, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed

#2

Sorry, just found the troubleshooting guidelines.

cryoSPARC version info:

$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/cryosparc/cryosparc2/cryosparc2_master
Current cryoSPARC version: v2.11.0
----------------------------------------------------------------------------

cryosparcm process status:

app                              STOPPED   Not started
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 13125, uptime 4:10:40        
command_proxy                    RUNNING   pid 13166, uptime 4:10:37        
command_rtp                      STOPPED   Not started
command_vis                      RUNNING   pid 3356, uptime 0:00:04
database                         RUNNING   pid 13039, uptime 4:10:42        
watchdog_dev                     STOPPED   Not started
webapp                           RUNNING   pid 13181, uptime 4:10:36        
webapp_dev                       STOPPED   Not started

----------------------------------------------------------------------------

global config variables:

export CRYOSPARC_LICENSE_ID="xxxx"
export CRYOSPARC_MASTER_HOSTNAME="troll.ohsu.edu"
export CRYOSPARC_DB_PATH="/home/cryosparc/cryosparc2/cryosparc2_database"   
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false

CUDA version info:
export CRYOSPARC_CUDA_PATH=/usr/local/cuda-10.1

OS: CentOS 7.7.1908
This cryoSPARC installation is running as a standalone

I’m on the latest version of cryoSPARC and have restarted both cryoSPARC and the machine. I have also updated the GPU drivers. As I said, the only other report of this issue I found was resolved by making sure no other users were running jobs, which is already the case here.


#3

Hi @posertinlab,
Thanks for reporting. Can you try running the same job while monitoring
watch -n 1 nvidia-smi
which should report the GPU memory usage before the job fails? The error cufftAllocFailed means that the GPU has run out of memory: your RTX 2080 GPUs have only 8 GB each, and the tutorial data is K2 super-resolution, which may require more.
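
If watch misses the peak, an alternative is to log per-GPU memory usage to a file at a shorter interval while the job runs (a sketch using standard nvidia-smi query flags; the 200 ms interval and file name are just examples):

# Append per-GPU memory usage every 200 ms until interrupted with Ctrl-C
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -lms 200 > gpu_mem.csv

The maximum of the memory.used column then shows how close the job got to the 8 GB limit before failing.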


#4

Hi @apunjani, thanks for the quick response.

I tried watch -n 1 and didn’t see memory usage go above 57%. Assuming I had missed the peak, I tried watch -n 0.2, and when the job failed only 51% of the memory was in use. Is it worth trying to find a smaller movie (all of our experimental data is larger than the tutorial data), or is it certainly a memory issue?
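
If it would help narrow this down, I could also try a minimal plan allocation outside cryoSPARC. Here is a sketch of what I have in mind (assuming the worker’s pycuda/scikit-cuda environment is importable; the 7676 x 7420 frame size is my guess at the K2 super-resolution dimensions and may need adjusting):

# Build a 2D R2C cuFFT plan of roughly one super-res frame and run one transform;
# if this also raises cufftAllocFailed, plan allocation alone exhausts GPU memory.
import numpy as np
import pycuda.autoinit                      # initializes the first visible GPU
from pycuda import gpuarray
from skcuda import fft

shape = (7676, 7420)                        # assumed K2 super-res frame size
plan = fft.Plan(shape, np.float32, np.complex64, batch=1)
x = gpuarray.zeros(shape, dtype=np.float32)
xf = gpuarray.zeros((shape[0], shape[1] // 2 + 1), dtype=np.complex64)
fft.fft(x, xf, plan)                        # one forward transform
print("R2C plan creation and one transform succeeded")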


#5

I think another user actually confirmed just today that if you change the “Output F-crop factor” in patch motion to 3/4 or 1/2, the memory footprint is reduced and the data can be processed successfully.
This will Fourier-crop (i.e. downsample, without aliasing) the micrographs as they are being motion corrected. If you are collecting in super-resolution mode, your raw pixel size is typically very small and you will generally want to F-crop by 1/2 anyway.
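
For intuition, F-cropping keeps only the central (low-frequency) region of the image’s Fourier transform, so the output has fewer pixels but no aliasing. A minimal NumPy sketch of the idea (not cryoSPARC’s GPU implementation; assumes even output dimensions):

import numpy as np

def fourier_crop(img, factor):
    # Downsample a 2D image by keeping the central region of its shifted FFT
    # and transforming back.
    ny, nx = img.shape
    ny2, nx2 = int(ny * factor), int(nx * factor)
    F = np.fft.fftshift(np.fft.fft2(img))
    cy, cx = ny // 2, nx // 2
    Fc = F[cy - ny2 // 2 : cy + ny2 - ny2 // 2,
           cx - nx2 // 2 : cx + nx2 - nx2 // 2]
    # Rescale so mean intensity is preserved after the inverse transform
    return np.fft.ifft2(np.fft.ifftshift(Fc)).real * float(ny2 * nx2) / (ny * nx)

small = fourier_crop(np.random.rand(512, 512).astype(np.float32), 0.5)  # 256 x 256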


#6

I tried 3/4 and 1/2, and decreased the “Output F-crop factor” all the way down to 1/4, but I still get the error. The log from the 1/4 run is copied below.

Importing job module for job type patch_motion_correction_multi...

Job ready to run

***************************************************************

Job will process this many movies:  1

parent process is 15629

Calling CUDA init from 15686

-- 0.0: processing J25/imported/14sep05c_c_00003gr_00014sq_00011hl_00003es.frames.tif
        loading /troll/scratch/houser/cryosparc/P26/J25/imported/14sep05c_c_00003gr_00014sq_00011hl_00003es.frames.tif
        Loading raw movie data from J25/imported/14sep05c_c_00003gr_00014sq_00011hl_00003es.frames.tif ...
        Done in 2.26s
        Loading gain data from J25/imported/norm-amibox05-0.mrc ...
        Done in 0.05s
        Processing ...
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 53, in stage_target
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 146, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 149, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 150, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed

#7

Hi @posertinlab,

Can you confirm which detector was used to capture these movies? At this point, you might be limited by the GPU you are using. Do you have access to GPUs with more memory (1080 Ti, Titan X, V100, etc.)?
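
To list how much memory each card has, something along these lines should work (standard nvidia-smi query flags):

nvidia-smi --query-gpu=index,name,memory.total --format=csv

The cards mentioned above all have 11 GB or more, compared to 8 GB on the RTX 2080.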