Local resolution - skcuda_internal.cufft.cufftInternalError

Hi all,

When running a Local Resolution estimation job, I get the traceback below. All parameters are default.
The job runs if CPU is specified instead, albeit slowly.

[CPU: 1006.8 MB] Using local box size of 96 voxels.
[CPU: 1006.8 MB] Using zeropadded box size of 192 voxels.
[CPU: 1006.8 MB] Using step size of 1 voxels.
[CPU: 1006.8 MB] Using FSC threshold of 0.500.
[CPU: 1.05 GB]   Started computing local resolution estimates.
[CPU: 1.05 GB]   Number of voxels to compute: 3628483
[CPU: 1.56 GB]   Traceback (most recent call last):
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 788, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError

Current cryoSPARC version: v3.2.0

The job fails:

  • either using SLURM or using cryoSPARC’s default queue
  • on nodes with 32, 128, or 384 GB of RAM
  • on GTX 1080, RTX 2070 SUPER, and RTX 2080 Ti cards
  • using NVIDIA driver 460 + CUDA 11.2 (Ubuntu 18)
  • using NVIDIA driver 465 + CUDA 11.3 (Ubuntu 18, Ubuntu 20)
    and various combinations thereof.

All other cryoSPARC job types seem to run fine on all machines and on the SLURM nodes.

Further logs:

$>cryosparcm joblog P85 J105
Traceback (most recent call last):
  File "/soft/cryosparc_hodgkin/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/soft/cryosparc_hodgkin/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/soft/cryosparc_hodgkin/cryosparc_master/cryosparc_compute/client.py", line 89, in <module>
    print(eval("cli."+command))
  File "<string>", line 1, in <module>
  File "/soft/cryosparc_hodgkin/cryosparc_master/cryosparc_compute/client.py", line 62, in func
    assert False, res['error']
AssertionError: {'code': 500, 'data': None, 'message': "OtherError: argument of type 'NoneType' is not iterable", 'name': 'OtherError'}
$>cryosparcm log command_core
[trimmed - in this case the job is being sent to a GTX 1080 with cryoSPARC's "Select GPU" option]
[More logs are available for the other configurations, but the full set is quite hefty to attach]

[EXPORT_JOB] : Request to export P83 J105
[EXPORT_JOB] :    Exporting job to /mnt/DATA/andrea/AM_20210709_XXX/P83/J105
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/DATA/andrea/AM_20210709_XXX/P83/J105/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P83 J105 in 0.01s
---------- Scheduler running --------------- 
Jobs Queued:  [('P83', 'J105')]
Licenses currently active : 0
Now trying to schedule J105
  Need slots :  {'CPU': 2, 'GPU': 1, 'RAM': 1}
  Need fixed :  {'SSD': False}
  Master direct :  False
   Running job directly on GPU id(s): [0] on hopper.rhpc.nki.nl
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P83.J105 status launched
      Running project UID P83 job UID J105 
        Running job on worker type node
        Running job using:  /soft/cryosparc_hodgkin/cryosparc_worker/bin/cryosparcw
        Running job on remote worker node hostname hopper.rhpc.nki.nl
        cmd: bash -c "nohup /soft/cryosparc_hodgkin/cryosparc_worker/bin/cryosparcw run --project P83 --job J105 --master_hostname hodgkin.rhpc.nki.nl --master_command_core_port 39002 > /mnt/DATA/andrea/AM_20210709_XXX/P83/J105/job.log 2>&1 & "

---------- Scheduler finished --------------- 
Changed job P83.J105 status started
Changed job P83.J105 status running
Changed job P83.J105 status failed

The database logs seem normal (they are quite hefty to attach though).
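
If it helps with triage, I could also run a standalone cuFFT plan-creation test along these lines (my own sketch, assuming the public scikit-cuda package can be imported in the worker environment; cryoSPARC itself bundles its own copy under skcuda_internal, so this is not cryoSPARC's exact code path) to see whether plain cuFFT can even build a 192^3 plan on these cards:

# Hypothetical standalone check, run after `eval $(cryosparc_worker/bin/cryosparcw env)`
# so that the matching CUDA libraries are on the path. Assumes the public
# scikit-cuda package ("pip install scikit-cuda") is available; cryoSPARC's own
# bundled copy lives under cryosparc_compute.skcuda_internal instead.
import numpy as np
import pycuda.autoinit              # creates a CUDA context on GPU 0
import pycuda.gpuarray as gpuarray
from skcuda import fft as cu_fft

N = 192                             # zero-padded box size reported in the job log
x = gpuarray.to_gpu(np.random.rand(N, N, N).astype(np.complex64))
y = gpuarray.empty((N, N, N), np.complex64)

# Plan creation is where the job dies (cufftMakePlanMany -> cufftInternalError)
plan = cu_fft.Plan((N, N, N), np.complex64, np.complex64)
cu_fft.fft(x, y, plan)
print("3D C2C FFT of a %d^3 volume succeeded" % N)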

I can provide more info if necessary.

Thanks for your help, this one is a head scratcher!
Best,

Andrea

Hi @Andrea,
Thanks for the very clear and detailed post!
Given the extent of your testing, it’s likely there is a bug in the Local Resolution job. Can you confirm:

  • Does the error only happen on a specific dataset? If you run Local Resolution on a different project, does it work? What are the details of the dataset where it does fail (box size, average global resolution, pixel size, etc.)?
  • Does the error appear regardless of parameters? If you change things like the step size, the window factor, or other parameters of the Local Resolution job, can the job run to completion?

Hi @apunjani,

It does indeed seem that this is a “bad” dataset.

I have tried running Local Resolution jobs for other datasets, and they run fine. This is on the same machines, immediately before or after the bad dataset fails.
The ‘bad’ dataset has box size 256, pixel size 0.836 (0.413, downsampled 2x), and a global resolution of ~6.5 Å. The other, ‘successful’ datasets have the same pixel size, global resolutions in the 4.5-7.5 Å range, and box sizes of 256-420.

For some mysterious reason, the traceback on the bad dataset is now complaining about CUDA memory allocation (see the log below). This is on the exact same 2080 Ti that ran the box-420 dataset; the good box-256 dataset runs on this card as well, so once again this points to a “bad” dataset. I have no clue what is wrong with those particles, though…

[CPU: 1010.0 MB] Using local box size of 96 voxels.
[CPU: 1010.0 MB] Using zeropadded box size of 192 voxels.
[CPU: 1010.0 MB] Using step size of 1 voxels.
[CPU: 1010.0 MB] Using FSC threshold of 0.500.
[CPU: 1.06 GB]   Started computing local resolution estimates.
[CPU: 1.06 GB]   Number of voxels to compute: 3628483

[CPU: 1.62 GB]   Traceback (most recent call last):
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 788, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError

[CPU: 1.62 GB]   Traceback (most recent call last):
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 791, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
  File "/soft/cryosparc_hodgkin/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 1066, in zeros
    result = GPUArray(shape, dtype, allocator, order=order)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
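
For what it's worth, I can also report the free GPU memory right before launching the job. A quick check with the pycuda that ships in the worker environment (my own snippet, nothing cryoSPARC-specific) would look like this; nvidia-smi of course reports the same numbers:

# Run inside `eval $(cryosparc_worker/bin/cryosparcw env)`; prints free/total GPU memory.
import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)                 # the GPU the job is sent to
ctx = dev.make_context()
try:
    free, total = cuda.mem_get_info()
    print("free: %.2f GB / total: %.2f GB" % (free / 1e9, total / 1e9))
finally:
    ctx.pop()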

I have a similar error with Naive Local Refinement (Legacy) jobs with non-uniform refinement enabled, after upgrading cryoSPARC from v2.15 to v3.2 (and CUDA from 10.x to 11.4). The error seems to appear at the local decomposition step; perhaps the problem is similar to @Andrea’s problem with the Local Resolution job (I don’t know how similar the two things are!). I used the same parameters that I had previously (successfully) used for naive local refinement of a different particle subset from the same dataset on cryoSPARC v2.15 and CUDA 10.

box size: 360 px
pixel size: 0.834 Ang/px

Non-default parameters:
Fulcrum = 164, 165, 93
Local shift search extent = 10 px
Local rotation search extent = 25 px
Use NU-refinement = True
Dynamic mask threshold = 0.1
Dynamic mask near = 6
Dynamic mask far = 12

> [CPU: 4.98 GB]     Using previous iteration scale factors for each particle during alignment
> [CPU: 4.98 GB]     Current alpha values  (  0.52 |  0.88 |  1.02 |  1.12 |  1.62 ) 
> [CPU: 4.98 GB]     Using best alpha for reconstruction in each iteration
> [CPU: 4.98 GB]   -- DEV 0 THR 0 NUM 1500 TOTAL 10.618029 ELAPSED 54.741592 --
> [CPU: 8.31 GB]     Processed 11634 images in 59.343s.
> [CPU: 8.70 GB]     Computing FSCs... 
> [CPU: 9.39 GB]       Done in 66.214s
> [CPU: 9.39 GB]     Using Filter Radius 35.458 (8.467A) | Previous: 15.012 (20.000A)
> [CPU: 11.65 GB]  Local decomposition...
> [CPU: 16.13 GB]  Traceback (most recent call last):
>   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
>     run_old(*args, **kw)
>   File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
>   File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
>   File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 788, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
>   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
>     onembed, ostride, odist, self.fft_type, self.batch)
>   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
>     cufftCheckStatus(status)
>   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
>     raise e
> cryosparc_compute.skcuda_internal.cufft.cufftInternalError

Accidentally using slightly incorrect fulcrum coordinates (I cloned the above job and forgot to change them) on a different particle subset, I also got the memory allocation error that @Andrea posted.

Curious!

> > [CPU: 4.99 GB]     Using previous iteration scale factors for each particle during alignment
> > [CPU: 4.99 GB]     Current alpha values  (  0.49 |  0.89 |  1.02 |  1.12 |  1.55 ) 
> > [CPU: 4.99 GB]     Using best alpha for reconstruction in each iteration
> > [CPU: 4.99 GB]   -- DEV 1 THR 0 NUM 1930 TOTAL 18.612077 ELAPSED 84.984036 --
> > [CPU: 8.31 GB]     Processed 13721 images in 88.492s.
> > [CPU: 8.70 GB]     Computing FSCs... 
> > [CPU: 9.40 GB]       Done in 54.778s
> > [CPU: 9.40 GB]     Using Filter Radius 35.423 (8.476A) | Previous: 15.012 (20.000A)
> > [CPU: 11.66 GB]  Local decomposition...
> > [CPU: 16.14 GB]  Traceback (most recent call last):
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
> >     run_old(*args, **kw)
> >   File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
> >   File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
> >   File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 773, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 1066, in zeros
> >     result = GPUArray(shape, dtype, allocator, order=order)
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
> >     self.gpudata = self.allocator(self.size * self.dtype.itemsize)
> > pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
> > 
> > 
> > [CPU: 16.14 GB]  Traceback (most recent call last):
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
> >     run_old(*args, **kw)
> >   File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
> >   File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
> >   File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 788, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
> >     onembed, ostride, odist, self.fft_type, self.batch)
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
> >     cufftCheckStatus(status)
> >   File "/home/emuser/emtools/cryosparc2/cryosparc2_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
> >     raise e
> > cryosparc_compute.skcuda_internal.cufft.cufftInternalError

Hi @apunjani,

I’ve been getting the same error in some heterogeneous refinement jobs. My particles have a box size of 600 and a pixel size of 0.825 Å/px, and all parameters are default.

[CPU: 1.52 GB] Done iteration 51 in 26.323s. Total time so far 1067.666s
[CPU: 1.52 GB] -- Iteration 52
[CPU: 1.52 GB] Batch size 2000
[CPU: 1.52 GB] Using Alignment Radius 44.983 (11.004A)
[CPU: 1.52 GB] Using Reconstruction Radius 63.000 (7.857A)
[CPU: 1.52 GB] Number of BnB iterations 4
[CPU: 1.52 GB] -- DEV 0 THR 1 NUM 500 TOTAL 11.655135 ELAPSED 11.935647 --
[CPU: 1.52 GB] Processed 1000.000 images with 2 models in 12.506s.
[CPU: 1.52 GB] Engine Started.

[CPU: 2.87 GB] Traceback (most recent call last):
  File "/nfs/hms/Kahne/dk-cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1811, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1028, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 107, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "cryosparc_worker/cryosparc_compute/engine/gfourier.py", line 32, in cryosparc_compute.engine.gfourier.fft2_on_gpu_inplace
  File "/nfs/hms/Kahne/dk-cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/nfs/hms/Kahne/dk-cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/nfs/hms/Kahne/dk-cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
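
For what it's worth, here is a rough back-of-envelope of my own (not numbers from cryoSPARC) for what one 500-image chunk of box-600 particles costs on the GPU at the 2D FFT step:

# Rough estimate only; assumes complex64 images and the 500-image chunks
# visible in the streamlog ("NUM 500"). cuFFT additionally allocates a work
# area that is typically on the same order as the data itself.
box = 600            # particle box size (px)
n_per_chunk = 500    # images per GPU chunk, from the log above
bytes_c64 = 8        # bytes per complex64 sample

per_image = box * box * bytes_c64
per_chunk = n_per_chunk * per_image
print("one image: %.2f MB, one chunk: %.2f GB" % (per_image / 1e6, per_chunk / 1e9))
# -> one image: 2.88 MB, one chunk: 1.44 GB (before the cuFFT work area)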

@kpahil Please let us know:

  • cryoSPARC version
  • the kernel version: uname -a
  • version of the CUDA toolkit: grep CUDA /nfs/hms/Kahne/dk-cryosparc/cryosparc_worker/config.sh
  • pycuda version:
    eval $(/nfs/hms/Kahne/dk-cryosparc/cryosparc_worker/bin/cryosparcw env)
    python -c "import pycuda.driver; print(pycuda.driver.get_version())"
  • GPU model and driver version (from nvidia-smi)

cryoSPARC version: v3.3.1+211214
Kernel version: 3.10.0-1160.45.1.el7.x86_64 #1 SMP x86_64 GNU/Linux
The command for getting the CUDA toolkit version just showed that my CUDA path is /usr/local/cuda; nvidia-smi says the CUDA version is 11.2.
pycuda version: (11, 2, 0)
Driver version: 460.84
GPU: RTX 2080 Ti

Thank you!

Thanks for the info; this configuration looks reasonable and does not immediately suggest a cause for the error.
I wonder if, even though heterogeneous refinement may downsample “on-the-fly”, the error persists if you explicitly downsampled (to a box size of 300, perhaps) your particles before input to this job.
Also:

  • Does the file <project_dir>/<job_dir>/job.json list additional errors in the "errors_run" section? (A quick way to check is sketched after this list.)
  • Does the output of /usr/local/cuda/bin/nvcc --version confirm release 11.2?
  • How many classes were specified for the failing refinement jobs?
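
A minimal way to check the "errors_run" field, assuming the standard job document layout (replace <project_dir>/<job_dir> with your actual job directory):

# Hypothetical snippet; adjust the path to the real project and job directories.
import json

with open("<project_dir>/<job_dir>/job.json") as f:
    job = json.load(f)
print(job.get("errors_run"))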

I wonder if, even though heterogeneous refinement may downsample “on-the-fly”, the error persists if you explicitly downsampled (to a box size of 300, perhaps) your particles before input to this job.

That’s what I ended up doing, and it did work. It’s not entirely ideal, since I started hitting the Nyquist limit.
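
(For reference, the arithmetic behind that, using my own numbers from above: downsampling scales the pixel size by the box-size ratio, and the attainable resolution is limited to twice the new pixel size.)

# Nyquist limit after Fourier-crop downsampling (numbers from my posts above).
orig_box, new_box = 600, 300
orig_apix = 0.825                          # Angstrom per pixel
new_apix = orig_apix * orig_box / new_box  # pixel size doubles
print("new pixel size: %.3f A/px, Nyquist: %.2f A" % (new_apix, 2 * new_apix))
# -> new pixel size: 1.650 A/px, Nyquist: 3.30 A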

  • Does the file <project_dir>/<job_dir>/job.json list additional errors in the "errors_run" section?

The "errors_run" field just says false (?!?). The job.log contains the following:

========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
exception in force_free_cufft_plan:
exception in force_free_cufft_plan:
exception in cufft.Plan.del:
exception in cufft.Plan.del:
exception in cufft.Plan.del:
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
exception in force_free_cufft_plan:
exception in force_free_cufft_plan:
exception in cufft.Plan.del:
exception in cufft.Plan.del:
exception in cufft.Plan.del:
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat

After a fair number of heartbeats this is how it ends:

**custom thread exception hook caught something
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

  • Does the output of /usr/local/cuda/bin/nvcc --version confirm release 11.2?

Yes

  • How many classes were specified for the failing refinement jobs?

2

Thank you!!