2d classification fails with error cryosparc_compute.skcuda_internal.cufft.cufftInternalError

hrutledg · February 18, 2023, 10:08am

I can run small 2D classification jobs fine, but a job with 10,000 particles fails. If I split the 10,000 particles into 10 stacks of 1,000, each stack runs through 2D classification without problems. I also ran the check particles job, and it looks like there aren’t any corrupted particles.

Here’s the traceback:

Traceback (most recent call last):
  File "/home/dobby/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2061, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1028, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 107, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "cryosparc_master/cryosparc_compute/engine/gfourier.py", line 32, in cryosparc_compute.engine.gfourier.fft2_on_gpu_inplace
  File "/home/dobby/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 132, in __init__
    self.worksize = cufft.cufftMakePlanMany(
  File "/home/dobby/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/dobby/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError

I’m using CryoSPARC v4.1.2.

hrutledg · February 18, 2023, 10:11am

Here’s an image of the job tree where I was troubleshooting.

hrutledg · February 20, 2023, 12:43pm

We figured out the issue: another (non-CryoSPARC) process hadn’t released its GPU memory, so this was just a GPU memory problem.

All is good now.
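
For anyone hitting the same error, here is a minimal sketch of one way to see which processes are holding GPU memory before launching a job. It assumes `nvidia-smi` is on the PATH of the worker node and uses its `--query-compute-apps` CSV output. The function names `parse_compute_apps` and `list_gpu_processes` are made up for this example and are not part of CryoSPARC.

```python
import csv
import io
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse CSV text from
    `nvidia-smi --query-compute-apps=pid,process_name,used_memory
     --format=csv,noheader,nounits` into a list of dicts."""
    apps = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used = (field.strip() for field in row)
        if not pid.isdigit():
            continue  # not a process row
        # used_memory may be reported as "[N/A]" on some setups
        used_mib = int(used) if used.isdigit() else None
        apps.append({"pid": int(pid), "name": name, "used_mib": used_mib})
    return apps


def list_gpu_processes():
    """Return the compute processes currently holding GPU memory.

    Returns an empty list if nvidia-smi is not installed (assumption:
    an NVIDIA driver with nvidia-smi is present on the worker)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for app in list_gpu_processes():
        print(app)
```

If a process that isn’t a CryoSPARC job shows up with a large `used_mib` value, that is a likely candidate for the kind of leftover GPU memory usage described above.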