cryosparc_compute.skcuda_internal.cufft.cufftInternalError

Hi all:
When running extract_micrographs_multi and 2D classification jobs, I got the following traceback:
Traceback (most recent call last):
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/jobs/runcommon.py", line 1860, in run_with_except_hook
run_old(*args, **kw)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/jobs/pipeline.py", line 86, in stage_target
work = processor.exec(item)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/jobs/pipeline.py", line 43, in exec
return self.process(item)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/jobs/extract/run.py", line 470, in process
update_alignments3D=update_alignments3D)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/jobs/extract/extraction_gpu.py", line 141, in do_extract_particles_single_mic_gpu
stream=stream)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
onembed, ostride, odist, self.fft_type, self.batch)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError

Current cryoSPARC version: v3.3.1
on NVIDIA A100
Using NVIDIA driver 460 + CUDA 11.2
Linux version 3.10.0-862.el7.x86_64 (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) )

I can provide more info if necessary.

Thanks for your help!
Best,

Kuangyi

Welcome to the forum @Kuangyi .
Could you please post the following for the failed extraction job:

  • job parameters
  • job log (cryosparcm joblog ..., guide)

Hi, the parameters are:
extract_micrographs_multi job:
Extraction box size (pix): 256
Fourier crop to box size (pix) : 128
Number of GPUs to parallelize : 4
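
For anyone unsure what the two box-size parameters above mean together: the particle is cut out at 256 pixels and then downsampled to 128 pixels by discarding high spatial frequencies. The snippet below is only a NumPy sketch of that general idea, assuming a square, even-sized box; `fourier_crop` is a hypothetical helper written for illustration and is not cryoSPARC's GPU implementation.

```python
import numpy as np

def fourier_crop(box, out_size):
    """Downsample a square 2D array by keeping only its central (low) frequencies.

    Hypothetical helper for illustration; assumes even input and output sizes.
    """
    n = box.shape[0]
    if box.shape != (n, n) or n % 2 or out_size % 2 or out_size > n:
        raise ValueError("expected a square, even-sized box and an even out_size <= box size")
    # Move the zero-frequency term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(box))
    # Keep the central out_size x out_size block of frequencies.
    start = (n - out_size) // 2
    cropped = spectrum[start:start + out_size, start:start + out_size]
    # Transform back and rescale so the mean intensity is unchanged.
    small = np.fft.ifft2(np.fft.ifftshift(cropped)) * (out_size / n) ** 2
    return small.real

# Stand-in for one extracted particle (random noise, not real data).
rng = np.random.default_rng(0)
particle = rng.normal(size=(256, 256))
binned = fourier_crop(particle, 128)
```

With these settings the output box covers the same area with half as many pixels per side, so the effective pixel size doubles.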

JOB LOG:
extract.run cryosparc_compute.jobs.jobregister


HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
**custom thread exception hook caught something
**** handle exception rc
set status to failed
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: divide by zero encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: divide by zero encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: divide by zero encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: divide by zero encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: invalid value encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: divide by zero encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
/cache_ssd/opt/cryo_sparc/cryosparc_worker_gpu03/cryosparc_compute/micrographs.py:400: RuntimeWarning: divide by zero encountered in true_divide
return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
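
A note on the RuntimeWarning lines in the log above: they are ordinary NumPy warnings produced when an element-wise division hits a zero in the divisor, and they are separate from the cuFFT exception. The sketch below reproduces the two warning types with made-up arrays; the names `numerator` and `weights` are hypothetical stand-ins and are not taken from micrographs.py.

```python
import warnings
import numpy as np

# Made-up values: the first two divisor entries are exactly zero.
numerator = np.array([1.0, 0.0, 2.0, 3.0])
weights = np.array([0.0, 0.0, 1.0, 2.0])

# Plain division: x/0 gives inf ("divide by zero encountered"),
# 0/0 gives nan ("invalid value encountered").
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    naive = numerator / weights

for w in caught:
    print(w.category.__name__, w.message)

# One way to avoid the warnings: divide only where the divisor is non-zero
# and leave the remaining entries at a chosen fill value (0 here).
safe = np.divide(numerator, weights,
                 out=np.zeros_like(numerator),
                 where=weights != 0)
```

Whether the resulting inf/nan values matter depends on whether those pixels are used later, which this thread does not show.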

I changed the number of GPUs from 4 to 2, and the job might succeed now.

Did the job succeed with two GPUs?