cufftInternalError during 2D classification

Hi, I have a couple more questions about 2D classification jobs. I am running version 4.0.1 as well, have 4 RTX 2080 Ti GPUs, and used two of them for the job. I got this error message:
[CPU: 15.73 GB]

Traceback (most recent call last):
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1925, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1028, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 107, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "cryosparc_worker/cryosparc_compute/engine/gfourier.py", line 32, in cryosparc_compute.engine.gfourier.fft2_on_gpu_inplace
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
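For context on the error above: cufftInternalError is a generic cuFFT failure code, and one common trigger (an assumption here, not confirmed for this particular job) is that plan creation runs out of GPU memory for large, batched 2D transforms. A rough back-of-envelope sketch of how the footprint scales with box size; the 2x factor and batch size are illustrative, not CryoSPARC's actual allocation:

```python
import numpy as np

def fft_plan_footprint_bytes(box: int, batch: int, dtype=np.complex64) -> int:
    """Rough lower bound on memory for a batched 2D FFT of `batch`
    box x box complex images: input + output buffers. cuFFT's own
    plan workspace can add a comparable amount on top of this."""
    per_image = box * box * np.dtype(dtype).itemsize
    return 2 * batch * per_image

# A batch of 500 particles in a 1000 px box needs ~8 GB for the
# buffers alone, before cuFFT workspace is counted -- close to the
# 11 GB on an RTX 2080 Ti:
gb = fft_plan_footprint_bytes(1000, 500) / 1e9
print(f"{gb:.1f} GB")  # 8.0 GB
```

This is why the same job can succeed on a 48 GB RTX A6000 but fail on an 11 GB card, and why smaller boxes or Fourier cropping (see the last post in this thread) avoid the error.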
That never happened before the upgrade.
Also, and I have seen this before, sometimes the class display fails to set up the histogram properly; in some cases (attached) it is all black, or all white without any grey levels.


At the same time, the side display window shows a reasonable image:
cs4.0.1_cl2d_side
During that particular run, the main display window changed from a normal grey-level display to all white, then to all black, back and forth.
Like in @MichaelZ's case, it is a dedicated headless CryoSPARC station.
Thanks,
Michael

Please can you send us the job report for the job that failed with cryosparc_compute.skcuda_internal.cufft.cufftInternalError.

Are these class averages from the job that failed with the cufftInternalError?

Please can you run Check For Corrupt Particles on your input particles with Check for NaN values enabled and, if no corruption is found, re-try 2D classification with Cache particle images on SSD disabled.
Has, by any chance, 2D classification of this dataset succeeded in an earlier version of CryoSPARC?
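For readers unfamiliar with the corruption check suggested above, this is conceptually what a NaN scan does (a minimal numpy sketch, not CryoSPARC's implementation; loading the stack from disk is omitted):

```python
import numpy as np

def find_corrupt_particles(stack: np.ndarray) -> np.ndarray:
    """Return indices of images containing NaN or Inf values.
    `stack` is assumed to have shape (n_particles, box, box)."""
    bad = ~np.isfinite(stack).all(axis=(1, 2))
    return np.flatnonzero(bad)

rng = np.random.default_rng(0)
stack = rng.random((4, 64, 64)).astype(np.float32)
stack[2, 10, 10] = np.nan  # plant one corrupt image
print(find_corrupt_particles(stack))  # [2]
```

A single NaN pixel propagates through every FFT of that image, which is why corrupt particles are a routine suspect for FFT-stage failures.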

Hi @wtempel,
I checked for corrupt particles; there were none. I reran the job without caching on SSD; it failed again.
This is a fairly new dataset; I did not run it in an earlier version of CryoSPARC. But I was able to run it on a different lane in CS 4.0.1 with a different GPU (RTX A6000 instead of RTX 2080 Ti). One more peculiarity: the lane on which the job fails is listed as having an M6000 with 24 GB VRAM while in reality it has 2080 Tis; the workspace display insists on M6000, though. Sometimes job directories are listed incorrectly as well. All of that is probably unrelated to the issue.
Michael

Have there been any hardware swaps, hostname or dhcp server registration changes on this worker since CryoSPARC has first been installed?

Please can you provide additional details? Is there any chance that either

  • more than one CryoSPARC master installations are running on the same server or
  • the project directory in question is being used/controlled by more than one CryoSPARC instance?

No to both: only one instance is running and there is only one installation. Right now that particular job is listed correctly in terms of working directory, but the node information is incorrect. At times I have seen both correct and incorrect info for jobs and nodes with the current CS instance.
Michael

I am still unsure what exactly is going on. Assuming the scheduler targets are hosts, not clusters, I suggest:

  1. Preserve the output of cryosparcm cli "get_scheduler_targets()" for reference later
  2. For each configured worker workerhostname:
    1. cryosparcm cli "remove_scheduler_target_node('workerhostname')" (guide)
    2. ensure cryosparc_worker software was installed using an installation/version of the CUDA toolkit that both is still available at $CRYOSPARC_CUDA_PATH and supports the CUDA devices. (I am not sure which version first supported the A6000 cards.) This needs to be ensured even if the cryosparc_worker directory is shared between multiple worker hosts.
    3. cryosparcw connect [...] (referring to the preserved get_scheduler_targets() output for possible customization/options, guide)
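The steps above can be scripted per worker. The helper below only assembles the command strings from this thread for illustration; it is not part of CryoSPARC, the hostname is a placeholder, and the bracketed options on the connect line must be filled in from the preserved get_scheduler_targets() output:

```python
def reconnect_commands(worker: str) -> list[str]:
    """Build the shell commands from the disconnect/reconnect
    recipe above for one worker host (a sketch, not CryoSPARC API)."""
    return [
        # 1. preserve current targets for reference
        'cryosparcm cli "get_scheduler_targets()"',
        # 2a. remove the stale target
        f"cryosparcm cli \"remove_scheduler_target_node('{worker}')\"",
        # 2c. reconnect; replace [...] with your site-specific options
        f"cryosparcw connect --worker {worker} [...]",
    ]

for cmd in reconnect_commands("workerhostname"):
    print(cmd)
```

Running the remove/connect pair forces CryoSPARC to re-detect the worker's GPUs, which is what corrected the phantom M6000 entry later in this thread.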

Thanks, I will ask my sysadmin to do that; I do not have access to the cli, cryosparcm, or cryosparcw. And yes, the targets are hosts, not clusters.
Michael

Hi @wtempel,
After disconnecting/reconnecting that particular host, the hardware info is correct now, but the job still fails even without caching particles on SSD.
Michael

Just to confirm: is the job still failing with the same

"/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError

error?

Please can you test if setting
export CRYOSPARC_NO_PAGELOCK=true
inside
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/config.sh
would allow the job to complete?
(guide, CUDA memory error during 2D classification - #9 by spunjani)

Hi @wtempel, CRYOSPARC_NO_PAGELOCK=true is already in the config.
Michael

Thanks for confirming, Michael.
Other things you may want to try:

  1. Confirm that the worker’s CUDA toolkit and CryoSPARC are still “in sync”.
  2. Clone the job from above and run it on a single GPU.
  3. If that job also fails (with the same error message?), clone the job from the previous step, but input down-sampled particles instead.

Hello, I had the same issue when using a large box size (1000 px). Re-extracting the particles with a smaller box size (800 px) and setting Fourier crop to box size 400 px solved the issue in 2D classification.
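For anyone wondering what "Fourier crop" does: down-sampling by keeping only the central, low-frequency block of the image's Fourier transform, which shrinks the box (and therefore every FFT plan the job must allocate) while preserving the low-resolution signal. A minimal numpy sketch of the idea, not CryoSPARC's actual implementation:

```python
import numpy as np

def fourier_crop(img: np.ndarray, new_box: int) -> np.ndarray:
    """Down-sample a square image by cropping its centered FFT
    to new_box x new_box and transforming back."""
    n = img.shape[0]
    f = np.fft.fftshift(np.fft.fft2(img))          # DC at the center
    c, h = n // 2, new_box // 2
    cropped = f[c - h:c + h, c - h:c + h]          # keep low frequencies
    out = np.fft.ifft2(np.fft.ifftshift(cropped)).real
    return out * (new_box / n) ** 2                # rescale for size change

big = np.full((800, 800), 3.0)
small = fourier_crop(big, 400)
print(small.shape)  # (400, 400)
```

Going from a 1000 px box to 800 px cropped to 400 px cuts each image's FFT buffer by a factor of (1000/400)^2 ≈ 6, which plausibly explains why the failing cuFFT plan then fits on an 11 GB card.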