Dear all,
I’m running a consensus NU-refinement job on a large dataset of 960k particles in a 560 px box. I’m on CryoSPARC v3.3.1+220315, running on a cluster with NVIDIA V100 (32 GB) GPUs. I collected two datasets of identical grids on two very similar Krios setups with slightly different nominal pixel sizes (0.86 Å/px and 0.85 Å/px, i.e. <2% difference). I forced both to the same pixel size for processing and am now refining them together with anisotropic magnification correction ON to compensate for the residual difference. Even without this correction, we’re at sub-3 Å resolution by FSC.
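For context, the pixel-size mismatch really is small; a quick back-of-envelope check with the values above:

```python
# Relative difference between the two nominal pixel sizes quoted above.
apix_a = 0.86  # Å/px, dataset 1
apix_b = 0.85  # Å/px, dataset 2
rel_diff = abs(apix_a - apix_b) / apix_a
print(f"relative pixel-size difference: {rel_diff:.2%}")  # ~1.16%, i.e. <2%
```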
When running the final NU-refinement, the job fails around iteration 11 with the following error:
[CPU: 62.01 GB] Traceback (most recent call last):
File "/projappl/project_2004941/usrappl/kumpula/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1837, in run_with_except_hook
run_old(*args, **kw)
File "/projappl/project_2004941/usrappl/kumpula/cryoSPARC/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "cryosparc_worker/cryosparc_compute/engine/newengine.py", line 2443, in cryosparc_compute.engine.newengine.process.work
File "cryosparc_worker/cryosparc_compute/engine/newengine.py", line 2467, in cryosparc_compute.engine.newengine.process.work
File "cryosparc_worker/cryosparc_compute/engine/newengine.py", line 543, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
File "cryosparc_worker/cryosparc_compute/engine/newgfourier.py", line 121, in cryosparc_compute.engine.newgfourier.rfft2_on_gpu_inplace
File "/projappl/project_2004941/usrappl/kumpula/cryoSPARC/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 353, in cufftExecR2C
cufftCheckStatus(status)
File "/projappl/project_2004941/usrappl/kumpula/cryoSPARC/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftInvalidPlan
On rerunning with the same parameters, the same error appears at an earlier iteration, so the failure point seems somewhat sporadic.
Even though the traceback does not state it directly, I suspected a GPU memory issue, so I tried forcing the GPU batch size to 300; this made the run fail even earlier with the same error message. When I run without magnification correction, the job completes without issues.
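To sanity-check the memory hypothesis, here is a rough estimate (my own assumption about buffer layout, not CryoSPARC internals) of what a batched 2D R2C FFT at these job parameters should need:

```python
# Rough estimate (assumed float32 input / complex64 output, not CryoSPARC
# internals) of GPU memory for one batched 2D R2C FFT at the job parameters.
box = 560          # px, box size from the job
batch = 300        # the forced GPU batch size
bytes_real = 4     # float32 input image
bytes_cplx = 8     # complex64 output spectrum

input_bytes = batch * box * box * bytes_real
# R2C output keeps only the non-redundant half-spectrum: box x (box//2 + 1)
output_bytes = batch * box * (box // 2 + 1) * bytes_cplx
# cuFFT typically needs a work area on the order of the transform size
work_bytes = output_bytes

total_gib = (input_bytes + output_bytes + work_bytes) / 2**30
print(f"~{total_gib:.1f} GiB per batch (vs 32 GiB on a V100)")
```

This comes out to only about 1 GiB, so if memory is the culprit, it would have to be overall pressure or fragmentation at the moment the cuFFT plan is created (consistent with cufftInvalidPlan) rather than the FFT buffers themselves.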
All help appreciated!