Hi All, I got the following error after iteration 0 during Heterogeneous Refinement. I am using CryoSPARC v2.13.2 and cuda/9.2.148 on CentOS Linux release 7.6.1810. Can anyone help? Thanks.
[CPU: 10.83 GB] Done in 21.410s.
[CPU: 10.83 GB] Outputting files…
[CPU: 10.79 GB] Done in 7.830s.
[CPU: 10.79 GB] Done iteration 0 in 203.650s. Total time so far 203.650s
[CPU: 10.79 GB] – Iteration 1
[CPU: 10.79 GB] Batch size 6000
[CPU: 10.79 GB] Using Alignment Radius 23.670 (27.363A)
[CPU: 10.79 GB] Using Reconstruction Radius 35.504 (18.242A)
[CPU: 10.79 GB] Randomizing assignments for identical classes…
[CPU: 10.79 GB] Number of BnB iterations 3
[CPU: 10.79 GB] Engine Started.
[CPU: 13.97 GB] Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1547, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1053, in cryosparc2_compute.engine.engine.process.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 308, in cryosparc2_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
LogicError: cuMemHostAlloc failed: invalid argument
What were the box size, pixel size, and number of images in the dataset? What GPU model was this running on? This error message generically indicates an out-of-memory condition.
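For context, cuMemHostAlloc is the CUDA call that allocates pinned (page-locked) host memory, so it draws on CPU RAM rather than GPU memory, and it can fail like this when the OS cannot lock the requested amount. Below is a minimal PyCUDA sketch of that failure mode, not CryoSPARC's own code; the array shape is hypothetical and not what the job actually requests.

```python
# Minimal sketch (not CryoSPARC code): pinned host buffers in PyCUDA are
# allocated via cuMemHostAlloc, the call that fails in the traceback above.
import numpy as np
import pycuda.autoinit          # create a CUDA context on the default GPU
import pycuda.driver as cuda

try:
    # Page-locked (pinned) host buffer; if the OS cannot pin this much
    # CPU RAM, the allocation fails just like the job above.
    buf = cuda.pagelocked_empty((6000, 256, 256), dtype=np.float32)
    print("pinned allocation of %.1f GB succeeded" % (buf.nbytes / 1e9))
except cuda.Error as e:
    print("pinned allocation failed: %s" % e)
```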
This is strange because the box size and number of particles are not very large. How much CPU RAM do you have on this node? Is this still an issue, or did you find a way to get things working?
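While gathering that information, it may also help to record how much device and host memory is actually free when the job starts. A quick sketch using the same PyCUDA stack that appears in the tracebacks (Linux-specific, since it reads /proc/meminfo):

```python
# Quick check of free device memory and available host RAM on the worker
# node (Linux-specific; assumes a single GPU visible to the process).
import pycuda.autoinit
import pycuda.driver as cuda

free_gpu, total_gpu = cuda.mem_get_info()     # bytes, current device
print("GPU : %.1f / %.1f GB free" % (free_gpu / 1e9, total_gpu / 1e9))

with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":") for line in f)
avail_gb = int(meminfo["MemAvailable"].split()[0]) / 1e6   # kB -> GB
print("Host: %.1f GB available" % avail_gb)
```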
@donghuachen sorry we haven’t been able to help much with this so far. One question: when you created the heterogeneous refinement job, did you connect multiple of the same input reference volume, or multiple different reference volumes (or a combination)?
I can't quite piece together what the problem is either: we are getting this error when running on our RTX 2080s on a local workstation. It happens whether I use 1 or 4 GPUs and across several different job types, and sometimes the job completes without a problem if I simply re-run it.
[CPU: 966.1 MB] Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1685, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/hiter/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 165, in thread_work
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 157, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 160, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 161, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 387, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 312, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
  File "/home/hiter/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
MemoryError: cuMemAlloc failed: out of memory
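Note that this second traceback is a plain device-memory failure rather than the pinned-host one above: the GPUArray allocation ends in cuMemAlloc, which raises MemoryError once the card's memory is exhausted. That would also explain why a re-run sometimes succeeds, if other processes have released GPU memory in the meantime. A hedged sketch of that path, with a deliberately oversized, hypothetical shape:

```python
# Sketch of the allocation path in the traceback: GPUArray creation calls
# cuMemAlloc, which raises MemoryError when the device cannot satisfy it.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

free_b, total_b = cuda.mem_get_info()
print("before: %.1f / %.1f GB free on device" % (free_b / 1e9, total_b / 1e9))

try:
    # ~12 GB request, deliberately larger than an 8 GB RTX 2080.
    big = gpuarray.empty((3, 1024, 1024, 1024), dtype=np.float32)
except cuda.MemoryError as e:
    print("cuMemAlloc failed: %s" % e)
```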