cuMemHostAlloc errors

donaldb · December 17, 2017, 9:52pm

I have been getting errors that all end in “cuMemHostAlloc failed: out of memory” in both Ab initio and Homogeneous refinement jobs.

I don’t think I am running out of GPU memory. I am running a 1080Ti GPU with CUDA 8. My dataset is in a 512 pixel box. At the point of crashing, GPU memory usage seems to peak at ~8GB, far below the 11GB capacity. I also get these errors when using the same dataset with 2x binned particles. I am running v0.6.3.

These errors are a little sporadic and I can get jobs to run to completion sometimes by simply cloning and restarting.

The error for Ab initio is:

Traceback (most recent call last):
  File "/raid/cryosparc/cryosparc-compute/sparc/streamlog.py", line 438, in run_with_except_hook
    run_old(*args, **kw)
  File "/raid/cryosparc/cryosparc-compute/engine/cuda_core.py", line 98, in run
    self.target(*self.args, dev=self.dev, thidx=self.thidx)
  File "/raid/cryosparc/cryosparc-compute/engine/engine.py", line 860, in work
    rs, ts = ET.cull_candidates(r_factor, t_factor)
  File "/raid/cryosparc/cryosparc-compute/engine/engine.py", line 433, in cull_candidates
    self.ensure_allocated('error_r', (self.N_D, self.N_K, self.N_RS_aligned), n.float32)
  File "/raid/cryosparc/cryosparc-compute/engine/engine.py", line 46, in ensure_allocated
    new = cuda_core.allocate_cpu(shape, dtype, curr)
  File "/raid/cryosparc/cryosparc-compute/engine/cuda_core.py", line 106, in allocate_cpu
    ret = cudrv.pagelocked_empty(shape, dtype=dtype)
MemoryError: cuMemHostAlloc failed: out of memory

and for Homogeneous refinement:

Traceback (most recent call last):
  File "/raid/cryosparc/cryosparc-compute/sparc/streamlog.py", line 438, in run_with_except_hook
    run_old(*args, **kw)
  File "/raid/cryosparc/cryosparc-compute/engine/cuda_core.py", line 98, in run
    self.target(*self.args, dev=self.dev, thidx=self.thidx)
  File "/raid/cryosparc/cryosparc-compute/engine/engine.py", line 855, in work
    if do_align: ET.setup_current_poses_and_shifts(rs, dr, ts, dt, shared=(bnb_iter == 0)) # rs and ts are shared in first iter
  File "/raid/cryosparc/cryosparc-compute/engine/engine.py", line 234, in setup_current_poses_and_shifts
    self.ensure_allocated('Rs', (self.N_D, self.N_K, self.N_RS_aligned, 6), n.float32)
  File "/raid/cryosparc/cryosparc-compute/engine/engine.py", line 46, in ensure_allocated
    new = cuda_core.allocate_cpu(shape, dtype, curr)
  File "/raid/cryosparc/cryosparc-compute/engine/cuda_core.py", line 106, in allocate_cpu
    ret = cudrv.pagelocked_empty(shape, dtype=dtype)
MemoryError: cuMemHostAlloc failed: out of memory
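
Both tracebacks fail in allocate_cpu, which requests page-locked host (CPU) memory through cuMemHostAlloc, so the buffer that cannot be allocated lives in system RAM rather than on the GPU. As a rough sketch of how large such a request might be, the snippet below computes the byte size of float32 buffers with the shapes shown in the tracebacks. The values of N_D, N_K and N_RS_aligned are hypothetical placeholders chosen only for illustration, and pinned_buffer_bytes is a helper made up for this example; the real values depend on the job.

```python
import math

FLOAT32_BYTES = 4  # size of one numpy float32 element


def pinned_buffer_bytes(shape, itemsize=FLOAT32_BYTES):
    """Return the number of host bytes an array of this shape would need."""
    return math.prod(shape) * itemsize


# Hypothetical sizes, NOT taken from any real job
N_D, N_K, N_RS_aligned = 500, 1, 100_000

# Shapes as passed to ensure_allocated in the two tracebacks
error_r_bytes = pinned_buffer_bytes((N_D, N_K, N_RS_aligned))
Rs_bytes = pinned_buffer_bytes((N_D, N_K, N_RS_aligned, 6))

print(f"error_r: {error_r_bytes / 2**30:.2f} GiB of page-locked host memory")
print(f"Rs:      {Rs_bytes / 2**30:.2f} GiB of page-locked host memory")
```

Comparing numbers like these against the free system RAM at the time of the crash may be more informative than watching GPU memory usage.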

Do you have any suggestions to get these jobs to complete?

Thanks,

Donald.

DanielAsarnow · December 25, 2017, 10:39pm

I also get these errors sporadically in 2D classification (where < 3.5 GB of the 1080Ti is used).

They may happen during heavy background I/O, but I’m not sure whether there are any other common elements. Notably, some other processes are using a few hundred MB of GPU memory, but these didn’t change across a number of sporadic errors and an eventual successful run.

I do not have the recommended cryosparc crontab entries - @donaldb I suggest you see if those help you.

marino-j · May 16, 2018, 12:04pm

Hi! I am experiencing the same kind of error; it complains that it can’t allocate CPU memory. That’s funny because I initially had cryosparc running well, even with a bigger dataset, and now I consistently get this error. I am wondering whether I should re-install cryosparc on the same machine.

Any suggestions?

Thank you for your help
Jacopo
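
Since the failing call allocates CPU-side memory, one thing that may be worth checking before re-installing is how much system RAM is actually available when the job starts. The following is a minimal sketch that assumes a Linux host exposing /proc/meminfo; parse_meminfo and host_memory_report are hypothetical helpers written for this example and are not part of cryosparc.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def host_memory_report(path="/proc/meminfo"):
    """Return MemTotal and MemAvailable (in kB) if the file provides them."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    return {k: info[k] for k in ("MemTotal", "MemAvailable") if k in info}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    for name, kib in host_memory_report().items():
        print(f"{name}: {kib / 2**20:.1f} GiB")
```

If MemAvailable is low while other jobs or heavy I/O are running, that could explain why the same job succeeds on some attempts and fails on others, although this is only a guess.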

spunjani · October 11, 2018, 7:48pm