Continue upon failure

Hi,

I noticed that occasionally some jobs will fail, even though the number of particles and the box size are very small. In that situation I get:
```
Traceback (most recent call last):
  File "/mnt/cache/cryosparc/cryosparc/cryosparc-compute/sparc/streamlog.py", line 321, in run_with_except_hook
    run_old(*args, **kw)
  File "/mnt/cache/cryosparc/cryosparc/cryosparc-compute/engine/cuda_core.py", line 86, in run
    self.target(*self.args, dev=self.dev, thidx=self.thidx)
  File "/mnt/cache/cryosparc/cryosparc/cryosparc-compute/engine/engine.py", line 626, in work
    ET.compute_resid_pow() # do this even if not do_align because we have to compare the different structures
  File "/mnt/cache/cryosparc/cryosparc/cryosparc-compute/engine/engine.py", line 251, in compute_resid_pow
    self.ensure_allocated('resid_pow', (self.N_D, self.N_K, self.N_RS_aligned, self.N_SS), n.float32)
  File "/mnt/cache/cryosparc/cryosparc/cryosparc-compute/engine/engine.py", line 52, in ensure_allocated
    new = cuda_core.allocate_gpu(shape, dtype, curr)
  File "/mnt/cache/cryosparc/cryosparc/cryosparc-compute/engine/cuda_core.py", line 109, in allocate_gpu
    ret = gpuarray.empty(shape, dtype=dtype)
  File "/mnt/data/cryosparc/cryosparc/cryosparc/anaconda2/lib/python2.7/site-packages/pycuda/gpuarray.py", line 209, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
MemoryError: cuMemAlloc failed: out of memory
```
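
If I read the traceback correctly, the failing call allocates a 4-D float32 array, so the memory it needs is the product of the four dimensions times 4 bytes. A quick back-of-the-envelope check (the dimensions below are made up, since I don't know the actual values for my job):

```python
import numpy as np

# Made-up dimensions just to illustrate the scaling -- the real values
# (N_D, N_K, N_RS_aligned, N_SS) depend on the dataset and job settings.
N_D, N_K, N_RS_aligned, N_SS = 1000, 3, 64, 128

# Same arithmetic pycuda does: element count times itemsize (4 bytes).
nbytes = N_D * N_K * N_RS_aligned * N_SS * np.dtype(np.float32).itemsize
print("resid_pow buffer: %.3f GiB" % (nbytes / 1024.0 ** 3))
```

If the free GPU memory fluctuates between runs (other processes, fragmentation), that would explain why the same job sometimes fits and sometimes doesn't.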

I've experienced this a few times, and when I repeat the job with the same data, the same seed, etc., it often runs through on the second attempt. So, in this case, instead of running it again from scratch, would it make sense to have a "continue from last saved state" option?
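
To be clear about what I mean: roughly the standard checkpoint/restart pattern sketched below. All the names and the pickle format are made up; I'm not suggesting this is how cryoSPARC is structured internally.

```python
import os
import pickle

CHECKPOINT = "job_state.pkl"  # hypothetical checkpoint file

def save_state(state):
    # Write to a temp file and rename, so a crash mid-save
    # can't leave a corrupt checkpoint behind.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_state():
    # Resume from the last saved state if one exists, else start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0}

def run_iteration(state):
    # Placeholder for one refinement iteration -- this is where a
    # MemoryError like the one above could be raised.
    state["iteration"] += 1
    return state

def main(n_iterations=10):
    state = load_state()
    while state["iteration"] < n_iterations:
        state = run_iteration(state)
        save_state(state)  # persist after every completed iteration

if __name__ == "__main__":
    main()
```

Saving after every completed iteration would mean that an out-of-memory failure like the one above only costs the current iteration rather than the whole run.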