GPU Memory Errors on system with 2080Tis but not system with 1080Tis

yoshiokc · July 11, 2020, 12:20am

I updated to 2.16beta a little while ago and noticed today that patch motion correction jobs were failing on one of our GPU servers (8x 2080Tis) but were fine on the other system with 8x 1080Tis. Some quick troubleshooting didn’t turn up the usual quick fixes, but I did notice that the 2080Tis have about 128MB less VRAM than the 1080Tis. Could the recent update be more aggressive about taking GPU memory, and maybe be working from the 1080Tis’ limit rather than the 2080Tis’?
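For anyone wanting to check the VRAM gap on their own machines, here is a minimal sketch that reads each GPU's total memory from `nvidia-smi` and compares two readings. The helper names (`parse_gpu_memory`, `query_gpu_memory`) are made up for illustration, and the sample numbers are placeholders chosen to match the "about 128MB" figure above, not measurements from either server.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'name, memory.total' CSV lines into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a GPU name containing commas still parses.
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]
    )
    return parse_gpu_memory(out.decode())


# Placeholder values for illustration only (not real readings).
sample_1080 = "GeForce GTX 1080 Ti, 11178\n"
sample_2080 = "GeForce RTX 2080 Ti, 11050\n"
gap = parse_gpu_memory(sample_1080)[0][1] - parse_gpu_memory(sample_2080)[0][1]
print("difference in MiB:", gap)
```

Running `query_gpu_memory()` on each server would replace the placeholder strings with real values.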

yoshiokc · July 11, 2020, 5:52am

So it’s not from being in the same lane, but I still haven’t figured out what has changed.

hsnyder · July 13, 2020, 3:42pm

Hi @yoshiokc,

Interesting, thanks for reporting this. The new version should not use more VRAM; if anything it should use less, since we’ve introduced a switch for a beta version of the algorithm that uses less GPU memory. Just to clarify, does a job that formerly worked on 2080Tis now fail (with the same input data and parameters as before)?

– Harris

yoshiokc · July 13, 2020, 4:19pm

Hi Harris,

Yes, they used to work on the 2080Tis, but I’m not sure when they stopped. The CS update could have been a coincidence: the software upgrade was accompanied by recent kernel, driver and CUDA updates. I just rebuilt pycuda against 10.0 instead of 10.2 and it seems better, but still not completely fixed. Only motion correction seems unhappy; most other jobs I’ve tested work fine. I’ll try to get you more actionable debug information.

This is version: v2.16.1-live_deeppick_privatebeta
There is no longer an option for ‘low memory’ in the patch motion job. Was this switched to be automatic?

Is there a way to get the CUDA version information being used by the worker? (version and paths). I’ve been trying to rebuild the worker against different versions using bin/cryosparcw newcuda /new/path but I’m not sure how to confirm this worked.
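One way to see which CUDA the worker's pycuda was built against is to ask pycuda directly. The sketch below is not a cryoSPARC tool: `describe_cuda` is a hypothetical helper, and the environment variable names it inspects (including `CRYOSPARC_CUDA_PATH`) are assumptions about what the worker environment exports. It should be run with the worker's own Python interpreter so that it imports the same pycuda build the jobs use.

```python
import os


def describe_cuda():
    """Collect CUDA-related paths and pycuda build information."""
    lib_path = os.environ.get("LD_LIBRARY_PATH", "")
    info = {
        # Assumed variable names; any of these may be unset.
        "CRYOSPARC_CUDA_PATH": os.environ.get("CRYOSPARC_CUDA_PATH"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        # Library search-path entries that mention CUDA.
        "cuda_lib_dirs": [p for p in lib_path.split(os.pathsep)
                          if "cuda" in p.lower()],
    }
    try:
        import pycuda
        import pycuda.driver as drv
        info["pycuda_version"] = pycuda.VERSION
        info["pycuda_file"] = pycuda.__file__
        # CUDA version pycuda was compiled against, as a tuple.
        info["cuda_compiled_against"] = drv.get_version()
        # CUDA version supported by the installed driver, as an integer.
        info["driver_cuda_version"] = drv.get_driver_version()
    except Exception as exc:
        # pycuda is missing or cannot load the CUDA libraries.
        info["pycuda_error"] = repr(exc)
    return info


for key, value in sorted(describe_cuda().items()):
    print("{}: {}".format(key, value))
```

If `cuda_compiled_against` still shows the old version after a rebuild, the rebuild probably did not take effect in that environment.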

I tried rolling back the kernel and Nvidia driver (440.x, 450.x), CUDA (10, 10.1, 10.2, 11.0), pycuda (2019.1, 2019.1.2) and cryoSPARC (2.15.2_beta, 2.16.1_beta). The only combo that worked on the 2080Tis was 2.15.2_beta and CUDA 10.0. This combination worked with multiple different versions of the Nvidia driver and pycuda.
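To put the size of that search in perspective, here is a small sketch that enumerates the full matrix of the versions listed above and filters it by the reported outcome. The thread does not say that every combination was actually run, so the `reported_ok` predicate only encodes the summary given here (2.15.2_beta with CUDA 10.0, any driver or pycuda); it is an illustration, not a test log.

```python
from itertools import product

# Versions named in the post; "10" is written as "10.0" here.
drivers = ["440.x", "450.x"]
cuda_versions = ["10.0", "10.1", "10.2", "11.0"]
pycuda_versions = ["2019.1", "2019.1.2"]
cryosparc_versions = ["2.15.2_beta", "2.16.1_beta"]

# Every (driver, CUDA, pycuda, cryoSPARC) combination.
matrix = list(product(drivers, cuda_versions, pycuda_versions,
                      cryosparc_versions))


def reported_ok(combo):
    """Encode the reported result: only 2.15.2_beta with CUDA 10.0 worked."""
    _driver, cuda, _pycuda, cryosparc = combo
    return cuda == "10.0" and cryosparc == "2.15.2_beta"


working = [combo for combo in matrix if reported_ok(combo)]
print("total combinations:", len(matrix))
print("combinations matching the reported working setup:", len(working))
```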

[CPU: 1.11 GB]   Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1685, in run_with_except_hook
    run_old(*args, **kw)
  File "/pncc/storage/1/cryosparc/sw/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 165, in thread_work
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 157, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 160, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 161, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 446, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 197, in cryosparc2_compute.engine.cuda_core.transfer_ndarray_to_cudaarray
MemoryError: cuArrayCreate failed: out of memory

hsnyder · July 14, 2020, 5:46pm

Hi @yoshiokc,

The switch for low memory use should still be present - we still consider it a beta feature. We’ve looked into this and the feature was in fact missing from the private beta releases. This was an issue on our end, thanks very much for bringing it to our attention. We have released a patch for v2.16.1-live_deeppick_privatebeta and v2.16.0-deeppick_privatebeta which can be downloaded as per the instructions here: https://guide.cryosparc.com/setup-configuration-and-management/software-updates#apply-patches

Harris

yoshiokc · July 14, 2020, 6:37pm

Thanks! I will test at the next opportunity. Is the low-memory code path on by default? Even if it works, we use Live 95% of the time, so the workers would have to have it enabled.

hsnyder · July 14, 2020, 6:53pm

Yes, the low-memory code path is enabled by default. It should work correctly in Live.

Harris

yoshiokc · July 19, 2020, 4:56am

Just got the chance to re-update and apply the patch. Can confirm it is working.