Hi, we’ve been trying to run the RBMC (reference-based motion correction) job on a set of EER micrographs, roughly 700 movies (60 frames each, 8192x8192 at 0.49 Å pixel size).
We have hit a number of memory issues along the way and can get partway through the job, but we cannot complete the final motion-correction stage. To debug the memory limits we split the workflow into three jobs, (1) optimize hyperparameters, (2) compute empirical dose weights, and (3) motion-correct particles, passing the outputs from each job to the next.
We’ve tried different box sizes and GPUs, but we still cannot complete all three steps of the RBMC workflow:
(1) Optimize hyperparameters: We enabled 16-bit floating point, increased the GPU oversubscription setting to 200 GB (far larger than the memory on any of our cards, so the GPU is never oversubscribed), and set the in-memory cache size to 0.8 (80% of the 768 GB of main RAM available, about 614 GB). This step succeeded on a system with 24 GB Quadro RTX 6000 cards and ran for about 2 h 6 m.
(2) Compute empirical dose weights: We kept 16-bit floating point and set the number of target particles for dose weighting to 500, again with oversubscription at 200 GB and the in-memory cache at 0.8. This job fails on the systems with 24 GB RTX 6000 cards but succeeds on A6000 GPUs with 48 GB of memory; the successful run took 3 m 39 s. Something in this step appears to use a very large amount of GPU memory.
(3) Motion-correct particles: Still with 16-bit floating point, 500 particles (box sizes of 216 or 360 px), oversubscription at 200 GB, and the in-memory cache at 0.8 (80%). A back-of-envelope for how we have been thinking about the particle-related memory is sketched below.
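For reference, this is the rough estimate we have been using for one obvious contributor, the per-frame particle crops held on the GPU. The assumption that all crops for a movie's worth of particles are resident at once is ours and not something we have confirmed from the CryoSPARC code:

```python
# Back-of-envelope for one likely GPU memory contributor in the motion
# correction step: the per-frame particle crops. Assumes all crops for a
# movie's worth of particles are resident at once (our assumption, not
# confirmed from the CryoSPARC code).
def particle_stack_gib(n_particles, n_frames, box, bytes_per_value):
    return n_particles * n_frames * box * box * bytes_per_value / 2**30

for box in (216, 360):
    for bytes_per_value, label in ((4, "fp32"), (2, "fp16")):
        gib = particle_stack_gib(500, 60, box, bytes_per_value)
        print(f"box {box}px, {label}: ~{gib:.1f} GiB of particle crops")
```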
This last step fails on both the 24 GB and the 48 GB cards. Example output from job.log.err:
DIE: [refmotion worker 1 (Quadro RTX 6000)] ERROR: cuMemAlloc(size=26983568640): CUDA ERROR: (CUDA_ERROR_OUT_OF_MEMORY) out of memory
The allocation requested in that example, 26,983,568,640 bytes (about 25 GiB), is already larger than the 24 GB card's memory. On the A6000 with 48 GB of memory, the job can get through several movies but then fails at this stage:
File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_reference_motion.py", line 495, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_reference_motion.run_reference_motion_correction
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1220, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1235, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 620, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.slice_vol
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 417, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.load_models_rspace
File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 156, in cryosparc_master.cryosparc_compute.engine.newgfourier.rfft3_on_gpu_inplace
File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 72, in cryosparc_master.cryosparc_compute.engine.newgfourier.get_plan_R2C_3D
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 112, in __init__
self.handle = gpufft.gpufft_get_plan(
RuntimeError: cuda failure (driver API): cuMemAlloc(&plan_cache.plans[idx].workspace, plan_cache.plans[idx].worksz)
-> CUDA_ERROR_OUT_OF_MEMORY out of memory
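If a GPU memory timeline would help with diagnosis, we can log one while the job runs. This is roughly the polling script we would use (the device index 0 and the 1 s interval are just example values):

```python
# Polls GPU memory once per second so we can see which stage of the job
# pushes usage over the limit. Requires the NVIDIA management library
# bindings (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the GPU running the worker
try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{time.strftime('%H:%M:%S')}  "
              f"used {info.used / 2**30:.1f} / {info.total / 2**30:.1f} GiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```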
What factors drive GPU memory use in the motion-correction stage, and are there settings we could change, either in this job or upstream in the earlier jobs, that would let us stay within the GPU memory limits and get through this last stage?
Thanks,
Matt