Memory issues with Reference Based Motion Correction?

Hi, we’ve been trying to run the RBMC job on a set of ~700 EER movies (60 frames, 8192x8192 at 0.49 Å pixel size).

We have hit many memory issues along the way and have partially succeeded, but we cannot complete the final motion-correction stage. To debug the memory limits, we split the workflow into three jobs, (1) optimize hyperparameters, (2) compute dose weighting, and (3) motion-correct particles, passing each job’s outputs on to the next.

We’ve tried different box sizes and GPUs, but we still cannot complete all three RBMC steps:

(1) Optimize hyperparameters: we enabled 16-bit floating point, raised the GPU oversubscription setting to 200 GB (well above the cards’ memory, intending to avoid GPU oversubscription), and set the in-memory cache size to 0.8 (80% of the 768 GB of main RAM available). This succeeded on a system with 24 GB RTX 6000 cards and ran for about 2h 6m.

(2) Compute empirical dose weighting: we kept 16-bit floating point and set the target number of particles for dose weighting to 500, with oversubscription at 200 GB and the in-memory cache at 0.8. This job fails on the systems with 24 GB RTX 6000 cards but succeeds on 48 GB A6000 GPUs (3m 39s on the successful attempt). Something in this step appears to use a very large amount of GPU memory.

(3) Motion-correct particles: still with 16-bit floating point, 500 particles (box sizes of 216 or 360 px), oversubscription at 200 GB, and the in-memory cache at 0.8 (80%). (A summary of all three configurations follows below.)
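
For clarity, here is a plain-Python summary of the settings we used for each of the three stages. The dictionary keys are just descriptive labels for this post, not actual CryoSPARC parameter names:

```python
# Descriptive summary of our three RBMC job configurations
# (labels only, not real CryoSPARC parameter names).
rbmc_settings = {
    "1_optimize_hyperparameters": {
        "fp16": True,                     # 16-bit floating point enabled
        "oversubscription_gb": 200,       # set well above card VRAM, intending to avoid oversubscription
        "in_memory_cache_fraction": 0.8,  # 80% of 768 GB system RAM
        "outcome": "succeeded on 24 GB RTX 6000, ~2h 6m",
    },
    "2_compute_dose_weights": {
        "fp16": True,
        "target_particles": 500,
        "oversubscription_gb": 200,
        "in_memory_cache_fraction": 0.8,
        "outcome": "fails on 24 GB RTX 6000, succeeds on 48 GB A6000 (3m 39s)",
    },
    "3_motion_correct_particles": {
        "fp16": True,
        "particles": 500,
        "box_size_px": (216, 360),        # both box sizes tried
        "oversubscription_gb": 200,
        "in_memory_cache_fraction": 0.8,
        "outcome": "fails on both 24 GB and 48 GB cards (CUDA_ERROR_OUT_OF_MEMORY)",
    },
}

for stage, params in rbmc_settings.items():
    print(stage)
    for key, value in params.items():
        print(f"  {key}: {value}")
```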

This last motion-correction step fails on both the 24 GB and 48 GB cards; example output from job.log.err:

DIE: [refmotion worker 1 (Quadro RTX 6000)] ERROR: cuMemAlloc(size=26983568640): CUDA ERROR: (CUDA_ERROR_OUT_OF_MEMORY) out of memory

On the 48 GB A6000 it succeeds for several movies but then fails at this stage:

File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_reference_motion.py", line 495, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_reference_motion.run_reference_motion_correction
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1220, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1235, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 620, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.slice_vol
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 417, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.load_models_rspace
File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 156, in cryosparc_master.cryosparc_compute.engine.newgfourier.rfft3_on_gpu_inplace
File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 72, in cryosparc_master.cryosparc_compute.engine.newgfourier.get_plan_R2C_3D
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 112, in __init__
self.handle = gpufft.gpufft_get_plan(
RuntimeError: cuda failure (driver API): cuMemAlloc(&plan_cache.plans[idx].workspace, plan_cache.plans[idx].worksz)
→ CUDA_ERROR_OUT_OF_MEMORY out of memory


What factors drive GPU memory use in the motion-correction stage, and are there any settings or changes that would help us stay within the GPU memory limits? Please let us know if there is anything we could adjust, in this stage or upstream in the earlier jobs, to get through the final motion-correction step.

Thanks,
Matt


RBMC is shockingly system heavy.

I’ve got one dataset which randomly crashes on 16GB GPUs during the first iteration of hyperparameter optimisation, regardless of how few or many particles I ask it to optimise (haven’t tried <1,000 yet). Others crash on 24GB GPUs, and three datasets fail even on A6000s.

I’m beginning to think we might need a CPU fallback… 48GB GPUs aren’t exactly common, even now. I know several collaborators who have been happily doing their own processing in CryoSPARC with 24GB GPUs, but 48GB cards (plus the system RAM requirements) are well out of their budget.

edit: Just tried 1,000 particles on 16GB GPUs. It hasn’t crashed yet (up to iteration 6), though I have some concerns about how good the hyperparameter estimates will be with so few particles.


Hi @larsonmattr and @rbs_sci. Sorry to hear you’re having VRAM problems with RBMC. We can try to address this at some point in the future, but for now perhaps I can offer some ideas about reducing RBMC’s VRAM demands.

Very roughly, the baseline VRAM usage is proportional to:

  • Number of particles per movie
  • Number of frames per movie
  • Number of Fourier components in the alignment band. This warrants some explanation. The trajectory alignment is done in Fourier space, but not using all of the Fourier components (so the memory demand generally isn’t as bad as (box size)^2). In the hyperparameter estimation step, half of the Fourier components between 20 Å and the FSC cutoff frequency are used by default. In the two later steps, all of the components between 20 Å and the FSC frequency are used (which is why you might run out of VRAM on the dose weight stage even if hyperparameter optimization succeeds).
  • Naturally, oversubscription doubles the required VRAM per GPU.

Perhaps counterintuitively, the number of particles used for dose weight computation and hyperparameter search doesn’t affect VRAM demands. 16-bit float also has no effect, as all the actual math is still done in float32, and the memory used for caching affects only CPU RAM, not VRAM. A rough sketch of how the factors above combine follows below.
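
To make that scaling concrete, here is a rough back-of-the-envelope sketch in Python. It is only illustrative and not the job’s actual memory accounting; the particle count, binned pixel size, FSC radius, and the bytes-per-component constant below are all assumed numbers, not values taken from this dataset.

```python
import numpy as np

def alignment_band_components(box_size_px, pixel_size_a, fsc_radius_frac,
                              low_res_a=20.0, fraction=0.5):
    """Rough count of the 2D Fourier components in the alignment band.

    Counts components whose radius falls between the 20 A ring and the FSC
    cutoff (given as a fraction of Nyquist), then keeps `fraction` of them
    (0.5 by default during hyperparameter search, 1.0 with "use all Fourier
    components").  Illustrative only.
    """
    low_res_frac = (2.0 * pixel_size_a) / low_res_a   # 20 A ring as fraction of Nyquist
    fy, fx = np.meshgrid(np.fft.fftfreq(box_size_px),
                         np.fft.rfftfreq(box_size_px), indexing="ij")
    r = np.sqrt(fx**2 + fy**2) / 0.5                  # radius as fraction of Nyquist
    in_band = (r >= low_res_frac) & (r <= fsc_radius_frac)
    return int(in_band.sum() * fraction)

def rough_vram_gb(particles_per_movie, frames, n_components,
                  bytes_per_component=8, oversubscribed=False):
    """Toy scaling model: VRAM ~ particles x frames x band components.

    The 8-byte constant (one complex64 value) and the overall form are
    assumptions for illustration, not CryoSPARC's real bookkeeping.
    """
    gb = particles_per_movie * frames * n_components * bytes_per_component / 1e9
    return gb * (2 if oversubscribed else 1)

# Assumed example numbers: ~100 particles/movie, 60 frames, 360 px box,
# ~0.98 A binned pixel size, FSC cutoff at 0.8 of Nyquist.
ncomp = alignment_band_components(360, 0.98, 0.8, fraction=1.0)
print(f"~{ncomp} components in band, "
      f"~{rough_vram_gb(100, 60, ncomp):.0f} GB (rough scaling estimate)")
```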

Some things that could be done to reduce these factors:

  • Turn off oversubscription

  • If you’re processing a heterogeneous dataset and you have a lot of particles per movie, you can reduce the effective number of particles per movie by processing each species in a separate reference motion job.

  • You can lower the “Fraction of FCs used for alignment” parameter to reduce the VRAM requirement of hyperparameter estimation.

  • You can turn off “Use all Fourier components” in the dose weight and final reconstruction stage parameters. This will cause those stages to use the same subset of the Fourier components that is used in hyperparameter estimation (see the illustration after this list).
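
As a quick illustration of how those last two options scale the alignment band (and therefore the baseline VRAM), continuing the rough sketch above with an assumed full band size:

```python
# Effect of the band-size options on the number of alignment Fourier components.
# n_band_full is an assumed, illustrative count of all components between the
# 20 A ring and the FSC cutoff.
n_band_full = 32000

scenarios = {
    "hyperparams, fraction of FCs = 0.5 (default)": 0.5,
    "hyperparams, fraction of FCs = 0.25": 0.25,
    "dose weight / final recon, use-all-FCs on": 1.0,
    "dose weight / final recon, use-all-FCs off": 0.5,
}

for label, frac in scenarios.items():
    print(f"{label:48s} -> ~{frac * n_band_full:7.0f} components "
          f"({frac:.0%} of full band)")
```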

There’s also some VRAM set aside on one of the GPUs for slicing the volumes, and that’s proportional to the box size cubed. If your box has an excessive and unneeded amount of empty padding around it, cropping it may help with this.
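
For a sense of that cubic scaling, a toy calculation (the real allocations also include FFT plan workspace and padding on top of the bare volume, so treat these numbers as lower bounds):

```python
# Size of a single float32 real-space volume as a function of box size;
# the volume-slicing allocations scale roughly like this cube.
def volume_gb(box_size_px, dtype_bytes=4):
    return box_size_px**3 * dtype_bytes / 1e9

for box in (216, 360, 512):
    print(f"box {box}^3: ~{volume_gb(box):.2f} GB per float32 volume copy")
# Cropping a padded 360 px box down to 216 px cuts this term by ~4.6x.
```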
