I’ve found CS4.4 to be a lot more demanding, hardware-wise.
The new optimisations in 4.4 have changed the hardware targets dramatically: NU Refine with the new codepath will crash on 700-pixel boxes on a 24GB GPU due to lack of VRAM (enabling low-memory mode works, but is back to “classic” NU Refine speeds, possibly even slower). Prior to 4.4 (i.e., with the old NU Refine codepath) I could NU Refine 840-pixel boxes without issues on the same GPUs, with approximately the same number of particles (~110,000).
Also, RBMC is extremely VRAM-hungry; 512-pixel boxes crash with an “out of memory” error on 16GB GPUs on micrographs with a lot of particles, even when using just one GPU so that system RAM is not exceeded…
A lot of the improvements in speed for 4.4 come at the cost of significantly higher hardware demands… it’s a little scary that 32 or even 48GB GPUs are needed for what are now fairly standard box sizes.
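As a rough back-of-envelope for why big boxes hurt (how many working volumes NU Refine actually keeps resident is my guess here, not anything from the docs):

```python
# Crude VRAM estimate for holding refinement volumes on the GPU.
# Assumptions: float32 voxels and ~10 resident working volumes
# (half-maps, masks, gradients, FFT buffers) - purely illustrative numbers.
box = 700                          # box size in pixels
vol_gb = box**3 * 4 / 1e9          # one real-space float32 volume
print(f"one {box}^3 volume: {vol_gb:.2f} GB")     # ~1.37 GB
print(f"ten such volumes: {10 * vol_gb:.1f} GB")  # ~13.7 GB, before particle batches
```

A handful of working volumes plus FFT workspace and a batch of particle images doesn’t leave much headroom on a 24GB card.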
Would appreciate others chiming in with their experiences.
Hi @rbs_sci, “low memory mode” in NU Refine is virtually the same as the old code path. It shouldn’t be slower. But yes, part of the way we were able to speed up NU refinement was by taking advantage of more RAM.
Regarding running out of VRAM with 16 GB and a 512 box size in reference motion:
Roughly how many particles per micrograph do you have in this dataset?
Is the GPU oversubscription memory threshold set to a number higher than 16? By default it’s 20, which would not result in oversubscribing a 16 GB GPU, but I figured it wouldn’t hurt to check…
And to chime in more (if I may), the main way the speedup was possible was by reducing the number of host-to-device (and vice versa) memory transfers that happened (PCIe bandwidth was the biggest bottleneck in NU Refine). By keeping all the maps in GPU memory, the GPU’s scheduler is able to execute kernels one after the other without having to synchronize processing streams.
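Roughly the pattern, as a toy CuPy sketch (not the actual CryoSPARC code, just the host-to-device behaviour being described):

```python
import numpy as np
import cupy as cp

vol_host = np.random.rand(256, 256, 256).astype(np.float32)

def per_iteration_transfer(n_iters=10):
    """Old-style pattern: the volume crosses PCIe every iteration,
    and each copy forces the GPU to synchronize before continuing."""
    for _ in range(n_iters):
        vol_gpu = cp.asarray(vol_host)                # host -> device copy
        vol_gpu = cp.fft.ifftn(cp.fft.fftn(vol_gpu))  # some GPU work
        _ = cp.asnumpy(vol_gpu)                       # device -> host copy (sync point)

def resident_on_gpu(n_iters=10):
    """New-style pattern: upload once, keep the working set on the GPU,
    so kernels queue back-to-back with no PCIe round trips in the loop."""
    vol_gpu = cp.asarray(vol_host)                    # single upload
    for _ in range(n_iters):
        vol_gpu = cp.fft.ifftn(cp.fft.fftn(vol_gpu))  # kernels queue without sync
    return cp.asnumpy(vol_gpu)                        # single download at the end
```

The cost, of course, is that the whole working set now has to fit in VRAM instead of being streamed over from system RAM each pass.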
Maybe when 64GB+ GPUs are common the Structura team will make it even faster
I’ll check how many particles were in the micrograph that crashed; it was one of the more densely packed ones. I was playing with EMPIAR-10928 and wasn’t expecting issues.
I left GPU oversubscribe at 20GB. Good to confirm, though.
I’ve found that NU Refine will sometimes take tens of thousands of seconds for one half-set of a pass, while the other completes in a few thousand (with larger boxes). I’ve checked disk I/O, RAM utilisation, CPU, corrupt particles, etc., and can’t see anything out of the ordinary; the only potential connecting factor is running multiple NU Refine jobs at once (sometimes, if I kill half of the NU Refine jobs, the others will suddenly speed up and run “normally” again…?). I’ll see if I can trigger the issue again and send logs if you wish.
Looking at PCI-E and GPU utilisation during old vs. new NU Refine, it was immediately apparent that you’d optimised around PCI-E bandwidth.
Might be nice if there was a fallback mode for RBMC; there is no chance of successfully “polishing” one of the more extreme datasets I have (16Kx16Kx185 frames, ~1,500-2,500 particles per micrograph) with current GPUs; even with RELION it will eat ~1.6TB of RAM on a single MPI process on the “worst case” micrograph.
Oh, this immediately sounds like a transparent hugepages-related problem (@hsnyder correct me if I’m wrong).
Try disabling THP and check whether your refinement-related jobs immediately speed up: echo never > /sys/kernel/mm/transparent_hugepage/enabled
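The active mode is the bracketed value in that sysfs file; if you’d rather script the check than cat the file:

```python
# Print the current THP mode; the bracketed entry is the active one,
# e.g. "always madvise [never]" after the echo above. Linux only.
with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
    print(f.read().strip())
```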
Well, running 7 NU Refine jobs (on a box with 8 GPUs) with THP set to ‘never’ resulted in four of them crashing. Three are still running. No help in the job.log, but from dmesg it looks like they ran out of memory (or the OOM reaper got overly enthusiastic, as they should have had plenty of breathing room…)
The three running jobs finished successfully, and the results look in line with what I was expecting. I’ve restarted the four failed jobs; let’s see.
I’ll put this in this thread as it’s still related to high demands on the system.
768-pixel boxes (not too large, IMO) successfully completed hyperparameter calculation (with Extensive mode), but will crash during trajectory correction/particle output on 24GB GPUs. The crash appears related to the number of particles on the micrograph, as some early micrographs (with fewer particles) complete successfully.
Is there any solution/workaround for this beyond trying 48GB GPUs?
@rbs_sci I assume you’re talking about an RBMC job here? You could try turning off the “use all Fourier components” switch for the dose weight and final reconstruct stages. That will bring down the VRAM demand for the later stages somewhat.
This might be a challenge, but you could use cryosparc-tools, or perhaps some clever curation, to divide the picks into two particle stacks (ideally where each subset alternates picks along a given filament) and process them separately…
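A rough sketch of what I mean with cryosparc-tools (connection details and project/job IDs are placeholders; this alternates by pick order within each micrograph rather than strictly along a filament, and you should double-check the Dataset slicing call against the docs for your version):

```python
import numpy as np
from cryosparc.tools import CryoSPARC

# Placeholder connection details for your instance
cs = CryoSPARC(license="xxxx", host="localhost", base_port=39000,
               email="you@example.com", password="...")
project = cs.find_project("P1")                                 # placeholder project
particles = project.find_job("J100").load_output("particles")   # placeholder job

# Alternate picks within each micrograph between two subsets, so each
# subset keeps roughly half of every micrograph's particles.
mic_uids = particles["location/micrograph_uid"]
set_id = np.zeros(len(mic_uids), dtype=int)
for uid in np.unique(mic_uids):
    idx = np.flatnonzero(mic_uids == uid)
    set_id[idx[1::2]] = 1          # every other pick on this micrograph -> subset B

subset_a = particles.take(np.flatnonzero(set_id == 0))  # row slicing by index;
subset_b = particles.take(np.flatnonzero(set_id == 1))  # verify the exact method name
```

Each subset can then be saved back to the project (e.g. via save_external_result) and run through RBMC separately.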
Trying something along those lines now. Seems to be working so far… when I’ve got all the outputs, I’ll reconstruct them independently, then in pairs, then all together, and see whether or not the FSC goes pathological at any point as a result.
On that note, a function in the Particle Sets tool to split based on a maximum number of particles per micrograph in each set might be useful…
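In the meantime, something like this pure-numpy helper (hypothetical, with an arbitrary cap; mic_uids would come from the particles’ location/micrograph_uid field) is what I have in mind:

```python
import numpy as np

def cap_per_micrograph(mic_uids: np.ndarray, max_per_mic: int, n_sets: int) -> np.ndarray:
    """Assign each particle a set index 0..n_sets-1 so that no set takes more
    than max_per_mic particles from any one micrograph; overflow gets -1."""
    set_id = np.full(len(mic_uids), -1, dtype=int)
    for uid in np.unique(mic_uids):
        idx = np.flatnonzero(mic_uids == uid)
        for s in range(n_sets):
            chunk = idx[s * max_per_mic:(s + 1) * max_per_mic]
            set_id[chunk] = s
    return set_id

# e.g. set_id = cap_per_micrograph(mic_uids, max_per_mic=500, n_sets=3)
```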
And I’m not being a permanent negative Nancy about issues with RBMC; I really, really like the ability to add multiple reconstructions to the RBMC run and keep the particles independent for output. I’ve got an RBMC run going now (different dataset) with six reconstructions fed into it and extensive parameterisation, hammering along on 9/10 GPUs in a server, and it is tearing along impressively quickly (and hasn’t crashed yet…)
On the RBMC topic explicitly: I just had the fourth of five big RBMC runs complete successfully, with extensive parameterisation, 9 GPUs, 800,000 particles, 12,000 micrographs, and 440-pixel boxes from 4K micrographs on 16GB A4000s. It was approximately six times faster than a single GPU (two days and a bit, rather than two weeks); I suspect the imperfect scaling with GPU count is due to the data server, as it’s been getting hammered the last few weeks.
8K micrographs need more careful curation to keep particles per micrograph down, plus 24GB GPUs, but also run successfully with 768-pixel boxes. Feeding from 384-pixel boxes decreases the memory load, but 768-pixel input will still work with careful curation.
Now that I know some tricks to optimise, I’ll go and try the 512-pixel-box K3 TIFF data I’ve been experimenting with and see if I can get it working on 16GB cards, because so far it always crashes…