cryoSPARC benefits from more GPU Vram or processing power?

lecongmi001 · December 3, 2024, 4:48am

Dear cryoSPARC community,

We are looking to procure a new workstation. Since the GPU prices are so high, we are looking to trade off either one or the other spec. Our current workstation has the following

4x RTX 4070Ti 12Gb
AMD Ryzen Threadripper Pro 5975WX 32 cores
512gb DDR4 3200 MHz RAM

With this workstation, we noticed that for bigger boes (700 pix), the refinement becomes very slow, probably due to a lack of Vram. Therefore, I am thinking of getting an RTX 3090 instead, to increase the Vram capacity to 24gb without breaking the buck.
Therefore, my question is whether cryoSPARC jobs (particularly NU-Refinement) benefit from more GPU Vram or higher processing power?

Thank you and kind regards,
Khoa.

rbs_sci · December 3, 2024, 6:01am

If you have 24GB cards, you should be good. While the Ada Lovelace generation cards are (on paper) significantly faster than their Ampere generation predecessors, in practice due to other factors (disk I/O, mostly) they’re not dramatically faster in most practical cryo-EM image processing scenarios. If you can find 3090’s at reasonable prices (they’re EoL now, I believe) then it’s probably a better idea than 12GB or 16GB 4000 series cards. Also power budget; the Ada Lovelace consumer cards are pushed to the edge, and setting some power limits will dramatically reduce their power consuption and heat output without crippling their performance (25-40% reduction in power draw for 5-10% drop in performance if I remember the benchmarks I saw correctly…)

Since Blackwell is supposed to come H1 next year, I would normally counsel waiting, but the rumours I’ve seen about prices of the Blackwell cards are that they’re going to be even more ridiculously expensive than the Ada Lovelace generation. Take that for what it’s worth, given that it is rumour, but it’s a reasonable extrapolation given what nVidia focussed on during GTC and their investor calls (improvements in AI inference speed, etc).

A few semi-related thoughts:

PyFFTW imposes a limit on box size in CryoSPARC regardless of GPU VRAM (1120 pixel boxes are the largest I’ve run successfully, even on 48GB GPUs, while 1480 is possible in RELION on 48GB GPUs with Fourier padding disabled)
Even on 48GB GPUs, 700 pixel boxes crash in NU refine (for me) unless “Low memory mode” is enabled
12GB/16GB cards are attractive, but many of the more demanding features of modern CryoSPARC will max out and crash (3D flex, high class count and/or high resolution 3D classification, 3D variability analysis, larger box size Local Resolution Estimation) on <24GB cards

leetleyang · December 3, 2024, 11:04am

Hi,

We’ve accumulated quite few 4090 cards in our cluster. Do you happen to have a reference for this that I could pass on to our HPC team?

Cheers,
Yang

rbs_sci · December 3, 2024, 11:05am

edit: Sorry, didn’t realise Ctrl+Enter just immediately posted.

These are game focussed, but I wouldn’t be surprised if it is actually more efficient outside of gaming scenarios. I saw some others which were more HPC focussed, but I’ll need to check my bookmarks at home.

I’ve avoided 4090’s because of their insane power draw. 450W (600W in some cases for the CLC cooled cards) is absolutely ridiculous.

edit 2: The fact that increasing the power limit 20% but only yields <2% gains shows just how close to the limit the Ada Lovelace cards are.

leetleyang · December 3, 2024, 11:14am

Thank you very much. Would appreciate any HPC-related benchmarks as well if you can find them. I’m surprised our racks/PSUs haven’t melted already.

Cheers,
Yang

rbs_sci · December 3, 2024, 11:29am

It’s very possible your HPC team have already power limited them.

The Quadros are much saner than their consumer focussed cards. 300W power limit on the A6000 Ada, for example, like it’s older sibling from Ampere gen.

For me, here, the Ada cards are 60-80% more than their Ampere equivalents, so my reticence was both price and power demand. But our suppliers now tell me Ampere is EoL and getting more is next to impossible. I’m not looking forward to the quote for an octa-Ada GPU box, it’ll be practically double what the equivalent Ampere was when Ada launched.

It’s actually bad enough that a CPU path for CryoSPARC would be extremely appealing, simply because throwing a few 128-core Zen 5 Epyc CPUs at the problem will actually be cheaper than a high-GPU-count Ada or Blackwell based system.

I’ve been tinkering with AMD GPUs, but ROCm still suffers from poor support in all directions - I have neither the time nor the patience to deal with “supported hardware/OS” requirements which feel like to make headway I need to evoke the spirit of Mussorgsky’s “Night on the Bare Mountain”.

leetleyang · December 3, 2024, 11:50am

Indeed, I advocated for the Quadro variants in our last round of procurement, but following discussions with our supplier, it proved prohibitively expensive. I suppose we can pipe some of the heat coming off the 8x4090 nodes into the building’s heating.

Would this be reflected in the pwr cap value reported in nvidia-smi? Or is this not necessarily updated transparently?

Cheers,
Yang

rbs_sci · December 3, 2024, 12:51pm

Yes, set and checked in nvidia-smi, but I’m not sure if it’s shown correctly in the “normal” output…?

To set:

sudo nvidia-smi -pl 200

Where 200 would be 200W, for example…

And:

nvidia-smi -q -d POWER

Checks whether it’s set or not. Or should, if nVidia aren’t lying to me with out of date docs.

leetleyang · December 3, 2024, 1:42pm

Thank you. That’s helpful. 450W reported. I’ll have a chat with our HPC guy.

Cheers,
Yang

Mark-A-Nakasone · December 3, 2024, 9:19pm

~600-700 pix is the max for me on RTX-3090 or A5000 (all of our workstations are based on 24GB cards) and NVLINK does not help. In general, I have found that Homogeneous reconstruction and Homogeneous Refinement will run with larger box sizes, while NU-refine will not.

3D Flex really only runs at 440 pix, see @hbridges1 post https://discuss.cryosparc.com/t/flex-reconstruction-failing/15016/4?u=mark-a-nakasone.

RELION does things a little differently, but a lot of people see this table https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/performance-metrics and think 1024 pix box size will run fine on a 3090/A5000 - but it will not.

Can use $nvtop or $htop to record the logs of GPU core use, VRAM, and power. I find it helpful to watch some jobs through tmux with nvtop and htop.

some times if the GPUs are not under heavy load you can override the scheduler (run now on a specific GPU 0,1,2,n) in cryosparc and it can save time.

I am grateful to @rbs_sci, the CryoSparc Team, and the other users that post their experiences. I have witnessed many of my users thinking they had the wrong input, something was wrong with the computer, etc.

rbs_sci · December 3, 2024, 11:28pm

I’m beginning to really dislike that table. I have the same problem; some new users see that, and start throwing accusations around when 1024 pixel boxes don’t work. It’s very, very out of date and really needs to be removed or updated.

Also, NVLINK needs to be explicitly supported in the software AFAIK, and, well, basically nothing bothers.

rbs_sci · December 10, 2024, 12:26am

For whomever might be interested…?

I’ve dug back through my notes and some projects which are still attached to the various systems and here’re some numbers (systems run either SATA SSD RAID or PCI-E Gen 4 NVMe RAID, but results are from different versions of CryoSPARC from 4.4 to 4.6.2):

8GB (RTX 2080) (don’t ask, this is for giggles: 8GB cards no longer officially supported!):
256 pix: NU refine OK
360 pix: NU refine crash
512 pix: NU refine crash

8GB (RTX 2080) (don’t ask, this is for giggles: 8GB cards no longer officially supported!) (Low memory mode):
256 pix: NU refine OK
360 pix: NU refine OK
512 pix: NU refine OK
560 pix: NU refine OK (13 days for a full NU refine run of 680,000 particles without Global CTF refinement! Do not recommend!)
600 pix: NU refine crash

16GB (A4000) (optimised NU refine):
600 pix: NU refine OK
700 pix: NU refine crash
768 pix: NU refine crash, Local resolution estimation crash

16GB (A4000) (Low memory mode):
700 pix: NU refine OK (slow)
768 pix: NU refine OK (sloooooooooow!)
800 pix: NU refine crash

24GB (A5000) (optimised NU refine):
600 pix: NU refine OK
700 pix: NU refine crash
768 pix: Local resolution OK

24GB (A5000) (Low memory mode):
600 pix: NU refine OK
700 pix: NU refine OK
768 pix: NU refine OK
850 pix: NU refine OK
1024 pix: NU refine crash (this might have had another reason, but was a CUDA out of memory error)
1050 pix: NU refine OK

48GB (A6000) (optimised NU refine):
600 pix: NU refine OK
700 pix: NU refine crash

48GB (A6000) (Low memory mode):
700 pix: NU refine OK
768 pix: NU refine OK
900 pix: NU refine OK
1050 pix: NU refine OK
1100 pix: NU refine OK
1120 pix: NU refine OK
1150 pix: NU refine crash
1200 pix: NU refine crash
1400 pix: NU refine crash

And for comparison, with Fourier padding disabled on A6000s, RELION is OK with 1440 pixel boxes (with Xorg running) and 1480 pixel boxes (with Xorg disabled). On 80GB A100s, I think 1600 pixel boxes are OK (but don’t have 80GB cards to test). In CPU mode, I’ve done 1800 pixel boxes successfully. Larger is possible, but needs more RAM than we can afford, or willingness to hammer an NVMe SSD for a large swapfile.

On A6000’s, I successfully RBMC’d a 1400 pixel box (original data was 700 pix @ 4K, 1400 pix @ 8K, which I then downsampled to 1050 for processing.

This isn’t exhaustive, but might provide a little additional guidance?

rbs_sci · December 10, 2024, 2:03am

As a further aside, 1050 pixel boxes eat over 440GB of system RAM when NU refinement is running:

Screenshot from 2024-12-10 11-00-53

It can be even more than pictured (it spiked to 468GB about a second after I took this screenshot) so on multi-user systems it will be very easy to have everything come crashing down.

lecongmi001 · December 17, 2024, 7:05am

Hi there,
Interesting discussion going on! Thank you for all of your inputs!

On the note of power and temperature, I realized that our RTX 4070Ti crashes (nvidia-smi cannot find it) a lot under Relion’s refinement. One of the reason I am thinking is that it cannot throttle by itself and ended up overheating (temp usually 92++). So I am playing around with limiting the power draw to 250W (275W from default 285W it still crashes) and GPU-Target-Temp to 75C (somehow it still hovers around 85-90C).
Has anyone faced this issue with this line of GPU? Or is it my GPU is defective?

rbs_sci · January 3, 2025, 2:54am

Uncertain. I have no Ada generation cards to test. I’m pretty sure even if it is defective, any RMA would be rejected since most manufacturers will argue that they are gaming cards, not for GPGPU (like how they would reject RMAs for mining cards). If RELION crashes with an error it would probably be more valuable to ask on the github or ccpem mailing list. If it’s a silent crash with something in dmesg that might reveal the reason.