Unfortunately yes, the local decomposition step in legacy non-uniform refinement (or legacy local refinement with NU-refinement enabled) is quite slow with larger boxes and can't be sped up. But the implementation in the new non-uniform refinement and local refinement jobs is much improved and should be a lot faster (a few minutes in most cases).
Hi, @apunjani, thank you so much for the explanation!
Please see my other post below. For some reason I consistently get ~1 Å worse resolution using the new local refinement compared to the legacy one, and overall the new job runs even slower than the legacy one. Any suggestions would be greatly appreciated!
The NU-refinement in the new local refinement is indeed faster; however, for a big 512 box, computing the "Local cross validation A/B" takes ~1 h per iteration. Is it possible to skip or speed up this step?
For a single workstation with 8x RTX 3090 GPU cards and plenty of CPU cores and memory, I can run at most 2 local refinement jobs (512 box) simultaneously; otherwise all the jobs run much slower. Any suggestions? Thanks!
Unfortunately the local cross validation step is the crux of non-uniform refinement - without it the refinement is just a standard homogeneous refinement! If possible, while the job is running and on this step, can you check that your system is not swapping CPU RAM (you can use htop to watch whether the memory gets full and starts to swap) and whether the GPUs are continuously active (you can watch by checking nvidia-smi repeatedly)?
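To make the checks above concrete, a simple way to watch both at once (standard Linux commands; the exact nvidia-smi fields below are one reasonable choice, not the only one):

```shell
# Watch CPU RAM and swap every 2 s (the same numbers htop reports).
# If the "Swap" used value keeps climbing while the job runs, the
# system is swapping.
watch -n 2 free -h

# In a second terminal, watch GPU activity. During the cross validation
# step the GPUs should show bursts of utilization, not a constant 0 %.
watch -n 2 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
```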
This behaviour typically means that the system is swapping or is bottlenecked by a non-GPU system resource (CPU memory bandwidth, IO bandwidth from the SSD caches, etc). htop is a good tool to watch memory usage during processing and see if the system CPU RAM is getting full. You can also test whether memory is the culprit by checking whether more than 2 simpler, less memory-intensive jobs (2D class, etc) can run concurrently without slowing down.
Hi @apunjani Thank you so much for the suggestions!
After looking at the system together with our IT expert, I think we can rule out the GPUs or RAM as the bottleneck, so disk IO seems the most likely cause.
For a local refinement job, will disk IO be more intense for a large box size?
I did use the “Cache particle images on SSD” option though.
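One way to confirm the disk-IO hypothesis while the job is on the slow step (a sketch, assuming a Linux host; `iostat` comes from the sysstat package and may need to be installed):

```shell
# Per-device IO stats every 2 s. If the SSD cache drive sits near
# 100 in the %util column during the slow step, disk IO is the
# bottleneck.
iostat -xz 2

# Without sysstat, the kernel's raw per-device counters are in
# /proc/diskstats (field 13 is milliseconds spent doing IO; sample
# it twice and diff to estimate utilization).
cat /proc/diskstats
```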
@apunjani there used to be an option to set a resolution criterion for when to switch to non-uniform regularization - this seems to be gone in the latest versions, and it tries local cross validation even at, say, 8 Å. It would be handy to have this back as an option, to speed up the initial iterations of refinement for large-box-size jobs.
We are still having this issue and cannot run more than two local refinement (New) jobs or local refinement (Legacy) jobs with NU turned on, otherwise the system hangs.
I want to track down the bottleneck before purchasing a new workstation.
My particle box is typically 512 pixels, and with more than 2 jobs running, CryoSPARC spends an unusually long time (a few hours) on the Local decomposition step of local refinement (Legacy) or the Local cross validation A/B step of local refinement (New). During the GPU-bound steps, GPU usage (RTX 3090 cards) stays at 0% most of the time, and it takes a few seconds to show one heartbeat. CPU memory usage never comes close to saturation (we have 512 GB of memory).
We can simultaneously run more than 2 simpler, less memory-intensive jobs (2D class, etc) without significant slowdown.
We've already tried your trick of periodically emptying the swap. It did help, and we could run 8 local refinement (Legacy) jobs with NU turned off.
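For anyone following along, the swap-emptying step is essentially the standard one (a sketch; needs root, and is only safe when free RAM exceeds the swap currently in use, otherwise swapoff itself will thrash):

```shell
# Check how much swap is in use before clearing it.
free -h

# Disable swap (forces swapped pages back into RAM), then re-enable it.
# Needs root; only safe when free RAM > swap in use.
sudo swapoff -a && sudo swapon -a
```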