Calculate local cross validation for both half-sets in parallel instead of serially in non-uniform refinement

We experience extremely long run times in non-uniform refinement for large boxes. This is caused by long processing times for the calculation of the local cross-validation. Oddly, sets A and B are processed serially on one CPU even if more CPUs are assigned to the job.

I have two suggestions:

  1. Calculate for sets A and B in parallel instead of serially since the calculation only uses one CPU.
  2. Calculate only within the mask once a mask has been established (can be turned on/off).

Hi @martinhallberg ,

Thanks for the suggestion and report. The local cross-validation in non-uniform refinement is GPU-accelerated, so you should see that it uses one CPU core but keeps the GPU quite busy. There is one nuance: unlike other parts of CryoSPARC, this is one area that can be bottlenecked by host-to-device (CPU->GPU) data transfers. So the type and speed of the interconnect (PCIe 3/4, x8/x16, etc.) will matter for performance.
Because of the GPU usage, it’s not trivial to run both half-sets at the same time (since for larger boxes, only the data for one half-set will fit on GPU at a time, and once the data is transferred there the GPU is fully busy with that data). We will however look into this, and see if there are possible speedups (including your suggestion related to the mask).

Thanks, @apunjani !
I am currently running this very step in a 600-pixel box NU-refinement: 1 core at 100%, CPU memory usage at 24% of 384 GB, 20-40% volatile GPU utilization, but only 3-4% GPU memory utilization (!).

The workstation is equipped with three RTX 8000 GPUs on PCIe 3.0 x16, two 2.6 GHz Xeon Gold 6240 CPUs (18 cores each), 384 GB of DDR4-2933 memory, and particles on a PCIe 3.0 NVMe drive. So an older, but not ancient, setup.

For this type of workstation (or better), running both half-sets at the same time shouldn't be a problem. Could more GPU memory also be used for a single half-set, to possibly reduce the effect of the slower-than-bleeding-edge interconnect?

@martinhallberg thanks for the additional details! We will look into this and see if we can get any speedup by having the GPU process both half-maps at the same time (when there is enough GPU RAM, as on the RTX 8000). As for using the additional memory to save interconnect time, unfortunately that probably won't work. During local CV we have to repeatedly transfer volumes to and from the GPU to compute intermediate results (maybe 100+ times), so keeping all the inputs and intermediate results on the GPU rather than transferring them wouldn't fit even on the largest GPUs.
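
As a rough back-of-the-envelope check of the memory argument, here is a small sketch. It assumes float32 volumes, a 600-voxel box as in this thread, and 100 full-size volumes that would need to stay resident; the actual data types, volume counts, and buffer sizes used inside CryoSPARC are not stated here, so treat every constant as an assumption.

```python
# Rough estimate only: dtype and volume count are assumptions,
# not CryoSPARC internals.
BOX = 600              # box size in voxels per side (from this thread)
BYTES_PER_VOXEL = 4    # assumed float32
N_VOLUMES = 100        # assumed, from "maybe 100+ times"


def volume_gb(box, bytes_per_voxel=BYTES_PER_VOXEL):
    """Size of one cubic volume in GB (1 GB = 1e9 bytes)."""
    return box ** 3 * bytes_per_voxel / 1e9


def resident_gb(box, n_volumes=N_VOLUMES):
    """Memory needed if n_volumes full-size volumes stayed on the GPU."""
    return volume_gb(box) * n_volumes


# 48 GB is the RTX 8000 mentioned above; 80 GB is a hypothetical larger card.
for capacity_gb in (48, 80):
    needed = resident_gb(BOX)
    verdict = "fits" if needed <= capacity_gb else "does not fit"
    print(f"{needed:.1f} GB needed vs {capacity_gb} GB available: {verdict}")
```

Under these assumptions a single 600-voxel volume is under 1 GB, which is consistent with the low GPU memory usage reported above, while keeping on the order of 100 such volumes resident would exceed the memory of a 48 GB card.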

Thanks, @apunjani !
This is a good argument for moving to a PCIe 4 system with roughly double the bus speed. Furthermore, since the single active core seems to be rate-limiting, it might be advantageous, also for CryoSPARC in general, to go for a frequency-optimized CPU with slightly fewer cores.
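
To put a rough number on the "roughly double" figure, here is a small sketch that computes the theoretical peak x16 bandwidth for PCIe 3.0 and 4.0 from their per-lane signalling rates (8 and 16 GT/s with 128b/130b encoding) and the resulting time to move one volume. The 600-voxel float32 volume is an assumption, and real host-to-device throughput is usually below the theoretical peak, so these are lower bounds on transfer time, not measurements.

```python
# Theoretical peak only; measured host-to-device rates are typically lower.
VOLUME_GB = 600 ** 3 * 4 / 1e9   # assumed: one 600^3 float32 volume


def pcie_gb_per_s(gt_per_s, lanes=16):
    """Peak one-direction bandwidth in GB/s for PCIe 3.0 or later.

    Each lane carries gt_per_s gigatransfers/s (1 bit each), with
    128b/130b line encoding; divide by 8 to convert bits to bytes.
    """
    return gt_per_s * (128 / 130) / 8 * lanes


def transfer_ms(size_gb, gb_per_s):
    """Time in milliseconds to move size_gb at gb_per_s."""
    return size_gb / gb_per_s * 1e3


for name, rate in (("PCIe 3.0 x16", 8.0), ("PCIe 4.0 x16", 16.0)):
    bw = pcie_gb_per_s(rate)
    print(f"{name}: {bw:.2f} GB/s peak, "
          f"{transfer_ms(VOLUME_GB, bw):.1f} ms per volume")
```

The ratio between the two generations is exactly two at the same lane count, so any part of the run time that is truly transfer-bound would at best halve; time spent on the single busy CPU core would not change.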