Unfortunately yes, the local decomposition step in legacy non-uniform refinement (or legacy local refinement with NU-refinement enabled) is quite slow with larger boxes and can't be sped up. But the implementation in the new non-uniform refinement and local refinement jobs is much improved and should be a lot faster (a few minutes in most cases).
Hi, @apunjani, thank you so much for the explanation!
Please see my other post below. For some reason I consistently get ~1 Å worse resolution using the new local refinement compared to the legacy one, and overall the new job runs even slower than the legacy one. Any suggestions would be greatly appreciated!
The NU-refinement in the new local refinement is indeed faster; however, for a big 512 box, computing the "Local cross validation A/B" takes ~1 h per iteration. Is it possible to skip or speed up this step?
For a single workstation with 8x RTX 3090 GPU cards and plenty of CPU cores and memory, I can run at most 2 local refinement jobs (512 box) simultaneously; otherwise all the jobs run much slower. Any suggestions? Thanks!
Unfortunately the local cross validation step is the crux of non-uniform refinement - without it the refinement is just a standard homogeneous refinement! If possible, while the job is running and on this step, can you check that your system is not swapping CPU RAM (you can use htop to watch whether the memory gets full and starts to swap) and whether the GPUs are continuously active (you can watch by checking nvidia-smi repeatedly)?
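To make the checks above concrete, a simple way to watch both at once (standard Linux commands; the exact nvidia-smi fields below are one reasonable choice, not the only one):

```shell
# Watch CPU RAM and swap every 2 s (the same numbers htop reports).
# If the "Swap" used value keeps climbing while the job runs, the
# system is swapping.
watch -n 2 free -h

# In a second terminal, watch GPU activity. During the cross validation
# step the GPUs should show bursts of utilization, not a constant 0 %.
watch -n 2 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
```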
This behaviour typically means that the system is swapping or is bottlenecked by a non-GPU system resource (CPU memory bandwidth, IO bandwidth from the SSD caches, etc). htop is a good tool to watch memory usage during processing and see if the system CPU RAM is getting full. You can also test whether memory is the culprit by checking whether more than 2 simpler, less memory-intensive jobs (2D class, etc) can run concurrently without slowing down.
Hi @apunjani Thank you so much for the suggestions!
After looking at the system together with our IT expert, I think we can rule out the GPUs or RAM as the bottleneck, so disk IO seems the most likely cause.
For a local refinement job, will disk IO be more intense for a large box size?
I did use the “Cache particle images on SSD” option though.
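One way to confirm the disk-IO hypothesis while the job is on the slow step (a sketch, assuming a Linux host; `iostat` comes from the sysstat package and may need to be installed):

```shell
# Per-device IO stats every 2 s. If the SSD cache drive sits near
# 100 in the %util column during the slow step, disk IO is the
# bottleneck.
iostat -xz 2

# Without sysstat, the kernel's raw per-device counters are in
# /proc/diskstats (field 13 is milliseconds spent doing IO; sample
# it twice and diff to estimate utilization).
cat /proc/diskstats
```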
@apunjani there used to be an option to set a resolution criterion for when to switch to non-uniform regularization - this seems to be gone in the latest versions, and it tries local cross validation even at, say, 8 Å. It would be handy to have this back as an option, to speed up the initial iterations of refinement for large-box-size jobs.
We are still having this issue and cannot run more than two local refinement (New) jobs or local refinement (Legacy) jobs with NU turned on, otherwise the system hangs.
I want to track down the bottleneck before purchasing a new workstation.
My particle box is typically 512 pixels, and with more than 2 jobs running, CryoSPARC spends an unusually long time (a few hours) on the Local decomposition step of local refinement (Legacy) or the Local cross validation A/B step of local refinement (New). During the GPU-bound steps, GPU usage (RTX 3090 cards) stays at 0% most of the time, and it takes a few seconds to show one heartbeat. CPU memory usage never comes close to saturation (we have 512 GB of memory).
We can simultaneously run more than 2 simpler, less memory-intensive jobs (2D class, etc) without significant slowdown.
We've already tried your trick of periodically emptying the swap. It did help, and we could run 8 local refinement (Legacy) jobs with NU turned off.
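For anyone following along, the swap-emptying step is essentially the standard one (a sketch; needs root, and is only safe when free RAM exceeds the swap currently in use, otherwise swapoff itself will thrash):

```shell
# Check how much swap is in use before clearing it.
free -h

# Disable swap (forces swapped pages back into RAM), then re-enable it.
# Needs root; only safe when free RAM > swap in use.
sudo swapoff -a && sudo swapon -a
```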