Reference based motion correction - resource usage

Hello Folks,

We are trying to run a reference-based motion correction job on our cluster with the following parameters:

44548 motion-corrected exposures and 134725 particles, along with their corresponding volume, as inputs, and the following compute resources from the CryoSPARC GUI:

The job was submitted to a cluster lane requesting 10 CPUs, 250 GB RAM, and one H100 card with 80 GB VRAM. The job progressed up to 70% and was then killed due to the 48-hour time limit on the cluster.

We see that GPU utilization is very low (an inefficient use of the resources) and sporadic. The only way to make the job complete within the 48-hour time limit is to run on multiple GPUs, but that just ties up VRAM on those GPUs while utilization remains low and sporadic.

What is the best way to run a reference-based motion correction job? Any suggestions, please?

Thanks,

Asif

When does the 48-hour deadline hit? If it’s during parameter estimation, there isn’t much you can do beyond talking to your IT people and asking for an exemption/extension (if possible). If it’s during the actual particle motion correction (i.e., you’re seeing RBMC’d particles being output with little motion trajectory plots), then a solution would be:

  1. Do the first two steps of RBMC with the full stack to estimate hyperparameters and dose weights.
  2. Split the exposures into 4 or 5 sets.
  3. Do the last RBMC step with the same hyperparameter/dose-weight set and particle stacks, but feed it one of the exposure subsets.
  4. Repeat (3) with the other exposure subsets.
  5. Recombine the particle sets as appropriate for further refinement.
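The split in step 2 can be scripted outside CryoSPARC before building the exposure subsets. A minimal sketch in plain Python (the function name and the use of integer IDs are illustrative, not part of any CryoSPARC API) that divides an exposure list into roughly equal subsets:

```python
def split_exposures(exposure_ids, n_subsets):
    """Divide a list of exposure IDs into n roughly equal subsets.

    Subset sizes differ by at most one, so each cloned RBMC job
    gets a comparable share of the work (and similar runtime).
    """
    base, extra = divmod(len(exposure_ids), n_subsets)
    subsets = []
    start = 0
    for i in range(n_subsets):
        # The first `extra` subsets take one additional exposure each.
        size = base + (1 if i < extra else 0)
        subsets.append(exposure_ids[start:start + size])
        start += size
    return subsets

# Example: the 44548 exposures from this thread, into 5 subsets
subsets = split_exposures(list(range(44548)), 5)
print([len(s) for s in subsets])  # → [8910, 8910, 8910, 8909, 8909]
```

In practice you would do the equivalent split inside CryoSPARC (e.g., with an exposure-set split tool or curation job) rather than by hand, but the sizing logic is the same.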

Dear @rbs_sci ,

In this case, the job finished 70% of the final particle motion correction step. I do have a few questions.

  1. Does using more GPU cards help accelerate the parameter estimation step, the final particle motion correction step, or both?
  2. Based on your description of the alternative workflow, do you think requesting more cards would be more time-efficient? It seems repeating those steps multiple times would be very time-consuming.
  3. To estimate the hyperparameters and dose weights, do you stop the job somewhere once they are done (e.g., before the final particle motion correction step), even though the job is still running?

Thank you very much.

Bryan

More GPUs definitely accelerate the particle motion correction step, but speed will also depend on the storage holding your raw data. Splitting the job up into stages is not really any less efficient: hyperparameters and dose weighting are still estimated only once, and motion correction is still carried out only once per particle. It just means there are more job cards in the workflow. Right click → Clone Job is your friend there. :wink: Just remember to change the exposures input on each clone.

RBMC has an option to control how far a job proceeds: you can stop after any one of the three steps (hyperparameter estimation, dose weighting estimation, and particle motion correction). If you follow the layout I originally described, it should not be appreciably slower than doing it “all in one” (and the job won’t fail when your cluster manager kills it). :slight_smile:

I endorse rbs_sci’s suggestions as probably the quickest way to solve your problem.

It is useful feedback that, on a modern H100 machine, the balance of CPU to GPU work is off. While the current state of things was to some extent dictated by practical, non-technical project constraints, there was also a deliberate attempt to design resource usage around the machines we had available at the time.

If your cluster manager actually enforces CPU limits by CPU pinning, then I suspect you’ll see some benefit from increasing the number of CPUs you request (though I don’t want to oversell expectations).
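For concreteness, if the cluster lane happens to use SLURM, the resource requests discussed in this thread map onto standard directives in the lane’s submission script template. A sketch (partition name, GPU type string, and exact values are site-specific assumptions; your template will differ):

```shell
#!/bin/bash
#SBATCH --job-name=rbmc_subset
#SBATCH --partition=gpu          # site-specific partition name (assumed)
#SBATCH --gres=gpu:h100:2        # e.g. two H100s instead of one
#SBATCH --cpus-per-task=20       # more CPUs may help if the scheduler pins them
#SBATCH --mem=250G
#SBATCH --time=48:00:00          # the hard wall-time limit being hit
```

Whether raising `--cpus-per-task` helps depends on whether the scheduler enforces the limit via pinning, as noted above.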

What CPU model are you using?