RBMC Empirical dose weights stalling

Hey,

I’m trying to run RBMC on a ribosome dataset. Nothing special about the dataset.
The hyperparameter search works perfectly, but when it reaches the dose weight step it just stalls forever.
It starts on the first 5-6 micrographs (in a few minutes) and then does nothing (I let it run for up to two hours), without any error message.

This is how it looks; the progress bar just doesn’t move:

[2023-12-12 22:54:06.05]
[CPU: 2.27 GB]

    STARTING: COMPUTE EMPIRICAL DOSE WEIGHTS

[2023-12-12 22:54:06.06]
[CPU: 2.27 GB]
Using hyperparameters:
Spatial prior strength: 3.6279e-03
Spatial correlation distance: 500
Acceleration prior strength: 2.6015e-02

[2023-12-12 22:54:06.06]
[CPU: 2.27 GB]
Using all FCs for doseweighting

[2023-12-12 22:54:06.15]
[CPU: 2.28 GB]
Working with 320 movies containing 20046 particles

[2023-12-12 22:54:07.55]
[CPU: 3.08 GB]
Movies processed:
[▇-------------------------------------------------------------------------------] 6/320 (2%)

At first I thought it was because the particles come from 2 sets of micrographs with different doses, so I divided everything into 2 jobs and ran just the hyperparameter search step on each micrograph set. This again worked. I then launched the RBMC jobs with the previously calculated hyperparameters, but again dose weighting starts and then nothing happens.

Any idea why?

Thanks in advance

Could you please post a screenshot of htop, taken while an RBMC job is stalled in this way?
You may want to

Thank you for your answer.
However, I don’t think I can access this information easily, since I am submitting these jobs to the university’s high performance computing cluster, not a local machine.

Do you think changing the GPU oversubscription memory threshold (GB) and In-memory cache size parameters would help?

OK, just to update.

The NVIDIA drivers were updated on our RTX8000 GPU node. On this node it works, but not on the A100 GPU node (the one I was using originally - NVIDIA drivers were already up to date).

I am still not sure why it’s not working on A100s, but at least it’s working with the other one …

You may want to ask your IT support about:

  • RAM on the A100 node. What is the output of the command
    free -g
    
  • The current setting for transparent_hugepage
    cat /sys/kernel/mm/transparent_hugepage/enabled
    
    They may want to try setting this to never.

You may also try raising the oversubscription threshold to a value that prevents oversubscription on the A100. Are the A100s of the 40GB or the 80GB variety?
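
If interactive access to the node is not possible, the checks above could be collected non-interactively, for example from inside a batch job running on the A100 node. The following is only a rough sketch: the output file name and the helper names thp_active and collect_diagnostics are made up for illustration, how the script gets onto the node depends on your scheduler, and the batch-mode top snapshot is just a stand-in for an htop screenshot.

```shell
#!/bin/sh
# Diagnostic sketch (an illustration, not an official tool).
# Run it on the GPU node where the RBMC job stalls.

# Print the active transparent_hugepage mode, i.e. the bracketed word
# in a string such as "always [madvise] never".
thp_active() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

collect_diagnostics() {
    echo "== host: $(hostname)  date: $(date) =="

    echo "== RAM in GB (free -g) =="
    free -g

    echo "== transparent_hugepage =="
    thp_file=/sys/kernel/mm/transparent_hugepage/enabled
    if [ -r "$thp_file" ]; then
        thp_raw=$(cat "$thp_file")
        echo "raw: $thp_raw"
        echo "active: $(thp_active "$thp_raw")"
    else
        echo "not readable: $thp_file"
    fi

    echo "== GPUs (model, total and used memory, driver) =="
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total,memory.used,driver_version --format=csv
    else
        echo "nvidia-smi not found"
    fi

    echo "== process snapshot (non-interactive stand-in for htop) =="
    top -b -n 1 | head -n 40
}

# Hypothetical output file name; attach the resulting file to your reply.
collect_diagnostics > "rbmc_diagnostics_$(hostname).txt" 2>&1
```

The memory.total column would also show whether the A100s are the 40GB or the 80GB variety. Running the script once while the job is stalled and once on the RTX8000 node, where the job works, should make the two easier to compare.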