Bizarre job scheduling issues with Reference Based Motion Correction

Hi all,

I’m running into some bizarre issues with the job scheduler when I run an RBMC job. My workstation has 4 GPUs, and up until now I’ve been able to schedule and run any combination of jobs as long as the total number of GPUs in use was <= 4; in particular, a single multi-GPU job could use all 4 GPUs at once.

However, when I started using RBMC for the first time today, I noticed that whenever an RBMC job is running, I sometimes can’t use all 4 GPUs simultaneously:

  • If RBMC uses 1 GPU: At most 1 additional single-GPU job (e.g. a refinement) can run; any further single-GPU jobs stay permanently stuck in the queue. However, any number of multi-GPU jobs (e.g. 2D classification, extraction) can run, as long as the total GPU utilization is <= 4.
  • If RBMC uses 2 GPUs: No additional single-GPU job can run. However, any number of multi-GPU jobs can run, as long as the total GPU utilization is <= 4.
  • If RBMC uses 3 GPUs: No additional single-GPU job can run. However, any number of multi-GPU jobs can run, as long as the total GPU utilization is <= 4.
  • If I build RBMC with 4 GPUs: The RBMC job itself never starts and remains stuck in the queue. However, any number of other jobs (single- and multi-GPU) can run, as long as the total GPU utilization is <= 4.

When I check the output of nvidia-smi, everything looks normal: the GPUs showing activity match the GPUs that cryoSPARC reports as in use. The issue persists even after restarting cryoSPARC.

So to summarize, my main questions/problems are:

  • When I run RBMC and a refinement job simultaneously, why can’t the combined number of GPUs in use exceed 2?
  • Why does RBMC have no problem running simultaneously with multi-GPU jobs?
  • Why can’t RBMC run using all 4 GPUs?

Information requested by the troubleshooting guidelines

CryoSPARC instance information

  • Type: single workstation, 4 GPUs
  • Software version from cryosparcm status: v4.4.1

CryoSPARC worker environment

(base) cryosparcuser@egret:/$ eval $(/home/cryosparcuser/cryosparc/cryosparc_worker env)
bash: /home/cryosparcuser/cryosparc/cryosparc_worker: Is a directory
(base) cryosparcuser@egret:/$ env | grep PATH
WINDOWPATH=2
PATH=/programs/x86_64-linux/anaconda/2022.10/bin:/home/cryosparcuser/cryosparc/cryosparc_master/bin:/home/cryosparcuser/cryosparc/cryosparc_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
(base) cryosparcuser@egret:/$ /sbin/ldconfig -p | grep -i cuda
	libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
	libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
	libcuda.so.1 (libc6) => /lib/i386-linux-gnu/libcuda.so.1
	libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
	libcuda.so (libc6) => /lib/i386-linux-gnu/libcuda.so
(base) cryosparcuser@egret:/$ uname -a
Linux egret 5.15.0-105-generic #115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
(base) cryosparcuser@egret:/$ free -g
              total        used        free      shared  buff/cache   available
Mem:            125           9           1           0         115         115
Swap:             1           0           1
(base) cryosparcuser@egret:/$ nvidia-smi
Mon Apr 29 15:39:50 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:05:00.0  On |                  N/A |
| 28%   29C    P8               9W / 180W |    331MiB /  8192MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080        Off | 00000000:06:00.0 Off |                  N/A |
| 28%   25C    P8               6W / 180W |     11MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce GTX 1080        Off | 00000000:09:00.0 Off |                  N/A |
| 27%   28C    P8               6W / 180W |     11MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce GTX 1080        Off | 00000000:0A:00.0 Off |                  N/A |
| 27%   26C    P8               6W / 180W |     11MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3364      G   /usr/lib/xorg/Xorg                           35MiB |
|    0   N/A  N/A      3748      G   /usr/lib/xorg/Xorg                          161MiB |
|    0   N/A  N/A      3874      G   /usr/bin/gnome-shell                         27MiB |
|    0   N/A  N/A      8089      G   /usr/lib/firefox/firefox                     93MiB |
|    1   N/A  N/A      3364      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      3748      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      3364      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      3748      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      3364      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      3748      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+


Hi,

How many CPU cores and how much RAM is the workstation equipped with? I wonder if it’s not necessarily the free GPUs but the nominal availability of another resource that is causing subsequent jobs to be queued.

You can tally up the sums that cryoSPARC does by looking at “Resources allocated” towards the start of the streamlog: each CPU corresponds to a core, and each RAM slot represents 8 GB.
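As a rough, made-up illustration (the slot counts below are invented; only the 8 GB-per-slot conversion matters):

# Convert a job's "Resources allocated" readout into concrete numbers.
# The slot counts here are invented for illustration.
RAM_GB_PER_SLOT = 8

allocated = {"gpu": 2, "cpu": 4, "ram_slots": 6}   # hypothetical readout
print(f"{allocated['gpu']} GPU(s), {allocated['cpu']} core(s), "
      f"{allocated['ram_slots'] * RAM_GB_PER_SLOT} GB RAM reserved")
# -> 2 GPU(s), 4 core(s), 48 GB RAM reserved

If the RAM reserved by running jobs already spans the machine’s physical RAM, a newly queued job will wait even while GPUs sit idle.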

It’s not a mechanism I use, but I believe Queueing Directly to a GPU offers some degree of manual override for circumstances where direct control is preferable.

Cheers,
Yang


Hi Yang,

The workstation has 16 CPU cores and 128 GB of RAM. Thank you for the comment about each RAM slot representing 8 GB; I wasn’t aware of that before. With 3 GPUs, RBMC is using 16 RAM slots (16 × 8 GB = 128 GB), which is all of the RAM. I didn’t know that RBMC required that much; other jobs only use a fraction of that amount.

After reading your comment, I went back to watch the RBMC tutorial video. Towards the end, the developer mentions that by default, the job sets aside 80 GB of RAM, which is also a large amount. Additionally, I found a post from 2017 describing a similar issue where refinement jobs would queue but not run:

    “CryoSPARC’s scheduler checks both GPU and RAM availability when launching jobs, and refinement jobs require the system to have at least 24GB RAM before they get launched.”

Together, all this appears to explain why the refinement jobs specifically would queue but not run: RBMC was already reserving nearly all of my RAM, and there wasn’t enough left for the refinement jobs to launch. Thank you for your help, Yang!
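For anyone who hits this later, here’s the check I now picture the scheduler effectively performing. This is my own toy sketch with some invented numbers, not cryoSPARC’s actual code; the refinement’s 3 RAM slots follow from the 24 GB figure quoted above:

# Toy admission check: a job launches only if *every* resource fits.
TOTAL  = {"gpu": 4, "cpu": 16, "ram_slots": 16}  # my box: 16 slots x 8 GB = 128 GB
IN_USE = {"gpu": 3, "cpu": 6,  "ram_slots": 16}  # RBMC on 3 GPUs holds all RAM slots
                                                 # (CPU count here is invented)

def can_launch(request):
    return all(IN_USE[r] + request[r] <= TOTAL[r] for r in TOTAL)

refinement = {"gpu": 1, "cpu": 4, "ram_slots": 3}  # 3 slots = 24 GB
print(can_launch(refinement))  # False: a GPU is free, but no RAM slots are

Seen this way, the refinements were never waiting on a GPU; they were waiting on RAM slots that RBMC had already reserved.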

Hi,

To clarify, these are estimates that are baked into each cryoSPARC job type and may not reflect the true memory usage of the job. If you’re interested, you can poke around to see what these estimates are in cryosparc_master/cryosparc_compute/jobs/<job_type>/build.py, under the definition of recompute_resources.
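I don’t have the exact figures memorized, so treat the following only as a schematic of the kind of thing a recompute_resources definition encodes; the parameter name and every coefficient below are invented, not the actual contents of any build.py:

# Invented sketch of a static, build-time resource estimate.
def recompute_resources(params):
    num_gpus = params.get("compute_num_gpus", 1)  # hypothetical parameter name
    return {
        "gpu": num_gpus,
        "cpu": 2 * num_gpus,            # invented scaling
        "ram_slots": 5 * num_gpus + 1,  # invented scaling; 1 slot = 8 GB
    }

print(recompute_resources({"compute_num_gpus": 3}))
# -> {'gpu': 3, 'cpu': 6, 'ram_slots': 16}

The key point is that these are static estimates computed when the job is built or queued, not live measurements of what the job actually uses.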

If you were to run each routine within the RBMC workflow separately (hyperparameter search, empirical dose-weighting calculation, and motion correction), you’ll also notice that each routine applies different resource estimates.

In its current iteration, RBMC estimates and consumes a lot of memory, particularly in the first two routines, mostly because it holds the particle images extracted from each dose-fractionated movie frame in memory while it runs its calculations. This may change in later versions of the job.
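For a sense of scale, here’s a back-of-the-envelope tally of what holding per-frame particle images in RAM costs; every input below is invented for illustration:

# Rough footprint of keeping per-frame particle images in memory.
n_particles = 10_000
box_px      = 256      # particle box side, pixels
n_frames    = 40       # dose fractions per movie
bytes_px    = 4        # float32

total_gb = n_particles * box_px**2 * n_frames * bytes_px / 1e9
print(f"~{total_gb:.0f} GB")  # ~105 GB for these inputs

Even a modest dataset lands in the same ballpark as the 80 GB default mentioned earlier, so the large estimates are not unreasonable.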

Cheers,
Yang
