Hi all,
I’m running into some bizarre issues with the job scheduler when I run an RBMC job. My workstation has 4 GPUs and, up until now, I’ve been able to schedule and run any combination of jobs as long as the total number of GPUs used was <= 4. Any multi-GPU job could also use all 4 GPUs simultaneously.
However, when I started using RBMC for the first time today, I noticed that whenever an RBMC job is running, I sometimes can’t use all 4 GPUs simultaneously:
- If RBMC uses 1 GPU: At most 1 additional single-GPU job (like a refinement) can run; any further single-GPU jobs stay permanently stuck in the queue. However, any number of multi-GPU jobs (like 2D classification, extraction) can run, as long as the total GPU utilization <= 4.
- If RBMC uses 2 GPUs: No additional single-GPU job can run. However, any number of multi-GPU jobs can run, as long as the total GPU utilization <= 4.
- If RBMC uses 3 GPUs: No additional single-GPU job can run. However, any number of multi-GPU jobs can run, as long as the total GPU utilization <= 4.
- If RBMC is built with 4 GPUs: It will not run and remains stuck in the queue. However, any number of other jobs (single- and multi-GPU) can run, as long as the total GPU utilization <= 4.
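To make the pattern easier to see, here is a minimal Python sketch that tabulates the observations from the list above against what I would expect if only the GPU count mattered. The names (`OBSERVED`, `expected_single_gpu_jobs`, `shortfall`) are made up for illustration and are not taken from cryoSPARC's scheduler.

```python
# Observations transcribed from the list above (4-GPU workstation).
# Keys are the number of GPUs assigned to the RBMC job.
TOTAL_GPUS = 4

# max_single_gpu_jobs: how many additional single-GPU jobs (e.g. a refinement)
# were seen to start; None means RBMC itself never left the queue.
OBSERVED = {
    1: {"rbmc_runs": True, "max_single_gpu_jobs": 1},
    2: {"rbmc_runs": True, "max_single_gpu_jobs": 0},
    3: {"rbmc_runs": True, "max_single_gpu_jobs": 0},
    4: {"rbmc_runs": False, "max_single_gpu_jobs": None},
}


def expected_single_gpu_jobs(rbmc_gpus, total=TOTAL_GPUS):
    """Single-GPU jobs that should fit if free GPUs were the only limit."""
    return max(total - rbmc_gpus, 0)


def shortfall(rbmc_gpus):
    """Expected minus observed single-GPU jobs; None if RBMC never started."""
    obs = OBSERVED[rbmc_gpus]
    if not obs["rbmc_runs"]:
        return None
    return expected_single_gpu_jobs(rbmc_gpus) - obs["max_single_gpu_jobs"]


if __name__ == "__main__":
    for n in sorted(OBSERVED):
        print(f"RBMC on {n} GPU(s): expected {expected_single_gpu_jobs(n)}, "
              f"observed {OBSERVED[n]['max_single_gpu_jobs']}, "
              f"shortfall {shortfall(n)}")
```

The shortfall is non-zero in every case where RBMC runs, which is why I suspect something other than the raw GPU count is limiting the single-GPU jobs.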
When I check the output of nvidia-smi, everything seems normal: the number of GPUs that are running is consistent with the total number of GPUs that are being used by cryoSPARC. This issue persists even after restarting cryoSPARC.
So to summarize, my main questions/problems are:
- When I run RBMC and a refinement job simultaneously, why can’t the combined GPU utilization be higher than 2?
- Why does RBMC have no problem running simultaneously with multi-GPU jobs?
- Why can’t RBMC run using all 4 GPUs?
Information requested by the troubleshooting guidelines
CryoSPARC instance information
- Type: single workstation, 4 GPUs
- Software version from cryosparcm status: v4.4.1
CryoSPARC worker environment
(base) cryosparcuser@egret:/$ eval $(/home/cryosparcuser/cryosparc/cryosparc_worker env)
bash: /home/cryosparcuser/cryosparc/cryosparc_worker: Is a directory
(base) cryosparcuser@egret:/$ env | grep PATH
WINDOWPATH=2
PATH=/programs/x86_64-linux/anaconda/2022.10/bin:/home/cryosparcuser/cryosparc/cryosparc_master/bin:/home/cryosparcuser/cryosparc/cryosparc_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
(base) cryosparcuser@egret:/$ /sbin/ldconfig -p | grep -i cuda
libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6) => /lib/i386-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /lib/i386-linux-gnu/libcuda.so
(base) cryosparcuser@egret:/$ uname -a
Linux egret 5.15.0-105-generic #115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
(base) cryosparcuser@egret:/$ free -g
total used free shared buff/cache available
Mem: 125 9 1 0 115 115
Swap: 1 0 1
(base) cryosparcuser@egret:/$ nvidia-smi
Mon Apr 29 15:39:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 28% 29C P8 9W / 180W | 331MiB / 8192MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce GTX 1080 Off | 00000000:06:00.0 Off | N/A |
| 28% 25C P8 6W / 180W | 11MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce GTX 1080 Off | 00000000:09:00.0 Off | N/A |
| 27% 28C P8 6W / 180W | 11MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce GTX 1080 Off | 00000000:0A:00.0 Off | N/A |
| 27% 26C P8 6W / 180W | 11MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3364 G /usr/lib/xorg/Xorg 35MiB |
| 0 N/A N/A 3748 G /usr/lib/xorg/Xorg 161MiB |
| 0 N/A N/A 3874 G /usr/bin/gnome-shell 27MiB |
| 0 N/A N/A 8089 G /usr/lib/firefox/firefox 93MiB |
| 1 N/A N/A 3364 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 3748 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 3364 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 3748 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 3364 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 3748 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+