Hi,
I have a cryoSPARC v4.2.1 instance running on a node with 4x NVIDIA A40 GPUs (i.e. devices 0, 1, 2, 3). I configured it to see only GPUs 1 and 2, and this works fine most of the time.
However, when running a 3D Flex Training job, I noticed that the process sits on both GPUs 0 and 1, even though it was allocated only GPU 1:
Info from the job's startup log:
License is valid.
Launching job on lane default target worker08.cluster.bc2.ch ...
Running job on master node hostname worker08.cluster.bc2.ch
[CPU: 167.0 MB  Avail: 1988.64 GB]  Job J455 Started
[CPU: 167.1 MB  Avail: 1988.64 GB]  Master running v4.2.1, worker running v4.2.1
[CPU: 167.1 MB  Avail: 1988.64 GB]  Working in directory: /scicore/home/engel0006/GROUP/pool-engel/Cryosparc_projects/CS-mitoribo-cf/J455
[CPU: 167.1 MB  Avail: 1988.64 GB]  Running on lane default
[CPU: 167.1 MB  Avail: 1988.64 GB]  Resources allocated:
[CPU: 167.1 MB  Avail: 1988.64 GB]    Worker:  worker08.cluster.bc2.ch
[CPU: 167.1 MB  Avail: 1988.64 GB]    CPU   :  [0, 1, 2, 3]
[CPU: 167.1 MB  Avail: 1988.64 GB]    GPU   :  [1]
[CPU: 167.1 MB  Avail: 1988.64 GB]    RAM   :  [0, 1, 2, 3, 4, 5, 12, 13]
[CPU: 167.1 MB  Avail: 1988.64 GB]    SSD   :  False
[CPU: 167.1 MB  Avail: 1988.64 GB]  --------------------------------------------------------------
[CPU: 167.1 MB  Avail: 1988.64 GB]  Importing job module for job type flex_train...
[CPU: 344.6 MB  Avail: 1988.41 GB]  Job ready to run
[CPU: 344.6 MB  Avail: 1988.41 GB]  ***************************************************************
[CPU: 422.7 MB  Avail: 1988.33 GB]  ====== 3D Flex Training Model Setup =======
[CPU: 422.7 MB  Avail: 1988.33 GB]  Loading mesh...
The output from nvidia-smi:
Fri Apr 21 10:05:04 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:23:00.0 Off |                    0 |
|  0%   36C    P0    76W / 300W |    326MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   52C    P0   111W / 300W |   3830MiB / 46068MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   32C    P8    32W / 300W |     25MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   29C    P8    30W / 300W |     25MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5890      G   /usr/bin/X                         23MiB |
|    0   N/A  N/A     36522      C   python                            257MiB |
|    0   N/A  N/A     55304      G   ...nuxAMD64-Optimize/Amira3D       43MiB |
|    1   N/A  N/A      5890      G   /usr/bin/X                         22MiB |
|    1   N/A  N/A     36522      C   python                           3805MiB |
|    2   N/A  N/A      5890      G   /usr/bin/X                         22MiB |
|    3   N/A  N/A      5890      G   /usr/bin/X                         22MiB |
+-----------------------------------------------------------------------------+
So the python process with PID 36522 holds a context on both GPU 0 and GPU 1, and it is the cryoSPARC job J455 from above:
diogori 36522 543 1.8 84883072 39639236 ? Rl Apr20 3739:36 python -c import cryosparc_compute.run as run; run.run() --project P3 --job J455 --master_hostname worker08.cluster.bc2.ch --master_command_core_port 39002
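For what it's worth, the same double context shows up programmatically. Here is a minimal sketch using the NVML Python bindings (this assumes the pynvml / nvidia-ml-py package; the hard-coded PID is just the one from ps above):

```python
import pynvml

pynvml.nvmlInit()
target_pid = 36522  # the cryoSPARC job's PID, from ps above
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Compute processes correspond to the "C" type entries in nvidia-smi
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        if p.pid == target_pid:
            print(f"PID {target_pid} has a compute context on GPU {i}")
pynvml.nvmlShutdown()
```

This prints a line for both GPU 0 and GPU 1 for the job in question.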
It's true that only GPU 1 is doing the heavy lifting (the job holds just an idle 257 MiB context on GPU 0), so maybe it's not too bad, but cryoSPARC should not even know that GPU 0 exists. I reserve that GPU for other graphical work and don't want cryoSPARC jobs running there.
How does it find this GPU? Could it be related to the external modules of the 3D Flex job?
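My only guess at the mechanism, as a minimal sketch (this assumes the PyTorch stack that, as far as I understand, 3D Flex is built on, plus a hypothetical stray allocation on device 0; I have not checked the actual cryoSPARC code):

```python
import torch

# The job pins its assigned GPU, so the real work lands on cuda:1...
torch.cuda.set_device(1)
a = torch.zeros(1, device="cuda")    # uses the current device -> cuda:1

# ...but any stray call addressing an explicit or hard-coded device 0 is
# enough to create a few-hundred-MiB CUDA context there, which would match
# the idle 257 MiB python entry on GPU 0 in the nvidia-smi output above.
b = torch.zeros(1, device="cuda:0")  # hypothetical stray allocation
print(a.device, b.device)            # cuda:1 cuda:0
```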
And more importantly, can I prevent this somehow?
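The only workaround I can think of is masking the devices before the process ever initializes CUDA, e.g. via CUDA_VISIBLE_DEVICES. A sketch of the idea (the mask must be set before any CUDA-using library is imported; the indices are the physical nvidia-smi IDs, which CUDA then renumbers from 0):

```python
import os

# Hide GPUs 0 and 3 from this process entirely; only the physical GPUs 1
# and 2 remain enumerable, and CUDA renumbers them as devices 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import torch  # must be imported only after the mask is set

print(torch.cuda.device_count())      # 2
print(torch.cuda.get_device_name(0))  # physical GPU 1, now seen as cuda:0
```

But I'd rather not hand-patch worker environments if there is a supported way to do this.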
Thank you!