Adding a second GPU to the system

Hello.

We added a 2nd GPU to a workstation install. The GPU shows up fine with nvidia-smi and tests OK with GPU burn-in tools. However, I don’t see it listed in cryoSPARC; I only see the original GPU when issuing “cryosparcw gpulist”.

What is the best practice for enabling the second GPU in this case?

Thanks!

Hi @yodamoppet,

Please see:

Hi @stephan

Thanks for the info.

This command simply returns None:

root]$ cryosparcm cli "get_gpu_info()"
None

So I tried restarting cryoSPARC, but the web GUI still shows “1” for the number of GPUs under instance information.

Hi @yodamoppet,

On your GPU workstation, can you log into a shell and run the following command:
cryosparc_worker/bin/cryosparcw connect --master <master_hostname> --worker <worker_hostname> --port <base_port> --update

Once that’s done, restart cryoSPARC: cryosparcm restart

Hi @stephan,

Thanks for this reply. I tried this, but it doesn’t work as expected…

Current configuration:
resource_slots : {'CPU': [0, 1, 2, 3, 4, 5, 6, 7], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}

Final configuration:
resource_slots : {'CPU': [0, 1, 2, 3, 4, 5, 6, 7], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}

So, these both show the same GPU resource (0). But we added a second GPU, and it is detected by nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:02:00.0 Off |                  N/A |
| 35%   50C    P8    25W / 250W |     25MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:03:00.0 Off |                  N/A |
| 33%   48C    P8    29W / 250W |      1MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

So, I should see a 2nd GPU in the “final configuration” after doing the update command, but I don’t see this. I also don’t see the 2nd GPU listed after restarting the cryosparcm process.

Am I missing something?

Hi @stephan,

I’ve tried a number of things, but still can’t get this new GPU to be recognized. What else can we try?

Hi @stephan

I’ve just tried removing and reinstalling cryoSPARC, but it still only detects GPU 0:

resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}

GPU 1 is present, shows up in nvidia-smi and in /dev/nvidia*, and I am able to run other tasks on it:

nvidia-smi
Mon Aug 30 14:19:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:02:00.0 Off |                  N/A |
| 36%   52C    P8    25W / 250W |     25MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:03:00.0 Off |                  N/A |
| 34%   49C    P8    29W / 250W |      1MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2016      G   /usr/bin/X                          9MiB |
|    0   N/A  N/A      2689      G   /usr/bin/gnome-shell               14MiB |

They are both listed as valid system devices…

ls -lah /dev/nvidia*
crw-rw-rw- 1 root root          195,   0 Aug 30 14:00 /dev/nvidia0
crw-rw-rw- 1 root root          195,   1 Aug 30 14:00 /dev/nvidia1

Any thoughts on why cryoSPARC continues to be unaware of this GPU, and how we might update the config to reflect both GPU 0 and GPU 1?
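
In case it’s useful, below is the kind of quick check I can run to see what the CUDA driver API itself enumerates, independently of nvidia-smi. This is a minimal PyCUDA sketch; it assumes PyCUDA is importable in whatever Python environment it’s run from (that’s my assumption, not something specific to cryoSPARC):

# Minimal sanity-check sketch (assumes PyCUDA is installed in the Python
# environment used to run it): list the devices the CUDA driver API reports,
# independently of nvidia-smi.
import pycuda.driver as cuda

cuda.init()
print("CUDA devices visible to the driver API:", cuda.Device.count())
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print(i, dev.name(), round(dev.total_memory() / 1024**3, 1), "GiB")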

Thanks!

Hi, we would really like to get this going. Both GPUs should have been picked up by the reinstall. Any ideas what the problem could be?

Hi @jcoleman, @yodamoppet,

Can you try doing the following:
On the GPU workstation, run the command cryosparc_worker/bin/cryosparcw connect --master <master_hostname> --worker <worker_hostname> --port <base_port> --update --gpus 0,1
Then, restart cryoSPARC. If that doesn’t work, I’ll send you instructions on how to manually update the cryoSPARC database so it can queue directly to the second GPU.

Hi @stephan,

This results in the following:

cryosparcw connect --worker equinox.structbio.pitt.edu --master equinox.structbio.pitt.edu --port 61000 --update --gpus 0,1

 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker equinox.structbio.pitt.edu to command equinox.structbio.pitt.edu:61002
  Connecting as unix user cryosparcuser
  Will register using ssh string: cryosparcuser@equinox.structbio.pitt.edu
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string> 
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    equinox.structbio.pitt.edu
 ---------------------------------------------------------------
 ---------------------------------------------------------------
  Updating target equinox.structbio.pitt.edu
  Current configuration:
               cache_path :  /data/opt/cryosparc/scratch
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 12652445696, 'name': 'NVIDIA TITAN V'}]
                 hostname :  equinox.structbio.pitt.edu
                     lane :  default
             monitor_port :  None
                     name :  equinox.structbio.pitt.edu
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}
                  ssh_str :  cryosparcuser@equinox.structbio.pitt.edu
                    title :  Worker node equinox.structbio.pitt.edu
                     type :  node
          worker_bin_path :  /data/opt/cryosparc/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------
  Autodetecting available GPUs...
  Detected 1 CUDA devices.

   id           pci-bus  name
   ---------------------------------------------------------------
       0      0000:02:00.0  NVIDIA TITAN V
   ---------------------------------------------------------------
   Devices specified: 0, 1
Traceback (most recent call last):
  File "bin/connect.py", line 166, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 111, in check_gpus
    assert all([v in range(num_devs) for v in gpu_devidxs]), "Some specified devices do not exist."
AssertionError: Some specified devices do not exist.
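
Reading the traceback, it looks like the connect script’s autodetection only sees one CUDA device, so the explicit --gpus 0,1 list fails its sanity check. A rough reconstruction of what that assertion is doing, based only on the traceback above (not the actual cryoSPARC source):

# Rough reconstruction of the failing check, from the traceback above
num_devs = 1            # "Detected 1 CUDA devices."
gpu_devidxs = [0, 1]    # from --gpus 0,1
assert all([v in range(num_devs) for v in gpu_devidxs]), "Some specified devices do not exist."
# range(1) only contains 0, so device index 1 trips the assertion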

Hi @yodamoppet,

That’s really odd, especially since nvidia-smi is reporting both GPUs normally.
Try the following in cryoSPARC’s interactive Python shell (cryosparcm icli):

from cryosparc_compute.jobs import common as com
gpus_available = [0,1]
gpu_info = [
    {'id': 0, 'mem': 12652445696, 'name': 'NVIDIA TITAN V'}, 
    {'id': 1, 'mem': 12652445696, 'name': 'NVIDIA TITAN V'}
]
worker_hostname = 'equinox.structbio.pitt.edu'
targets = cli.get_scheduler_targets()
target = com.query(targets, lambda t : t['hostname'] == worker_hostname)
target['resource_slots']['GPU'] = gpus_available
cli.set_scheduler_target_property(worker_hostname, 'resource_slots', target['resource_slots']) 
cli.set_scheduler_target_property(worker_hostname, 'gpus', gpu_info) 

Once that’s done, you should be able to see the second GPU in the queue window.
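
You can also confirm the change from the same icli session; a quick check using the same cli calls as above should now show both devices:

# Quick check in the same icli session: the worker's target should now list both GPUs
target = com.query(cli.get_scheduler_targets(), lambda t : t['hostname'] == worker_hostname)
print(target['resource_slots']['GPU'])     # expect [0, 1]
print([g['id'] for g in target['gpus']])   # expect [0, 1]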

Hi @stephan

This results in the following error:

In [1]: from cryosparc_compute.jobs import common as com
   ...: gpus_available = [0,1]
   ...: gpu_info = [
   ...: {'id': 0, 'mem': 12652445696, 'name': 'NVIDIA TITAN V'},
   ...: {'id': 1, 'mem': 12652445696, 'name': 'NVIDIA TITAN V'}
   ...: ]
   ...: worker_hostname = 'equinox.structbio.pitt.edu'
   ...: targets = cli.get_scheduler_targets()
   ...: target = com.query(targets, lambda t : t['hostname'] == worker_hostname)
   ...: target['resource_slots']['GPU'] = gpus_available
   ...: set_scheduler_target_property(worker_hostname, 'resource_slots', target['resource_slots'])
   ...: set_scheduler_target_property(worker_hostname, 'gpus', gpu_info)

NameError                                 Traceback (most recent call last)
<ipython-input-1-…> in <module>
      9 target = com.query(targets, lambda t : t['hostname'] == worker_hostname)
     10 target['resource_slots']['GPU'] = gpus_available
---> 11 set_scheduler_target_property(worker_hostname, 'resource_slots', target['resource_slots'])
     12 set_scheduler_target_property(worker_hostname, 'gpus', gpu_info)

NameError: name 'set_scheduler_target_property' is not defined

Sorry, those two commands were supposed to be prefixed with “cli.”. I’ve updated my post with the change.

Very good, I now see 2 GPUs listed.

Just in case we need to reverse this, would it simply be the following:

from cryosparc_compute.jobs import common as com
gpus_available = [0]
gpu_info = [
    {'id': 0, 'mem': 12652445696, 'name': 'NVIDIA TITAN V'}
]
worker_hostname = 'equinox.structbio.pitt.edu'
targets = cli.get_scheduler_targets()
target = com.query(targets, lambda t : t['hostname'] == worker_hostname)
target['resource_slots']['GPU'] = gpus_available
cli.set_scheduler_target_property(worker_hostname, 'resource_slots', target['resource_slots'])
cli.set_scheduler_target_property(worker_hostname, 'gpus', gpu_info)

Yes, that’s correct.

Perfect. Thanks so much for the excellent advice and support!
