Hello,
I’m running the latest version of CryoSPARC (v4.5.1) on a single workstation with Rocky Linux 8.9 (Green Obsidian) and 3 GPUs: 2x RTX A4000 and 1x Quadro P620.
The master installation went without any issues. After installing the master, I updated the worker configuration so that the Quadro (GPU id 2) is not listed, to avoid using it in the default lane. That also worked well:
./cryosparcw connect --worker $(hostname) \
--master $(hostname) \
--port 39000 \
--update \
--ssdpath /scratch/users/cryosparc_cache/ \
--ssdreserve 256 \
--gpus 0,1
The command output showed:
Detected 3 CUDA devices.
id pci-bus name
---------------------------------------------------------------
0 23 NVIDIA RTX A4000
1 24 NVIDIA RTX A4000
2 101 Quadro P620
---------------------------------------------------------------
Devices specified: 0, 1
Devices 0, 1 will be enabled now.
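As a sanity check on what each framework actually enumerates on this box, I believe a quick check along these lines (run inside the worker’s Python environment; just an illustration, not the test code itself) would show it:

# Quick check of what each framework enumerates on this machine
# (illustration only; run inside the CryoSPARC worker's Python environment).
import os
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

import torch
print("PyTorch sees", torch.cuda.device_count(), "GPU(s)")
for i in range(torch.cuda.device_count()):
    print("  ", i, torch.cuda.get_device_name(i))

import tensorflow as tf
print("TensorFlow sees", len(tf.config.list_physical_devices("GPU")), "GPU(s)")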
However, when I run the “Test Worker GPUs” job with both custom options (“Tensorflow test” and “PyTorch test”) enabled, the TensorFlow test always passes, but the PyTorch test fails with this assertion error:
License is valid.
Launching job on lane default target ...
Running job on master node hostname
[CPU: 93.9 MB Avail: 712.16 GB] Job J2 Started
[CPU: 93.9 MB Avail: 712.15 GB] Master running v4.5.1, worker running v4.5.1
[CPU: 93.9 MB Avail: 712.15 GB] Working in directory: /home/cryosparcuser/cryosparc_benchmarks/CS-benchmark/J2
[CPU: 93.9 MB Avail: 712.15 GB] Running on lane default
[CPU: 93.9 MB Avail: 712.20 GB] Resources allocated:
[CPU: 93.9 MB Avail: 712.20 GB] Worker: chapelhill.cicbiogne.int
[CPU: 93.9 MB Avail: 712.19 GB] CPU : [0]
[CPU: 93.9 MB Avail: 712.19 GB] GPU : [0]
[CPU: 93.9 MB Avail: 712.18 GB] RAM : [0]
[CPU: 93.9 MB Avail: 712.18 GB] SSD : True
[CPU: 93.9 MB Avail: 712.18 GB] --------------------------------------------------------------
[CPU: 93.9 MB Avail: 712.18 GB] Importing job module for job type worker_gpu_test...
[CPU: 228.0 MB Avail: 712.12 GB] Job ready to run
[CPU: 228.0 MB Avail: 712.12 GB] ***************************************************************
[CPU: 259.1 MB Avail: 712.11 GB] Obtaining GPU info via `nvidia-smi`...
[CPU: 259.1 MB Avail: 712.11 GB] NVIDIA RTX A4000 @ 00000000:17:00.0
[CPU: 259.1 MB Avail: 712.11 GB] driver_version :550.78
[CPU: 259.1 MB Avail: 712.11 GB] persistence_mode :Disabled
[CPU: 259.1 MB Avail: 712.11 GB] power_limit :140.00
[CPU: 259.1 MB Avail: 712.11 GB] sw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] hw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] compute_mode :Default
[CPU: 259.1 MB Avail: 712.11 GB] max_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] current_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] temperature :61
[CPU: 259.1 MB Avail: 712.11 GB] gpu_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] memory_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] NVIDIA RTX A4000 @ 00000000:18:00.0
[CPU: 259.1 MB Avail: 712.11 GB] driver_version :550.78
[CPU: 259.1 MB Avail: 712.11 GB] persistence_mode :Disabled
[CPU: 259.1 MB Avail: 712.11 GB] power_limit :140.00
[CPU: 259.1 MB Avail: 712.11 GB] sw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.10 GB] hw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.10 GB] compute_mode :Default
[CPU: 259.1 MB Avail: 712.10 GB] max_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.10 GB] current_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] temperature :62
[CPU: 259.1 MB Avail: 712.11 GB] gpu_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] memory_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] Quadro P620 @ 00000000:65:00.0
[CPU: 259.1 MB Avail: 712.11 GB] driver_version :550.78
[CPU: 259.1 MB Avail: 712.11 GB] persistence_mode :Disabled
[CPU: 259.1 MB Avail: 712.11 GB] power_limit :[N/A]
[CPU: 259.1 MB Avail: 712.11 GB] sw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] hw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] compute_mode :Default
[CPU: 259.1 MB Avail: 712.11 GB] max_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] current_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] temperature :51
[CPU: 259.1 MB Avail: 712.11 GB] gpu_utilization :3
[CPU: 259.1 MB Avail: 712.11 GB] memory_utilization :3
[CPU: 351.6 MB Avail: 712.01 GB] Starting GPU test on: NVIDIA RTX A4000 @ 23
[CPU: 351.6 MB Avail: 712.01 GB] With CUDA Toolkit version: 11.8
[CPU: 399.0 MB Avail: 712.00 GB] Finished GPU test in 0.128s
[CPU: 399.0 MB Avail: 712.00 GB] Testing Tensorflow...
[CPU: 734.6 MB Avail: 711.83 GB] Tensorflow found 2 GPUs.
[CPU: 734.6 MB Avail: 711.83 GB] Tensorflow test completed in 1.483s.
[CPU: 734.6 MB Avail: 711.83 GB] Testing PyTorch...
[CPU: 979.5 MB Avail: 711.68 GB] PyTorch found 3 GPUs.
[CPU: 979.5 MB Avail: 711.68 GB] Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/instance_testing/run.py", line 211, in run_gpu_job
    assert devs == check_gpus, f"PyTorch detected {devs} of {check_gpus} GPUs."
AssertionError: PyTorch detected 3 of 2 GPUs.
It doesn’t matter whether I run the test benchmarking all GPU devices or specify a particular one; it gives the same error every time.
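If I read the traceback correctly, the assertion compares the number of GPUs PyTorch enumerates against the number of GPUs configured for this worker (2 in my case). A minimal sketch of that kind of check is below; the use of CUDA_VISIBLE_DEVICES is only my assumption about how the configured devices could be scoped, not the actual code from instance_testing/run.py:

# Minimal sketch of the failing check (my assumption, not the actual code
# in cryosparc_compute/jobs/instance_testing/run.py).
import os

# Restricting visibility to the configured devices before CUDA is initialized
# should make PyTorch's count match the worker configuration.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch

check_gpus = 2                    # GPUs configured for this worker (ids 0 and 1)
devs = torch.cuda.device_count()  # GPUs PyTorch actually enumerates
assert devs == check_gpus, f"PyTorch detected {devs} of {check_gpus} GPUs."
print("PyTorch device count matches the worker configuration.")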
Is it my configuration? Should I ignore it?
How should I proceed to solve the issue?
Thanks so much in advance!
Iker