Hello,
I’m running the latest version of CryoSPARC (v4.5.1) on a single workstation with Rocky Linux 8.9 (Green Obsidian) and 3 GPUs: 2x RTX A4000 and 1x Quadro P620.
The master installation went without any issues. After installing the master, I updated the worker configuration so that the Quadro (GPU id 2) is not listed, to avoid using it in the default lane. That also worked well:
./cryosparcw connect --worker $(hostname) \
--master $(hostname) \
--port 39000 \
--update \
--ssdpath /scratch/users/cryosparc_cache/ \
--ssdreserve 256 \
--gpus 0,1
The command output showed:
Detected 3 CUDA devices.
id pci-bus name
---------------------------------------------------------------
0 23 NVIDIA RTX A4000
1 24 NVIDIA RTX A4000
2 101 Quadro P620
---------------------------------------------------------------
Devices specified: 0, 1
Devices 0, 1 will be enabled now.
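As a sanity check on what each framework actually enumerates on this box, I believe a quick check along these lines (run inside the worker’s Python environment; just an illustration, not the test code itself) would show it:

# Quick check of what each framework enumerates on this machine
# (illustration only; run inside the CryoSPARC worker's Python environment).
import os
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

import torch
print("PyTorch sees", torch.cuda.device_count(), "GPU(s)")
for i in range(torch.cuda.device_count()):
    print("  ", i, torch.cuda.get_device_name(i))

import tensorflow as tf
print("TensorFlow sees", len(tf.config.list_physical_devices("GPU")), "GPU(s)")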
However, when I run the “Test Worker GPUs” job with both custom options (“Tensorflow test” and “PyTorch test”) enabled, the TensorFlow test always passes, but the PyTorch test fails with this assertion error:
License is valid.
Launching job on lane default target ...
Running job on master node hostname
[CPU: 93.9 MB Avail: 712.16 GB] Job J2 Started
[CPU: 93.9 MB Avail: 712.15 GB] Master running v4.5.1, worker running v4.5.1
[CPU: 93.9 MB Avail: 712.15 GB] Working in directory: /home/cryosparcuser/cryosparc_benchmarks/CS-benchmark/J2
[CPU: 93.9 MB Avail: 712.15 GB] Running on lane default
[CPU: 93.9 MB Avail: 712.20 GB] Resources allocated:
[CPU: 93.9 MB Avail: 712.20 GB] Worker: chapelhill.cicbiogne.int
[CPU: 93.9 MB Avail: 712.19 GB] CPU : [0]
[CPU: 93.9 MB Avail: 712.19 GB] GPU : [0]
[CPU: 93.9 MB Avail: 712.18 GB] RAM : [0]
[CPU: 93.9 MB Avail: 712.18 GB] SSD : True
[CPU: 93.9 MB Avail: 712.18 GB] --------------------------------------------------------------
[CPU: 93.9 MB Avail: 712.18 GB] Importing job module for job type worker_gpu_test...
[CPU: 228.0 MB Avail: 712.12 GB] Job ready to run
[CPU: 228.0 MB Avail: 712.12 GB] ***************************************************************
[CPU: 259.1 MB Avail: 712.11 GB] Obtaining GPU info via `nvidia-smi`...
[CPU: 259.1 MB Avail: 712.11 GB] NVIDIA RTX A4000 @ 00000000:17:00.0
[CPU: 259.1 MB Avail: 712.11 GB] driver_version :550.78
[CPU: 259.1 MB Avail: 712.11 GB] persistence_mode :Disabled
[CPU: 259.1 MB Avail: 712.11 GB] power_limit :140.00
[CPU: 259.1 MB Avail: 712.11 GB] sw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] hw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] compute_mode :Default
[CPU: 259.1 MB Avail: 712.11 GB] max_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] current_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] temperature :61
[CPU: 259.1 MB Avail: 712.11 GB] gpu_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] memory_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] NVIDIA RTX A4000 @ 00000000:18:00.0
[CPU: 259.1 MB Avail: 712.11 GB] driver_version :550.78
[CPU: 259.1 MB Avail: 712.11 GB] persistence_mode :Disabled
[CPU: 259.1 MB Avail: 712.11 GB] power_limit :140.00
[CPU: 259.1 MB Avail: 712.11 GB] sw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.10 GB] hw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.10 GB] compute_mode :Default
[CPU: 259.1 MB Avail: 712.10 GB] max_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.10 GB] current_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] temperature :62
[CPU: 259.1 MB Avail: 712.11 GB] gpu_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] memory_utilization :0
[CPU: 259.1 MB Avail: 712.11 GB] Quadro P620 @ 00000000:65:00.0
[CPU: 259.1 MB Avail: 712.11 GB] driver_version :550.78
[CPU: 259.1 MB Avail: 712.11 GB] persistence_mode :Disabled
[CPU: 259.1 MB Avail: 712.11 GB] power_limit :[N/A]
[CPU: 259.1 MB Avail: 712.11 GB] sw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] hw_power_limit :Not Active
[CPU: 259.1 MB Avail: 712.11 GB] compute_mode :Default
[CPU: 259.1 MB Avail: 712.11 GB] max_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] current_pcie_link_gen :3
[CPU: 259.1 MB Avail: 712.11 GB] temperature :51
[CPU: 259.1 MB Avail: 712.11 GB] gpu_utilization :3
[CPU: 259.1 MB Avail: 712.11 GB] memory_utilization :3
[CPU: 351.6 MB Avail: 712.01 GB] Starting GPU test on: NVIDIA RTX A4000 @ 23
[CPU: 351.6 MB Avail: 712.01 GB] With CUDA Toolkit version: 11.8
[CPU: 399.0 MB Avail: 712.00 GB] Finished GPU test in 0.128s
[CPU: 399.0 MB Avail: 712.00 GB] Testing Tensorflow...
[CPU: 734.6 MB Avail: 711.83 GB] Tensorflow found 2 GPUs.
[CPU: 734.6 MB Avail: 711.83 GB] Tensorflow test completed in 1.483s.
[CPU: 734.6 MB Avail: 711.83 GB] Testing PyTorch...
[CPU: 979.5 MB Avail: 711.68 GB] PyTorch found 3 GPUs.
[CPU: 979.5 MB Avail: 711.68 GB] Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/instance_testing/run.py", line 211, in run_gpu_job
    assert devs == check_gpus, f"PyTorch detected {devs} of {check_gpus} GPUs."
AssertionError: PyTorch detected 3 of 2 GPUs.
It doesn’t matter whether I run the test benchmarking all GPU devices or specify a particular one; it gives the same error every time.
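If I read the traceback correctly, the assertion compares the number of GPUs PyTorch enumerates against the number of GPUs configured for this worker (2 in my case). A minimal sketch of that kind of check is below; the use of CUDA_VISIBLE_DEVICES is only my assumption about how the configured devices could be scoped, not the actual code from instance_testing/run.py:

# Minimal sketch of the failing check (my assumption, not the actual code
# in cryosparc_compute/jobs/instance_testing/run.py).
import os

# Restricting visibility to the configured devices before CUDA is initialized
# should make PyTorch's count match the worker configuration.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch

check_gpus = 2                    # GPUs configured for this worker (ids 0 and 1)
devs = torch.cuda.device_count()  # GPUs PyTorch actually enumerates
assert devs == check_gpus, f"PyTorch detected {devs} of {check_gpus} GPUs."
print("PyTorch device count matches the worker configuration.")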
Is it my configuration? Should I ignore it?
How should I proceed to solve the issue?
Thanks so much in advance!
Iker