GPU Test and nvidia-smi

baileydo · July 19, 2023, 5:34pm

When running the GPU test to validate an install, we get an error that we traced back to newer versions of nvidia-smi having dropped or changed some of the specified options; for example, --querygpu is now --query-gpu. Driver package 530.30.02 looks like it should work. 535.54.03, which came out on June 13th, appears to have dropped some of these options and causes the GPU tests to fail. Has anyone else had this experience? Any chance it could be “fixed” in a future patch/release? Thanks

--------------------
[CPU:  208.9 MB]
Obtaining GPU info via `nvidia-smi`...

[CPU:  209.0 MB]
Traceback (most recent call last):
  File "/scratch/cluster_scratch/cryosparc/ncif-wolin-cryosparc/cryosparc_worker/cryosparc_compute/jobs/instance_testing/nvidia_smi_util.py", line 41, in run_nvidia_smi_query
    memory_use_info = output_to_list(subprocess.check_output(
  File "/scratch/cluster_scratch/cryosparc/ncif-wolin-cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/scratch/cluster_scratch/cryosparc/ncif-wolin-cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=name,pci.bus_id,driver_version,persistence_mode,power.limit,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,compute_mode,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory', '--format=csv,noheader,nounits']' returned non-zero exit status 2.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "/scratch/cluster_scratch/cryosparc/ncif-wolin-cryosparc/cryosparc_worker/cryosparc_compute/jobs/instance_testing/run.py", line 96, in run_gpu_job
    gpu_info = nvidia_smi_util.run_nvidia_smi_query(
  File "/scratch/cluster_scratch/cryosparc/ncif-wolin-cryosparc/cryosparc_worker/cryosparc_compute/jobs/instance_testing/nvidia_smi_util.py", line 44, in run_nvidia_smi_query
    raise RuntimeError(
RuntimeError: command '['nvidia-smi', '--query-gpu=name,pci.bus_id,driver_version,persistence_mode,power.limit,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,compute_mode,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory', '--format=csv,noheader,nounits']' returned with error (code 2): b'Field "clocks_throttle_reasons.sw_power_cap" is not a valid field to query.\n\n'
---------------------------------------------------

nfrasser · July 24, 2023, 3:03pm

@baileydo I’d like to see the full error message of that nvidia-smi command. Could you run this command on a GPU machine?

nvidia-smi \
    --query-gpu=name,pci.bus_id,driver_version,persistence_mode,power.limit,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,compute_mode,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory \
    --format=csv,noheader,nounits

baileydo · July 24, 2023, 6:18pm

nvidia-smi --query-gpu=name,pci.bus_id,driver_version,persistence_mode,power.limit,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,compute_mode,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory --format=csv,noheader,nounits
Tesla P100-PCIE-16GB, 00000000:37:00.0, 530.30.02, Enabled, 250.00, Not Active, Not Active, Default, 3, 3, 34, 0, 0
[ncif-wolin-cryosparc@fsitgl-hpc078p ~]$

All is working now since we set our cluster nodes back to Nvidia 530.30.02. When we were at Nvidia 535.54.03, which came out on June 13th, some of the option names had changed or been dropped. I will try to find a node I can use to get you the exact error from the command you specified.

Thanks, Doug

nfrasser · July 24, 2023, 7:02pm

Hi Doug, thanks for checking that. We’re tracking this issue internally; a future version of CryoSPARC will use the correct nvidia-smi query for the GPU Test.

rgildea · November 29, 2023, 11:09am

Is there a fix for this in the pipeline? It still appears to be broken in the latest 4.4.0 release:

Current cryoSPARC version: v4.4.0+231114

$ nvidia-smi
...
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
...

$ cat /etc/os-release 
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

$ cryosparcm test workers P1
Using project P1
Running worker tests...
2023-11-29 11:08:03,852 log                  CRITICAL | Worker test results
2023-11-29 11:08:03,852 log                  CRITICAL | hpc
2023-11-29 11:08:03,852 log                  CRITICAL |   ✓ LAUNCH
2023-11-29 11:08:03,852 log                  CRITICAL |   ✓ SSD
2023-11-29 11:08:03,852 log                  CRITICAL |   ✕ GPU
2023-11-29 11:08:03,852 log                  CRITICAL |     Error: command '['nvidia-smi', '--query-gpu=name,pci.bus_id,driver_version,persistence_mode,power.limit,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,compute_mode,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory', '--format=csv,noheader,nounits']' returned with error (code 2): b'Field "clocks_throttle_reasons.sw_power_cap" is not a valid field to query.\n\n'
2023-11-29 11:08:03,852 log                  CRITICAL |     See P1 J8 for more information

cryosparcm test workers is useful for validating that a new installation/upgrade is working as expected, so it would be helpful if this could be fixed!

wtempel · November 29, 2023, 4:45pm

I wonder whether this is an issue with version 535.54.03 of the nvidia-driver.
I did not see the error on a machine with driver version 545.23.08:

$ nvidia-smi --query-gpu=driver_version,clocks_throttle_reasons.sw_power_cap --format=csv
driver_version, clocks_event_reasons.sw_power_cap
545.23.08, Not Active
[...]

Please can you try

after upgrading the nvidia driver?

rgildea · November 30, 2023, 10:04am

I believe 535.54.03 is the version that comes with the latest Amazon Linux 2 AMI. This is on a centrally maintained cluster so upgrading the driver version is unfortunately out of my control. I don’t see any version 545.. listed here - the latest listed appears to be 535.129.03.

wtempel · November 30, 2023, 10:41pm

… which seems to work (on ubuntu-22.04.3)

$ nvidia-smi --query-gpu=name,driver_version,clocks_throttle_reasons.sw_power_cap --format=csv
name, driver_version, clocks_event_reasons.sw_power_cap
Tesla T4, 535.129.03, Not Active
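
For anyone stuck on 535.54.03 who wants to check their nodes by hand, here is a minimal sketch of a query that falls back between the two field prefixes seen in this thread. It is not CryoSPARC's code: the function names and the shortened field list are hypothetical, and it assumes that clocks_event_reasons (the name printed in the CSV header above) is also accepted as a query field by drivers that reject clocks_throttle_reasons.

```python
import subprocess

# Prefixes to try, in order. "clocks_throttle_reasons" is what the failing
# query in this thread uses; "clocks_event_reasons" is the name newer drivers
# print in the CSV header. ASSUMPTION: it is also accepted as a query field.
PREFIXES = ("clocks_throttle_reasons", "clocks_event_reasons")


def build_fields(prefix):
    """Return a shortened, illustrative field list for one prefix."""
    return [
        "name",
        "driver_version",
        f"{prefix}.sw_power_cap",
        f"{prefix}.hw_slowdown",
    ]


def query_with_fallback(runner=subprocess.run):
    """Run nvidia-smi with each prefix and return rows from the first success.

    `runner` is injectable so the logic can be exercised without a GPU.
    """
    errors = []
    for prefix in PREFIXES:
        cmd = [
            "nvidia-smi",
            "--query-gpu=" + ",".join(build_fields(prefix)),
            "--format=csv,noheader,nounits",
        ]
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            # One CSV line per GPU; split it into stripped columns.
            return [
                [col.strip() for col in line.split(",")]
                for line in result.stdout.splitlines()
                if line.strip()
            ]
        # Keep whatever message nvidia-smi produced for the final error.
        errors.append((result.stderr or result.stdout).strip())
    raise RuntimeError("nvidia-smi query failed for all prefixes: " + "; ".join(errors))


if __name__ == "__main__":
    for row in query_with_fallback():
        print(row)
```

If both prefixes are rejected, the RuntimeError carries the messages nvidia-smi returned, which should show which field name the installed driver objects to.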