Tensorflow failure in CryoSPARC Worker Test

After updating to v4, I started testing: cryosparcm test i was successful.

However, cryosparcm test w failed as shown below:

(base) user@MU-CryoEM:~$ cryosparcm test w P1
Using project P1
Running worker tests...
2022-10-04 17:34:41,399 WORKER_TEST          log                  CRITICAL | Worker test results
2022-10-04 17:34:41,400 WORKER_TEST          log                  CRITICAL | MU-CryoEM
2022-10-04 17:34:41,400 WORKER_TEST          log                  CRITICAL |   ✓ LAUNCH
2022-10-04 17:34:41,400 WORKER_TEST          log                  CRITICAL |   ✓ SSD
2022-10-04 17:34:41,400 WORKER_TEST          log                  CRITICAL |   ✕ GPU
2022-10-04 17:34:41,400 WORKER_TEST          log                  CRITICAL |     Error: Tensorflow detected 0 of 2 GPUs.
2022-10-04 17:34:41,400 WORKER_TEST          log                  CRITICAL |     See P1 J30 for more information

Here is the output from job J30:
[CPU: 337.0 MB]
Starting PyCuda GPU test on: NVIDIA RTX A6000 @ 0000:2D:00.0

[CPU: 337.0 MB]
    PyCuda was compiled with CUDA: (11, 6, 0)

[CPU: 338.2 MB]
Finished PyCuda GPU test in 0.019s

[CPU: 338.2 MB]
Testing Tensorflow...

[CPU: 552.7 MB]
    Tensorflow found 0 GPUs.

[CPU: 552.7 MB]
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
  File "/mnt/nvme/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/instance_testing/run.py", line 161, in run_gpu_job
    assert devs == total_gpus, f"Tensorflow detected {devs} of {total_gpus} GPUs."
AssertionError: Tensorflow detected 0 of 2 GPUs.

Please help with troubleshooting.

Hi @Rajiv-Singh,

Can you please send us the job error report for P1 J30? You can send the .zip file to our email at feedback@structura.bio. For more information, see: Guide: Download Error Reports - CryoSPARC Guide.

Hey @stephan,

I just sent an email with attachment.

Just as another data point, I get the same error (Tensorflow not detecting GPUs) on my systems when running cryosparcw test, but GPU-dependent jobs seem to launch and run just fine…

Hey @olibclarke,

Thanks for reporting. Tensorflow is only used in the Deep Picker jobs, which is why the other jobs work.
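As a quick check outside the test, you can ask the worker's bundled Tensorflow directly which GPUs it sees (a minimal sketch; assuming the worker install path from the log above and that your version provides the cryosparcw call helper):

# From the worker install directory, list the GPUs Tensorflow can detect
cd /mnt/nvme/cryoSPARC/cryosparc_worker
./bin/cryosparcw call python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If this prints an empty list while the PyCuda test sees the devices, it can point to a missing CUDA library (as turned out to be the case below) rather than a problem with the GPUs themselves.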

Hmm - DeepPicker used to work fine on the same systems using v3.3.2 - I haven't tested yet with v4. Will check.

Hi @Rajiv-Singh,

Looking at the joblog from your attachment, this is the error:

2022-10-04 17:33:48.550500: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /mnt/nvme/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio:/mnt/nvme/cryoSPARC/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib:/mnt/nvme/cryoSPARC/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda-11.6/lib64:/mnt/nvme/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib:/mnt/nvme/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib
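One quick way to see which libcusolver versions are actually present (assuming CUDA is installed under /usr/local/cuda-11.6, as in the LD_LIBRARY_PATH above) is:

# List the libcusolver libraries shipped with this CUDA toolkit
ls -l /usr/local/cuda-11.6/lib64/libcusolver*

# Check whether the dynamic linker cache knows about a copy anywhere else
ldconfig -p | grep libcusolver

Tensorflow here was built against a CUDA version that shipped libcusolver.so.10, while CUDA 11.6 only provides libcusolver.so.11, which is what leads to the link suggestion below.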

Hey @stephan,

How do I troubleshoot this and load the dynamic library?

Hey @Rajiv-Singh, take a look at the following post. The reason this happens may depend on how you installed CUDA.

Hey @stephan,

I checked the lib64 directory; it doesn't have libcusolver.so.10, but it does have these two: libcusolver.so.11 and libcusolver.so. Shall I create a hard link as advised:

cd /usr/local/cuda-11.6/lib64
sudo ln libcusolver.so.11 libcusolver.so.10

Kindly advise!

Hi @Rajiv-Singh,

That looks good, let me know if it works after making the link.
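Once the link is in place, a quick sanity check before re-running the full test could look like this (just a sketch, using the paths from earlier in the thread):

# Confirm the new name resolves next to the version-11 library
ls -l /usr/local/cuda-11.6/lib64/libcusolver.so.10

# Then repeat the worker test
cryosparcm test w P1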

Hey @stephan,

Yeah, this worked!

The output is as follows:

(base) user@MU-CryoEM:~$ cryosparcm test w P1
Using project P1
Running worker tests...
2022-10-05 01:36:04,200 WORKER_TEST          log                  CRITICAL | Worker test results
2022-10-05 01:36:04,200 WORKER_TEST          log                  CRITICAL | MU-CryoEM
2022-10-05 01:36:04,200 WORKER_TEST          log                  CRITICAL |   ✓ LAUNCH
2022-10-05 01:36:04,200 WORKER_TEST          log                  CRITICAL |   ✓ SSD
2022-10-05 01:36:04,200 WORKER_TEST          log                  CRITICAL |   ✓ GPU
2022-10-05 01:36:04,206 WORKER_TEST          log                  CRITICAL |     ⚠ NVIDIA RTX A6000: GPU Hardware Slowdown is Active
2022-10-05 01:36:04,206 WORKER_TEST          log                  CRITICAL |     ⚠ NVIDIA RTX A6000: GPU Hardware Slowdown is Active

Is this “GPU Hardware Slowdown is Active” a matter of concern?

Glad to see it worked @Rajiv-Singh!

Please see our guide covering the GPU test for more information on what this means.
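If you want to check the throttle state outside of CryoSPARC, nvidia-smi can report it directly (a quick check; the exact section naming varies a little between driver versions):

# Query performance/throttle information for every GPU
nvidia-smi -q -d PERFORMANCE

# Or just filter for the slowdown flags
nvidia-smi -q -d PERFORMANCE | grep -i slowdown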

Hello,

We have 4 x RTX 2080 Ti for computing and 1 x GeForce GT 1030 for running the display in one of our workstations.

At first, the GPU test failed with 0 Tensorflow GPUs detected. After creating a link for libcusolver.so.10 in lib64 as mentioned above, the four 2080s are at least being detected by Tensorflow.

However, it seems that if not all GPUs are detected by Tensorflow, the Tensorflow GPU test fails.

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
  File "/home/user/software/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/instance_testing/run.py", line 161, in run_gpu_job
    assert devs == total_gpus, f"Tensorflow detected {devs} of {total_gpus} GPUs."
AssertionError: Tensorflow detected 4 of 5 GPUs.

FWIW, cryoSPARC lists only the 4 computing GPUs in the processing lane. It’d be great to be able to run the test just with the Tensorflow GPUs or skip the Tensorflow test for GPU(s) without Tensorflow capabilities.

Best,
Kookjoo

Hi @kookjookeem,

Thank you very much for reporting this issue.
In the upcoming release of CryoSPARC, v4.0.1, testing Tensorflow will be optional through a command-line flag: --test-tensorflow. In addition, Tensorflow capabilities will only be checked on GPUs that have been registered with CryoSPARC. This will ensure the test doesn't fail if, for example, Tensorflow fails to start on a display GPU.
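Once v4.0.1 is available, the invocation would look something like this (a sketch based on the flag name above; please check the CryoSPARC guide for the exact syntax in your version):

# Worker test with the optional Tensorflow check enabled
cryosparcm test w P1 --test-tensorflow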

@kookjookeem This has been implemented in v4.0.1
