Hi,
Yes, other GPU-accelerated jobs run fine now.
The error is identical to the one in the original post in this thread:
> [2023-02-17 14:11:16.09] [CPU: 169.4 MB] CPU : [0]
> [2023-02-17 14:11:16.09] [CPU: 169.4 MB] GPU : [0]
> [2023-02-17 14:11:16.09] [CPU: 169.4 MB] RAM : [0, 1]
> [2023-02-17 14:11:16.10] [CPU: 169.4 MB] SSD : False
> [2023-02-17 14:11:16.10] [CPU: 169.4 MB] --------------------------------------------------------------
> [2023-02-17 14:11:16.10] [CPU: 169.4 MB] Importing job module for job type deep_picker_train...
> [2023-02-17 14:11:33.00] [CPU: 387.4 MB] Job ready to run
> [2023-02-17 14:11:33.00] [CPU: 387.4 MB] ***************************************************************
> [2023-02-17 14:11:33.77] [CPU: 447.9 MB] Using TensorFlow version 2.4.4
> [2023-02-17 14:11:33.91] [CPU: 473.7 MB] Traceback (most recent call last):
>   File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
>   File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
> AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.
`cryosparcm joblog` output for the same job:
....
'tpl_vars': ['job_dir_abs', 'cluster_job_id', 'num_cpu', 'run_args', 'cryosparc_username', 'job_log_path_abs', 'ram_gb', 'worker_bin_path', 'job_uid', 'run_cmd', 'command', 'project_uid', 'project_dir_abs', 'job_creator', 'num_gpu'], 'type': 'cluster', 'worker_bin_path': '/path/to//cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.
set status to failed
========= main process now complete.
========= monitor process now complete.
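If I read the message correctly, the failing check presumably boils down to something like the sketch below (my own guess at the logic, not the actual CryoSPARC code; using TensorFlow to count the available GPUs is my assumption):

```python
# Sketch of the kind of check that could raise this AssertionError.
# NOT CryoSPARC's actual code; tf.config.list_physical_devices('GPU') is my
# assumption about how "available GPUs" are counted on the worker.
import tensorflow as tf

def check_gpu_request(num_gpus_requested: int) -> None:
    available = tf.config.list_physical_devices('GPU')  # GPUs TensorFlow can see
    assert num_gpus_requested <= len(available), (
        "Input number of GPUs must be less than or equal to number of available GPUs."
    )

check_gpu_request(1)  # would fail like my job if TensorFlow sees zero GPUs
```

So the question seems to be why TensorFlow would report zero GPUs on a node where the GPU is clearly present.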
I logged into the worker node and here is the info:
For `env | grep PATH`, I replaced some paths with ‘XYZ’ since they contain cluster-specific paths that could identify the cluster. I can send the unredacted output privately if it matters.
env | grep PATH
LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:/path/to/cryosparc/cryosparc_worker/deps/external/cudnn/lib
CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.6
__LMOD_REF_COUNT_MODULEPATH=/XYZ/modules/el7/modules/all:2;/etc/modulefiles:1;/usr/share/modulefiles:1;/usr/share/modulefiles/Linux:1;/usr/share/modulefiles/Core:1;/usr/share/lmod/lmod/modulefiles/Core:1
CRYOSPARC_PATH=/path/to/cryosparc/cryosparc_worker/bin
PYTHONPATH=/path/to/cryosparc/cryosparc_worker
MANPATH=/usr/share/lmod/lmod/share/man::/XYZ/share/man
MODULEPATH=/XYZ/modules/el7/modules/all:/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core
MODULEPATH_ROOT=/usr/share/modulefiles
PATH=/usr/local/cuda-11.6/bin:/path/to/cryosparc/cryosparc_worker/bin:/path/to/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/path/to/cryosparc/cryosparc_worker/deps/anaconda/condabin:/XYZ/modules/el7/software/Anaconda3/2020.11/condabin:/path/to/cryosparc/cryosparc_master/bin:/XYZ/home/cryosparcuser/.local/bin:/XYZc/home/cryosparcuser/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin
which nvcc
/usr/local/cuda-11.6/bin/nvcc
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(11, 6, 0)
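If useful, I can also check how many devices pycuda itself sees from this environment with something like the following (standard pycuda.driver API; I have not pasted its output here):

```python
# How many CUDA devices pycuda can see from the worker environment
# (suggestion only; output not shown above).
import pycuda.driver as cuda

cuda.init()                 # initialize the CUDA driver API
print(cuda.Device.count())  # number of devices visible to this process
```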
uname -a
Linux node.name 5.15.82-1.el8.name.x86_64 #1 SMP Tue Dec 13 15:02:32 CET 2022 x86_64 x86_64 x86_64 GNU/Linux
free -g
total used free shared buff/cache available
Mem: 125 7 91 0 27 116
Swap: 7 0 7
nvidia-smi
Mon Mar 6 22:02:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... On | 00000000:41:00.0 Off | 0 |
| N/A 31C P0 36W / 250W | 962MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 619708 C python 958MiB |
+-----------------------------------------------------------------------------+
Just a thought: could this be connected to TensorFlow?
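If it is, a quick test would be to check what TensorFlow itself can enumerate from the cryosparc_worker Python environment on the node (a sketch of what I would run; the exact way to activate that environment may differ):

```python
# Run inside the cryosparc_worker Python environment on the worker node.
# Prints the TensorFlow version and the GPUs it can enumerate; an empty
# list would be consistent with the "available GPUs" assertion above.
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
```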