DeepPicker error: Input number of GPUs must be less than or equal to number of available GPUs

Hi everyone,

I tried using DeepPicker for particle picking: I selected some particles with the template picker and used them, together with their micrographs, as input. However, with either the default settings or with both the number of parallel threads and the number of GPUs set to 1, I get the following error:

[CPU: 512.9 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.

The beginning of the stdout (with number of parallel threads = 1 and number of GPUs = 1) is:

[CPU: 80.7 MB]   Project P3 Job J170 Started
[CPU: 80.7 MB]   Master running v3.2.0, worker running v3.2.0
[CPU: 80.9 MB]   Running on lane default
[CPU: 80.9 MB]   Resources allocated: 
[CPU: 80.9 MB]     Worker:  cmm-1
[CPU: 80.9 MB]     CPU   :  [0]
[CPU: 80.9 MB]     GPU   :  [0]
[CPU: 80.9 MB]     RAM   :  [0, 1]
[CPU: 80.9 MB]     SSD   :  False
[CPU: 80.9 MB]   --------------------------------------------------------------
[CPU: 80.9 MB]   Importing job module for job type deep_picker_train...
[CPU: 437.9 MB]  Job ready to run
[CPU: 438.1 MB]  ***************************************************************
[CPU: 592.5 MB]  Using TensorFlow version 2.4.1

I tried submitting the job both to the general lane and to specific GPUs.
As for hardware, I have four RTX 2080 Ti GPUs. Software: nvcc --version reports 10.2; CryoSPARC 3.2.

Hey @marinegor,

Can you send the job log for this job? cryosparcm joblog P3 J170

There is no such file:

marin@cmm-1:~$ cryosparcm joblog P3 J170
/data/cryosparc_projects/P3/J170/job.log: No such file or directory

The folder structure confirms it:

marin@cmm-1:~$ tree /data/cryosparc_projects/P3/J170/ 
/data/cryosparc_projects/P3/J170/ 
├── events.bson                  
├── gridfs_data                  
└── job.json                          

1 directory, 2 files     

Hi @marinegor,

Can you try clearing the job, then running the job again?
Once it fails, you can try running the joblog command again to see if anything shows up. Can you also send me the output of nvidia-smi?

nvidia-smi:

Job log (obtained with cryosparcm joblog P3 J215 > log.txt):

================= CRYOSPARCW =======  2021-05-19 18:18:28.238449  =========
Project P3 Job J215
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 2473
MAIN PID 2473
deep_picker.run_deep_picker cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
2021-05-19 18:18:34.832676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
2021-05-19 18:19:22.589079: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-19 18:19:22.621872: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-19 18:19:22.670988: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.671739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.671886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.672556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.672628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.673253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties: 
pciBusID: 0000:03:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.673331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.673948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties: 
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.674004: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-19 18:19:22.853563: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-19 18:19:22.853750: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-19 18:19:22.871436: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-19 18:19:22.871843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-19 18:19:23.108518: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-19 18:19:23.108983: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/cryosparc/cryosparc_worker/cryosparc_compute/blobio:/opt/cryosparc/cryosparc_worker/cryosparc_compute/libs:/opt/cryosparc/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda/lib64:/opt/cryosparc/cryosparc_master/cryosparc_compute/blobio:/opt/cryosparc/cryosparc_master/cryosparc_compute/libs
2021-05-19 18:19:23.131132: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-19 18:19:23.131198: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
***************************************************************
Running job on hostname %s cmm-1
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'cmm-1', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [2], 'GPU': [1], 'RAM': [5, 6]}, 'target': {'cache_path': '/data/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'cmm-1', 'lane': 'default', 'monitor_port': None, 'name': 'cmm-1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}, 'ssh_str': 'cryosparcuser@cmm-1', 'title': 'Worker node cmm-1', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.
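
A log this long is easier to scan programmatically. The following is a minimal sketch, not part of CryoSPARC, that pulls out the libraries TensorFlow reported it could not load. The function name `find_missing_libraries` is hypothetical, and the pattern assumes the warning wording shown in the log above.

```python
import re

# Matches the TensorFlow dso_loader warning wording seen in the job log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def find_missing_libraries(log_text):
    """Return library names TensorFlow failed to load, in order, without duplicates."""
    seen = []
    for match in MISSING_LIB.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Two lines shortened from the job log above: one successful load, one failure.
sample = (
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:49] "
    "Successfully opened dynamic library libcusolver.so.10\n"
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcusparse.so.11'; "
    "dlerror: libcusparse.so.11: cannot open shared object file\n"
)
print(find_missing_libraries(sample))  # ['libcusparse.so.11']
```

To use it on a real job, save the output of cryosparcm joblog to a file (as done above with log.txt), read the file, and pass its contents to the function.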

Hi @marinegor,

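Looks like this is your problem: the job log shows that TensorFlow could not load libcusparse.so.11, so it skipped registering the GPU devices.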

Can you check whether the file exists inside /usr/local/cuda/lib64, and make sure /usr/local/cuda is pointing to CUDA 11.0?
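
Both checks can be done in one step. This is a rough sketch, assuming the default /usr/local/cuda location and the libcusparse.so.11 file name from the job log; the helper `check_cuda_library` is a hypothetical name, not a CryoSPARC or CUDA tool.

```python
import os


def check_cuda_library(cuda_home, library):
    """Return (real path of cuda_home, whether cuda_home/lib64 contains library)."""
    resolved = os.path.realpath(cuda_home)  # follows the symlink, if cuda_home is one
    candidate = os.path.join(cuda_home, "lib64", library)
    return resolved, os.path.exists(candidate)


# Paths and file name taken from the job log; adjust them for other installations.
resolved, found = check_cuda_library("/usr/local/cuda", "libcusparse.so.11")
print("CUDA directory resolves to:", resolved)
print("libcusparse.so.11 present:", found)
```

If the resolved directory is a CUDA 10.x installation, the file will most likely be reported as absent.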


OK, that makes sense: I have CUDA 10.2 installed, so libcusparse.so.11 isn’t there.

Should I indeed upgrade to 11.0 for DeepPicker to work, or should it pick up the installed version somehow?

Hi @marinegor,

You can update the version of CUDA that cryoSPARC uses by running the cryosparcw newcuda command:
https://guide.cryosparc.com/setup-configuration-and-management/management-and-monitoring/cryosparcw#cryosparcw-newcuda-less-than-path-greater-than

Thanks, it’s running smoothly now :+1:


OK, it’s indeed running (meaning it reaches the 'running' state), but it still cannot progress with training.
Namely, I get the following error:

[CPU: 624.6 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 255, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: No particles corresponding to input micrographs were found. Ensure that non-zero particle picks were input and that the particle picks are from the input micrographs.

This happens although I submit the same particles and micrographs as I did earlier for Topaz, which trains with them successfully.
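
The assertion message suggests that the picks are matched to the input micrographs in some way. How CryoSPARC does that internally is not shown in this thread, so the following is only an illustrative sketch with hypothetical integer IDs and a hypothetical helper name. It shows the kind of overlap check implied by the message.

```python
def picks_matching_micrographs(particle_mic_ids, input_mic_ids):
    """Count particles whose micrograph ID is among the input micrographs."""
    inputs = set(input_mic_ids)
    return sum(1 for mic_id in particle_mic_ids if mic_id in inputs)


# Hypothetical IDs: three picks, only two of which come from an input micrograph.
particle_mic_ids = [101, 101, 205]
input_mic_ids = [101, 102]
print(picks_matching_micrographs(particle_mic_ids, input_mic_ids))  # 2
```

A count of zero would correspond to the situation described by the error message.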

Hi @stephan,

I recently encountered the same issue. I have the latest CUDA (11.3) and updated as you suggested here, but I still run into the same error as reported above:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.

Any idea what else might cause the issue?

The job log seems to indicate the same kind of issue, namely:
“could not load dynamic library 'libcusolver.so.10'”

EDIT: I found another post where you suggested making a hard link for libcusolver.so.10 in the cuda-11.3 folder. That seems to have solved the problem. Thanks :slight_smile:

Best,
Omid
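
For readers who want to reproduce the hard-link workaround mentioned above, here is a hedged sketch. It assumes that the library CUDA 11.3 actually ships is named libcusolver.so.11 and that it lives in a directory such as /usr/local/cuda-11.3/lib64; neither detail is stated in this thread, so verify both before running it against a real installation. The helper `link_library_alias` is a hypothetical name, and the demo below runs in a temporary directory only.

```python
import os
import tempfile


def link_library_alias(lib_dir, existing, alias):
    """Hard-link `alias` to `existing` inside lib_dir; return False if alias already exists."""
    source = os.path.realpath(os.path.join(lib_dir, existing))  # resolve symlinks first
    target = os.path.join(lib_dir, alias)
    if os.path.lexists(target):
        return False  # leave any existing file or link untouched
    os.link(source, target)
    return True


# Safe demo in a temporary directory with an empty stand-in file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "libcusolver.so.11"), "w").close()
created = link_library_alias(demo_dir, "libcusolver.so.11", "libcusolver.so.10")
print("link created:", created)

# On a real system (assumed path, usually requires root):
# link_library_alias("/usr/local/cuda-11.3/lib64", "libcusolver.so.11", "libcusolver.so.10")
```

The function refuses to overwrite an existing file, so running it twice is harmless.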
