DeepPicker error: Input number of GPUs must be less than or equal to number of available GPUs

marinegor · May 18, 2021, 8:03pm

Hi everyone,

I tried using DeepPicker for particle picking: I selected some particles with the template picker and used them, together with their micrographs, as input. However, with either the default settings or with the number of parallel threads set to 1 and the number of GPUs set to 1, I get the following error:

[CPU: 512.9 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.

The beginning of the stdout is (for number of parallel threads = 1, number of GPUs = 1):

[CPU: 80.7 MB]   Project P3 Job J170 Started
[CPU: 80.7 MB]   Master running v3.2.0, worker running v3.2.0
[CPU: 80.9 MB]   Running on lane default
[CPU: 80.9 MB]   Resources allocated: 
[CPU: 80.9 MB]     Worker:  cmm-1
[CPU: 80.9 MB]     CPU   :  [0]
[CPU: 80.9 MB]     GPU   :  [0]
[CPU: 80.9 MB]     RAM   :  [0, 1]
[CPU: 80.9 MB]     SSD   :  False
[CPU: 80.9 MB]   --------------------------------------------------------------
[CPU: 80.9 MB]   Importing job module for job type deep_picker_train...
[CPU: 437.9 MB]  Job ready to run
[CPU: 438.1 MB]  ***************************************************************
[CPU: 592.5 MB]  Using TensorFlow version 2.4.1

I tried submitting the job both to the general lane and to specific GPUs.
As for hardware, I have four RTX 2080 Ti cards. Software: nvcc --version reports 10.2; CryoSPARC is v3.2.

stephan · May 19, 2021, 1:29pm

Hey @marinegor,

Can you send the job log for this job? cryosparcm joblog P3 J170

marinegor · May 19, 2021, 2:13pm

There is no such file:

marin@cmm-1:~$ cryosparcm joblog P3 J170
/data/cryosparc_projects/P3/J170/job.log: No such file or directory

The folder structure confirms it:

marin@cmm-1:~$ tree /data/cryosparc_projects/P3/J170/ 
/data/cryosparc_projects/P3/J170/ 
├── events.bson                  
├── gridfs_data                  
└── job.json                          

1 directory, 2 files

stephan · May 19, 2021, 3:14pm

Hi @marinegor,

Can you try clearing the job, then running the job again?
Once it fails, you can try running the joblog command again to see if anything shows up. Can you also send me the output of nvidia-smi?

marinegor · May 19, 2021, 3:22pm

nvidia-smi:

Job log (obtained with cryosparcm joblog P3 J215 > log.txt):

================= CRYOSPARCW =======  2021-05-19 18:18:28.238449  =========
Project P3 Job J215
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 2473
MAIN PID 2473
deep_picker.run_deep_picker cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
2021-05-19 18:18:34.832676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
2021-05-19 18:19:22.589079: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-19 18:19:22.621872: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-19 18:19:22.670988: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.671739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.671886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.672556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.672628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.673253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties: 
pciBusID: 0000:03:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.673331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.673948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties: 
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.674004: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-19 18:19:22.853563: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-19 18:19:22.853750: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-19 18:19:22.871436: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-19 18:19:22.871843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-19 18:19:23.108518: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-19 18:19:23.108983: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/cryosparc/cryosparc_worker/cryosparc_compute/blobio:/opt/cryosparc/cryosparc_worker/cryosparc_compute/libs:/opt/cryosparc/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda/lib64:/opt/cryosparc/cryosparc_master/cryosparc_compute/blobio:/opt/cryosparc/cryosparc_master/cryosparc_compute/libs
2021-05-19 18:19:23.131132: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-19 18:19:23.131198: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
***************************************************************
Running job on hostname %s cmm-1
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'cmm-1', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [2], 'GPU': [1], 'RAM': [5, 6]}, 'target': {'cache_path': '/data/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'cmm-1', 'lane': 'default', 'monitor_port': None, 'name': 'cmm-1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}, 'ssh_str': 'cryosparcuser@cmm-1', 'title': 'Worker node cmm-1', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

stephan · May 19, 2021, 3:44pm

Hi @marinegor,

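Looks like this is your problem:

Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory

Can you check whether that file exists in /usr/local/cuda/lib64, and make sure /usr/local/cuda points to CUDA 11.0?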
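
A minimal sketch of that check, assuming a bash shell on the worker node. The function name check_cuda_lib is made up for illustration and is not part of cryoSPARC; the default directory and library name are simply the ones mentioned in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a cryoSPARC command): report whether a shared
# library is present in a CUDA library directory and where it resolves to.
check_cuda_lib() {
    local lib_dir="${1:-/usr/local/cuda/lib64}"
    local lib_name="${2:-libcusparse.so.11}"

    if [ -e "${lib_dir}/${lib_name}" ]; then
        # readlink -f follows symlinks, so the output also shows which
        # CUDA installation the file really comes from.
        echo "found: $(readlink -f "${lib_dir}/${lib_name}")"
        return 0
    fi
    echo "missing: ${lib_dir}/${lib_name}"
    return 1
}

# Example (run on the worker node):
#   check_cuda_lib /usr/local/cuda/lib64 libcusparse.so.11
#   readlink -f /usr/local/cuda    # shows which CUDA directory the link targets
```

If the second command prints a 10.x directory, that would be consistent with the missing libcusparse.so.11 warning in the job log.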

marinegor · May 19, 2021, 3:48pm

Ok, that makes sense – I have CUDA 10.2 on board, and thus the libcusparse.so.11 isn’t there.

Should I indeed upgrade to 11.0 for DeepPicker to work, or should it pick up the installed version somehow?

stephan · May 28, 2021, 4:58pm

Hi @marinegor,

You can update the version of CUDA that cryoSPARC uses by running the cryosparcw newcuda command:
https://guide.cryosparc.com/setup-configuration-and-management/management-and-monitoring/cryosparcw#cryosparcw-newcuda-less-than-path-greater-than
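
For illustration, a hedged sketch of how that call might look. The cryosparcw path is the worker_bin_path shown in the job log earlier in this thread; /usr/local/cuda-11.0 is an assumed install location, and the switch_worker_cuda wrapper is hypothetical. It only checks that the toolkit directory has a lib64 folder (the layout seen in the LD_LIBRARY_PATH above) before calling cryosparcw newcuda <path>.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: sanity-check the CUDA directory, then hand it to
# "cryosparcw newcuda <path>" as described in the linked guide.
switch_worker_cuda() {
    local cryosparcw_bin="$1"   # path to the cryosparcw executable
    local cuda_path="$2"        # CUDA toolkit root, e.g. /usr/local/cuda-11.0

    if [ ! -d "${cuda_path}/lib64" ]; then
        echo "no lib64 directory under ${cuda_path}" >&2
        return 1
    fi
    "${cryosparcw_bin}" newcuda "${cuda_path}"
}

# Example (paths are assumptions; adjust to your installation):
#   switch_worker_cuda /opt/cryosparc/cryosparc_worker/bin/cryosparcw /usr/local/cuda-11.0
```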

marinegor · June 3, 2021, 1:25pm

Thanks, it’s running smoothly now

marinegor · June 3, 2021, 3:18pm

OK, it is indeed running (meaning it goes to the “running” state), but it still cannot progress with training.
Namely, I get the following error:

[CPU: 624.6 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 255, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: No particles corresponding to input micrographs were found. Ensure that non-zero particle picks were input and that the particle picks are from the input micrographs.

This happens even though I submit the same particles and micrographs that I used earlier for Topaz, which trains successfully with them.

Omid · June 4, 2021, 10:28pm

Hi @stephan,

I recently encountered the same issue. I have the latest CUDA (11.3) and updated as you suggested here. Still run into the same error as reported above:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.

Any idea what else might cause the issue?

The job log file seems to indicate the same issue, namely:
“Could not load dynamic library 'libcusolver.so.10'”

EDIT: I found another post where you suggested making a hard link for libcusolver.so.10 in the cuda-11.3 folder. That seems to have solved the problem. Thanks.

Best,
Omid
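
As a rough sketch of how to spot such messages quickly, the warnings can be pulled out of a saved job log with grep and sed. The missing_libs helper below is hypothetical, and it assumes the log uses the “Could not load dynamic library '...'” wording seen in the job log posted earlier in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the unique library names that TensorFlow
# reported as not loadable. Reads the given file(s), or stdin if none.
missing_libs() {
    grep -o "Could not load dynamic library '[^']*'" "$@" \
        | sed "s/.*'\([^']*\)'/\1/" \
        | sort -u
}

# Example, using the joblog command from earlier in the thread:
#   cryosparcm joblog P3 J215 > log.txt
#   missing_libs log.txt
```

An empty result would suggest that the GPU assertion has a different cause than a missing CUDA library.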

stephan · July 6, 2021, 2:07pm

Hi @marinegor,

Sorry for the late response. Could you try re-running the job with the “Number of Parallel Threads” parameter set to just 1?

marinegor · July 8, 2021, 7:39pm

Hi @stephan, in either case (number of available GPUs set to 4 or 1, number of parallel threads set to 4 or 1) it throws the above-mentioned error. The cryosparcm joblog output is quite uninformative as well:

Traceback (most recent call last):   
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main        
    "__main__", mod_spec)                                                                                                                     
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 85, in _run_code 
    exec(code, run_globals)    
  File "/opt/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 89, in <module>        
    print(eval("cli."+command))           
  File "<string>", line 1, in <module>           
  File "/opt/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 62, in func         
    assert False, res['error']       
AssertionError: {'code': 500, 'data': None, 'message': "OtherError: argument of type 'NoneType' is not iterable", 'name': 'OtherError'}

stephan · August 25, 2021, 5:10pm

Hi @marinegor,

Are you still having this issue? If you connect your particles and micrographs to an Inspect Picks job, are you able to see corresponding picks on each micrograph?

marinegor · August 30, 2021, 10:48am

yes, they look pretty normal. Also, all other types of picking (template, blob, or Topaz) work fine.

stephan · August 30, 2021, 2:40pm

Hi @marinegor,

Are you on the latest major + patch version of cryoSPARC?

marinegor · August 31, 2021, 12:56pm

@stephan I’m on cryosparc version v3.2.0, patch 210817.

It also seems that the “Input number of GPUs must be less than or equal to …” problem might be easier to fix than I expected: by running cryosparcm joblog P3 J739 (the failing job) I found that the library being searched for is libcusolver.so.10, although the other libraries from CUDA 11.3 load successfully. I soft-linked the existing libcusolver library like this:

$ realpath $(echo $LD_LIBRARY_PATH | cut -c2-)/libcusolver.so -l
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcusolver.so.11.2.0.43 
$ ln -s /usr/local/cuda-11.4/targets/x86_64-linux/lib/libcusolver.so.{11.2.0.43,10}
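
A slightly more defensive variant of the same workaround, as a sketch only. The link_lib_alias function is hypothetical; it checks that the source file exists and refuses to overwrite anything already present under the target name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expose an existing library under the file name
# TensorFlow asks for, without clobbering an existing file or link.
link_lib_alias() {
    local src="$1"         # existing library file
    local link_name="$2"   # name being searched for, e.g. .../libcusolver.so.10

    if [ ! -f "${src}" ]; then
        echo "source library not found: ${src}" >&2
        return 1
    fi
    if [ -e "${link_name}" ] || [ -L "${link_name}" ]; then
        echo "refusing to overwrite existing ${link_name}" >&2
        return 1
    fi
    ln -s "${src}" "${link_name}"
}

# Example with the paths from this post (likely needs root privileges):
#   lib=/usr/local/cuda-11.4/targets/x86_64-linux/lib
#   link_lib_alias "$lib/libcusolver.so.11.2.0.43" "$lib/libcusolver.so.10"
```

Note that pointing a .so.10 name at a .so.11 build is a workaround: differing version suffixes usually signal that binary compatibility is not guaranteed, so it seems safer to treat the link as temporary.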

However, with that fixed, the other problem still persists:

[CPU: 596.8 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 255, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: No particles corresponding to input micrographs were found. Ensure that non-zero particle picks were input and that the particle picks are from the input micrographs.

I double-checked that “Inspect Picks” works fine on the input data, and I can see the particles there.

lalmagor · September 8, 2021, 2:07pm

Hi,

I’m experiencing a similar problem when trying to run Deep Picker:

[CPU: 372.1 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.

It stays the same regardless of the number of GPUs selected. Other job types, including Topaz training, work fine.

Any ideas?

Thanks,
Lior

amaker · June 15, 2022, 4:22pm

Hi all,

Was there ever a resolution to this bug? I am experiencing the same error.

Thanks!

stavros · October 20, 2022, 12:55pm

This still exists in v4.0.1.