Hi @marinegor,
Can you try clearing the job, then running the job again?
Once it fails, you can try running the joblog command again to see if anything shows up. Can you also send me the output of nvidia-smi?
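(For reference, those two commands look like the following; the project/job IDs are the ones used in this thread:)
$ cryosparcm joblog P3 J215 > log.txt
$ nvidia-smi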
nvidia-smi:
Job log (obtained with cryosparcm joblog P3 J215 > log.txt):
================= CRYOSPARCW ======= 2021-05-19 18:18:28.238449 =========
Project P3 Job J215
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 2473
MAIN PID 2473
deep_picker.run_deep_picker cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
2021-05-19 18:18:34.832676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
2021-05-19 18:19:22.589079: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-19 18:19:22.621872: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-19 18:19:22.670988: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.671739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.671886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.672556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.672628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.673253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:03:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.673331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 18:19:22.673948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-05-19 18:19:22.674004: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-19 18:19:22.853563: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-19 18:19:22.853750: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-19 18:19:22.871436: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-19 18:19:22.871843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-19 18:19:23.108518: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-19 18:19:23.108983: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/cryosparc/cryosparc_worker/cryosparc_compute/blobio:/opt/cryosparc/cryosparc_worker/cryosparc_compute/libs:/opt/cryosparc/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda/lib64:/opt/cryosparc/cryosparc_master/cryosparc_compute/blobio:/opt/cryosparc/cryosparc_master/cryosparc_compute/libs
2021-05-19 18:19:23.131132: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-19 18:19:23.131198: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
***************************************************************
Running job on hostname %s cmm-1
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'cmm-1', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [2], 'GPU': [1], 'RAM': [5, 6]}, 'target': {'cache_path': '/data/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'cmm-1', 'lane': 'default', 'monitor_port': None, 'name': 'cmm-1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}, 'ssh_str': 'cryosparcuser@cmm-1', 'title': 'Worker node cmm-1', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.
Hi @marinegor,
Looks like this is your problem:
Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
Can you check if the file exists inside /usr/local/cuda/lib64, and make sure /usr/local/cuda is pointing to CUDA-11.0?
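A quick way to check both (the library name is taken from the log above; adjust the paths for your install):
$ ls -l /usr/local/cuda/lib64/libcusparse.so.11
$ readlink -f /usr/local/cuda   # should resolve to a CUDA 11.0 installation, e.g. /usr/local/cuda-11.0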
Ok, that makes sense – I have CUDA 10.2 on board, and thus the libcusparse.so.11 isn’t there.
Should I indeed upgrade to 11.0 for DeepPicker to work, or should it pick up the installed version somehow?
Hi @marinegor,
You can update the version of CUDA that cryoSPARC uses by running the cryosparcw newcuda command:
https://guide.cryosparc.com/setup-configuration-and-management/management-and-monitoring/cryosparcw#cryosparcw-newcuda-less-than-path-greater-than
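A minimal sketch of that, assuming the worker is installed under /opt/cryosparc/cryosparc_worker (as in the job log above) and that CUDA 11.0 lives in /usr/local/cuda-11.0:
$ cd /opt/cryosparc/cryosparc_worker
$ ./bin/cryosparcw newcuda /usr/local/cuda-11.0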
Ok, it's indeed running (meaning it goes to the 'running' state), but it still cannot progress with training.
Namely, I get the following error:
[CPU: 624.6 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 255, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: No particles corresponding to input micrographs were found. Ensure that non-zero particle picks were input and that the particle picks are from the input micrographs.
Although I submit the same particles & micrographs as I did earlier for Topaz (and it successfully trains with them).
Hi @stephan,
I recently encountered the same issue. I have the latest CUDA (11.3) and updated as you suggested here, but I still run into the same error reported above:
Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.
Any idea what else might cause the issue?
The job log file seems to indicate the same issue, i.e.:
"Could not load dynamic library 'libcusolver.so.10'"
EDIT: I found another post where you suggested making a hard link for libcusolver.so.10 in the cuda-11.3 folder. That seems to have solved the problem. Thanks!
Best,
Omid
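For reference, the hard-link workaround Omid describes would look roughly like this; the exact directory and library version are assumptions, so check what your CUDA 11.3 install actually ships (root access may be required):
$ cd /usr/local/cuda-11.3/targets/x86_64-linux/lib
$ sudo ln libcusolver.so.11 libcusolver.so.10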
Hi @marinegor,
Sorry for the late response. Is it possible for you to try re-running the job with the “Number of Parallel Threads” parameter set to just 1?
Hi @stephan, in either case (# of available GPUs set to 4 or 1, and # of parallel threads set to 4 or 1) it throws the above-mentioned error. The cryosparcm joblog output gives quite an uninformative message as well:
Traceback (most recent call last):
File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 89, in <module>
print(eval("cli."+command))
File "<string>", line 1, in <module>
File "/opt/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 62, in func
assert False, res['error']
AssertionError: {'code': 500, 'data': None, 'message': "OtherError: argument of type 'NoneType' is not iterable", 'name': 'OtherError'}
Hi @marinegor,
Are you still having this issue? If you connect your particles and micrographs to an Inspect Picks job, are you able to see corresponding picks on each micrograph?
Yes, they look pretty normal. Also, all other types of picking (template, blob, or Topaz) work fine.
@stephan I’m on cryosparc version v3.2.0, patch 210817.
It also seems that the “Input number of GPUs must be less than or equal to …” problem might be fixed more easily than I expected: by running cryosparcm joblog P3 J739 (the failing job), I figured out that the library being searched for is libcusolver.so.10, although the other libraries from CUDA 11.3 are successfully loaded. I soft-linked the existing libcusolver library like this:
$ realpath $(echo $LD_LIBRARY_PATH | cut -c2-)/libcusolver.so -l
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcusolver.so.11.2.0.43
$ ln -s /usr/local/cuda-11.4/targets/x86_64-linux/lib/libcusolver.so.{11.2.0.43,10}
However, after fixing that, the other problem still persists:
[CPU: 596.8 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 255, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: No particles corresponding to input micrographs were found. Ensure that non-zero particle picks were input and that the particle picks are from the input micrographs.
Although I double-checked that “Inspect Picks” works fine on the input data, and I can see the particles there.
Hi,
I’m experiencing a similar problem when trying to run Deep Picker:
[CPU: 372.1 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.
It stays the same regardless of the number of GPUs selected. Other job types, including Topaz training, work fine.
Any ideas?
Thanks,
Lior
Hi all,
Was there ever a resolution to this bug? I am experiencing the same error.
Thanks!
Still exists in Ver.4.0.1
I can actually proudly say that I believe I've managed to handle this now.
In short, you should do the following:
1. Make sure your cryoSPARC installation knows which CUDA version you're running – I ran the newcuda command.
2. In my case, the CUDA version was 11.3, though DeepPicker would still search for libcusolver.so.10 – I believe it's hardcoded somewhere. The fix was easy – link libcusolver.so.10 to the currently existing library:
ln -s /usr/local/cuda/lib64/libcusolver.so{,.10}
3. Setting the number of threads to 1 helped me get rid of some other errors.
Hope that helps!
UPD: I am still on 3.3.2, but I hope it'll work for other versions too, since the changelog didn't mention anything in particular about DeepPicker.
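A quick way to double-check the result before re-running the job is to verify the link and whether the worker's TensorFlow can now see the GPUs; the worker path is an assumption taken from the logs earlier in this thread, and the second command relies on the cryosparcw call helper being available in your version:
$ ls -l /usr/local/cuda/lib64/libcusolver.so.10
$ /opt/cryosparc/cryosparc_worker/bin/cryosparcw call python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"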
Thanks for this, I tried that but it didn't work… Do you know where DeepPicker is looking? Because if it is in the CUDA directory, unfortunately I don't have access, since it is a shared cluster directory and I don't want to mess things up there.