3dflex job failed with "libnccl.so.2"

Hi, a standalone CryoSPARC instance with a fresh 3DFlex installation failed with the error: “ImportError: libnccl.so.2: cannot open shared object file: No such file or directory”.

I deleted cryosparc_worker and reinstalled it, but it still doesn’t work. Also, I can’t find libnccl.so.2 anywhere under the cryosparc_worker path.

What do you suggest? Do I need to install NCCL into the OS?

It would be a good thing to check.

sudo apt install libnccl2 should do the trick.

@wsatbluesky Please can you post

  • the text of this Traceback to make your interesting question easier to find by future visitors of the forum
  • CryoSPARC version and patch level
  • the output (as text) of these commands
ldd /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/_C.cpython-38-x86_64-linux-gnu.so
cat /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/version.py
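
To make the ldd output quicker to read, the unresolved libraries can be filtered out. This is only a minimal sketch: `missing_libs` is a hypothetical helper name, not part of CryoSPARC, and it assumes the usual `name => not found` format that ldd prints.

```shell
# List the library names that ldd reports as unresolved.
# Reads ldd-style output on stdin and prints each missing library once.
missing_libs() {
    awk '/=> not found/ { print $1 }' | sort -u
}

# Usage, with the same path as in the ldd command above:
# ldd /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/_C.cpython-38-x86_64-linux-gnu.so | missing_libs
```

A missing library here does not necessarily mean it has to be installed system-wide; it can also mean the dynamic loader is being pointed at the wrong directories.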

@wtempel
The CryoSPARC version is 4.2.1, no patch.

The traceback looks like this:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 83, in cryosparc_compute.run.main
  File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 442, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1174, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 19, in init cryosparc_compute.jobs.flex_refine.flexmod
  File "/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

Below is the output of the two commands:

[user@local cryosparc_worker]$ ldd /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/_C.cpython-38-x86_64-linux-gnu.so
        linux-vdso.so.1 =>  (0x00007ffd0ab95000)
        libtorch_python.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libtorch_python.so (0x00007f71e50b2000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f71e4e96000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f71e4ac8000)
        libshm.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libshm.so (0x00007f71e6343000)
        libtorch.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libtorch.so (0x00007f71e631d000)
        libnvToolsExt.so.1 => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvToolsExt.so.1 (0x00007f71e48bf000)
        libtorch_cpu.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so (0x00007f71cba0c000)
        libtorch_cuda.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so (0x00007f71a594b000)
        libc10_cuda.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so (0x00007f71e62ad000)
        libc10.so => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libc10.so (0x00007f71e61f2000)
        libcudart.so.11.0 => /usr/local/cuda11.1/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007f71a56c6000)
        libcudnn.so.8 => not found
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f71a53be000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f71a51a8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f71e6167000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f71a4fa0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f71a4d9c000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f71a4a9a000)
        libgomp-a34b3233.so.1 => /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1 (0x00007f71a4870000)
        libcupti.so.11.7 => not found
        libcusparse.so.11 => /usr/local/cuda11.1/targets/x86_64-linux/lib/libcusparse.so.11 (0x00007f7196637000)
        libcurand.so.10 => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcurand.so.10 (0x00007f71925d6000)
        libcudnn.so.8 => not found
        libnccl.so.2 => not found
        libcufft.so.10 => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft.so.10 (0x00007f7189f9c000)
        libcublas.so.11 => /usr/local/cuda11.1/targets/x86_64-linux/lib/libcublas.so.11 (0x00007f7181b80000)
        libcublasLt.so.11 => /usr/local/cuda11.1/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007f7173b8c000)

[user@local cryosparc_worker]$ cat /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/version.py
__version__ = '2.0.0+cu117'
debug = False
cuda = '11.7'
git_version = 'c263bd43e8e8502d4726643bc6fd046f0130ac0e'
hip = None

From the output, as @rbs_sci said, it seems I need to install these “not found” libraries into the OS's CUDA installation.

Hm. Maybe not. I’ve checked and I don’t have libnccl installed on any of my recently set up systems but cryoSPARC works without any problems…

@rbs_sci You are right. The other CryoSPARC instance I installed works well. Maybe I need to double-check the system environment. The only difference is the torch version, which is weird.
Below is the torch version info from the working CryoSPARC instance.

[cryosparc@login01 cryosparc]$ cat ~/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/version.py
__version__ = '1.13.1+cu117'
debug = False
cuda = '11.7'
git_version = '49444c3e546bf240bed24a101e747422d1f8a0ee'
hip = None

I’d try a fresh install into a new directory, making sure that there is nothing CryoSPARC or CUDA related in your environment variables or .bashrc…

You could also try the forcedeps command to reinstall, then install-3dflex…
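
As a rough sketch of that sequence, assuming the worker lives at /opt/cryosparc/cryosparc_worker as in the traceback above: `reinstall_3dflex`, `run`, `WORKER_DIR` and `DRY_RUN` are hypothetical names made up for this example. Clearing LD_LIBRARY_PATH for the two calls is my own assumption about what "nothing CUDA related in your environment" would mean in practice. By default the commands are only printed, not executed.

```shell
# Hypothetical install location; adjust to your system.
WORKER_DIR="${WORKER_DIR:-/opt/cryosparc/cryosparc_worker}"

# Dry run by default: print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reinstall_3dflex() {
    # Assumption: run with LD_LIBRARY_PATH removed so that no
    # system-wide CUDA directories leak into the reinstall.
    run env -u LD_LIBRARY_PATH "$WORKER_DIR/bin/cryosparcw" forcedeps
    run env -u LD_LIBRARY_PATH "$WORKER_DIR/bin/cryosparcw" install-3dflex
}

# Usage: review the printed commands first, then run for real with
# DRY_RUN=0 reinstall_3dflex
```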

Sorry for the confusion; I had NCCL installed on the box I was logged in to (and which was the one I checked before replying!) from tinkering with something else.

Thanks @wsatbluesky for posting the outputs in 3dflex job failed with "libnccl.so.2" - #4 by wsatbluesky.
The outputs indicate several potential problems. You may want to start with the steps in Installing "3dflex" got failed - #5 by wtempel
and see if those steps enable 3D Flex jobs on your instance.

@rbs_sci @wtempel Thanks. Someone had set a wrong global $LD_LIBRARY_PATH. I cleared it and reinstalled 3dflex. It’s OK now.
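
For anyone who lands here with the same error, this is roughly how the variable can be inspected. It is a minimal sketch: `list_ld_paths` is a hypothetical helper name, and where a global setting actually lives (for example /etc/profile.d or ~/.bashrc) will differ per system.

```shell
# Print each LD_LIBRARY_PATH entry on its own line so that stray CUDA
# directories are easy to spot. Takes the value as its first argument
# and skips empty entries.
list_ld_paths() {
    printf '%s\n' "$1" | tr ':' '\n' | sed '/^$/d'
}

# Usage:
# list_ld_paths "${LD_LIBRARY_PATH:-}"
#
# To clear it for the current shell only (a global setting has to be
# removed wherever it is defined):
# unset LD_LIBRARY_PATH
```

In the ldd output earlier in the thread, libraries resolve from both /usr/local/cuda-10.1 and /usr/local/cuda11.1, which is the kind of mix a stray LD_LIBRARY_PATH can produce.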

Glad you got it sorted. :slight_smile: