3DFlex Dependencies; Building pycuda

scaiola · December 15, 2022, 12:55pm

Hey folks,

Had the same issue and figured out how to fix it. I think @rbs_sci is right: the problem is a mix of CUDA versions.

I simply changed the line in cryosparcw from

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia

to

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c "nvidia/label/cuda-11.7.0"

This forces the use of 11.7 for all packages, and it seems to work afterwards.
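
A minimal sketch of scripting that edit, for anyone who would rather not edit by hand. It assumes the line in cryosparcw ends in "-c nvidia" exactly as quoted above; pin_cuda_channel is a hypothetical helper name, not part of CryoSPARC, and it keeps a .bak copy of the original script.

```shell
#!/bin/sh
# Hypothetical helper: pin the conda channel on the cuda-nvcc install line.
# Assumption: the target line ends with "-c nvidia" as quoted in this thread.
pin_cuda_channel() {
    script="$1"                    # path to worker/bin/cryosparcw
    label="${2:-cuda-11.7.0}"      # channel label to pin to
    cp "$script" "$script.bak"     # keep a backup of the original
    # Rewrite only lines that install cuda-nvcc and end in "-c nvidia".
    sed "/conda install -y cuda-nvcc=11\.7/ s|-c nvidia[[:space:]]*\$|-c nvidia/label/$label|" \
        "$script.bak" > "$script"
}
# Usage (from the cryosparc_worker directory):
#   pin_cuda_channel bin/cryosparcw
```

Writing the result back through a redirect keeps the script's existing permissions, and the channel is written without quotation marks, so no typographic quotes can slip in.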

Hope that helps.

stavros · December 15, 2022, 1:33pm

Hey @scaiola what do you mean exactly?

I simply change the line of cryosparcw from xxxxxxxx

Could you provide a slightly more detailed, step-by-step description?
Thanks!

olibclarke · December 15, 2022, 2:44pm

Can confirm this works for us too, thank you!!

@stavros you need to edit the cryosparcw script (used to install the 3D-flex dependencies), which is located in the worker/bin directory. Editing the conda install line in the way that @scaiola showed works. Be careful if you paste the new line in: there are multiple types of quotation marks, and pasted ones can carry formatting that you don’t want, so I would recommend typing the line rather than copy-pasting it.
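
A quick way to check for stray typographic quotes after editing is sketched below. find_curly_quotes is a hypothetical helper name; it only searches for the four common curly quote characters and prints any line that contains one.

```shell
#!/bin/sh
# Hypothetical check: list lines containing typographic (curly) quotes.
# Exit status is 0 if any such line is found, 1 if the file is clean.
find_curly_quotes() {
    grep -n -e '“' -e '”' -e '‘' -e '’' "$1"
}
# Usage (from the cryosparc_worker directory):
#   find_curly_quotes bin/cryosparcw && echo "fix the lines listed above"
```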

Cheers
Oli

EDIT:
This worked for the Ubuntu system. For the CentOS one it is not quite there:

Requirement already satisfied: pytools>=2011.2 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (2020.4.4)
Requirement already satisfied: decorator>=3.2.0 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (4.4.2)
Requirement already satisfied: appdirs>=1.4.0 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (1.4.4)
Requirement already satisfied: mako in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (1.1.6)
Requirement already satisfied: numpy>=1.6.0 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pytools>=2011.2->pycuda==2020.1) (1.19.5)
Requirement already satisfied: MarkupSafe>=0.9.2 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from mako->pycuda==2020.1) (2.0.1)
Building wheels for collected packages: pycuda
  Building wheel for pycuda (setup.py) ... done
  Created wheel for pycuda: filename=pycuda-2020.1-cp37-cp37m-linux_x86_64.whl size=615783 sha256=329190f5458cf4d0e958ca89ab1226de960ff4bc47c9032d7f90bbe8a2d2aad5
  Stored in directory: /home/tmp/pip-ephem-wheel-cache-01nqtq71/wheels/fe/c9/2f/377db1b07f46ef88920cd6e533c2f0e1d0d0e3a5dbac1997bb
Successfully built pycuda
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
PyTorch not installed correctly, or NVIDIA GPU not detected.

EDIT2:
Fixed! Had to unset LD_LIBRARY_PATH first.
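
One way to do this without changing the login shell is sketched here. without_ld_library_path is a hypothetical wrapper name; it runs a single command in a subshell with the variable removed and leaves the calling shell untouched.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with LD_LIBRARY_PATH removed from its
# environment only; the calling shell keeps its own value.
without_ld_library_path() {
    ( unset LD_LIBRARY_PATH; "$@" )
}
# Usage (from the cryosparc_worker directory):
#   without_ld_library_path ./bin/cryosparcw install-3dflex
```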

Well… almost fixed. On our Ubuntu machine all runs fine. On CentOS it says dependencies installed successfully, but any 3D-flex job “terminates abnormally”, with no useful info in the joblog.

EDIT3: OK now it is working on CentOS, but I have no idea why or what changed…

jelka · December 15, 2022, 2:53pm

Thanks @stavros great spotting.
The double quotation marks are not needed and it also compiles with cuda-11.7.1 label:

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia/label/cuda-11.7.1

But would 3DFlex not work out of the box if one just did a normal worker install, compiled against an external cuda-11.7, followed by:

pip install torch ninja

And why the sudden need for an internal CUDA env, when the rest of CryoSPARC uses the system CUDA env?

Best,
Jesper

EDIT: I just tested this. 3DFlex does not need the internal CUDA env in anaconda. It can be built against the system CUDA installation. Just run a normal update from the worker directory:

eval $(bin/cryosparcw env)
bin/cryosparcw update
pip install torch ninja

Now pycuda in CryoSPARC is built against the system CUDA defined in config.sh, and it even runs.

leetleyang · December 15, 2022, 3:28pm

This gets us through the installation process as well. Brilliant.

ss4858 · December 15, 2022, 4:35pm

changing the line of cryosparcw from:

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia

to

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c "nvidia/label/cuda-11.7.0"

worked for us.

Thanks @scaiola

Bassem · December 15, 2022, 5:47pm

Thank you. Worked for us without unsetting LD_LIBRARY_PATH.

posertinlab · December 15, 2022, 9:04pm

This worked for us as well! For reference it’s line 457 of cryosparcw.
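
Since the line number may differ between versions, a small sketch for locating the line by its content instead. find_cuda_install_line is a hypothetical helper name, and the search string is the command text quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the line number and text of the conda command
# that installs cuda-nvcc, so the edit does not rely on a fixed line number.
find_cuda_install_line() {
    grep -n 'conda install -y cuda-nvcc' "$1"
}
# Usage (from the cryosparc_worker directory):
#   find_cuda_install_line bin/cryosparcw
```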

rbs_sci · December 16, 2022, 12:36am

Glad I was on the right path, at least.

All my poking around with quirky CUDA installs appears to have broken my worker install in a rather terminal fashion. Oh well, the database is safe, time for a reinstall.

Then maybe I can try 3D flex out.

edit: well, it installed OK with a non-broken conda install…

Successfully built pycuda
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
3D Flex Refine dependencies installed successfully.

Now to test it out.

edit 2:

All tests fail. Suspect due to --nossd flag used during install.

However, running jobs manually (full T20S workflow with every job type enabled, disabling SSD caching where necessary, and then running 3D flex, which is still training) all works fine.

qitsweauca · December 16, 2022, 6:45am

Now it’s fixed. It was a conflict with the system’s preloaded CUDA 11.2.
So yes, specifying cryosparcw with the cuda 11.7 conda update does work!!

Thank you all

------------------------ updated -------

Are you guys able to run through 3D Flex Training after specifying 11.7 update in cryosparc worker?

I still got the following error though… I guess unsetting LD_LIBRARY_PATH is necessary for the dependencies update.

[2022-12-16 17:40:16.44]
[CPU: 203.7 MB]
Traceback (most recent call last):
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 80, in cryosparc_compute.run.main
  File "/apps/cryosparc4/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 443, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 19, in init cryosparc_compute.jobs.flex_refine.flexmod
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /apps/cryosparc4/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

Bassem · December 16, 2022, 3:01pm

While @scaiola’s fix works for the update, it seems the flex training job ends with an error:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 80, in cryosparc_compute.run.main
  File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 443, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 19, in init cryosparc_compute.jobs.flex_refine.flexmod
  File "/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

any thoughts?

wtempel · December 16, 2022, 7:51pm

Hi,
As we are working to address the issues you have been experiencing with cryosparcw install-3dflex, we are looking for volunteers who would be willing to test our fix.
If you are confident in installing a CryoSPARC test version and, potentially, recovering from a broken installation, please send me a direct message with the following information for your CryoSPARC worker:

output of uname -a
the value of CRYOSPARC_CUDA_PATH inside cryosparc_worker/config.sh
the output of $CRYOSPARC_CUDA_PATH/bin/nvcc --version (after setting the CRYOSPARC_CUDA_PATH environment variable)
the output of nvidia-smi
the type of your CryoSPARC instance:
- single workstation (combined master/worker)
- master and CryoSPARC-managed workers
- master and cluster workers
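
The command outputs requested above can be gathered in one go with something like the sketch below. collect_worker_info is a hypothetical helper name, and it assumes config.sh is a plain shell file that exports CRYOSPARC_CUDA_PATH; the instance type still has to be stated by hand.

```shell
#!/bin/sh
# Hypothetical helper: print the details requested above for one worker.
# Assumption: config.sh is a shell file that exports CRYOSPARC_CUDA_PATH.
collect_worker_info() {
    worker_dir="$1"                # path to cryosparc_worker
    echo "== uname -a"
    uname -a
    # Source config.sh inside a subshell so the caller's environment is untouched.
    cuda_path=$( . "$worker_dir/config.sh" >/dev/null 2>&1; printf '%s' "$CRYOSPARC_CUDA_PATH" )
    echo "== CRYOSPARC_CUDA_PATH"
    echo "$cuda_path"
    echo "== nvcc --version"
    if [ -x "$cuda_path/bin/nvcc" ]; then
        "$cuda_path/bin/nvcc" --version
    else
        echo "nvcc not found under $cuda_path/bin"
    fi
    echo "== nvidia-smi"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi || echo "nvidia-smi returned an error"
    else
        echo "nvidia-smi not found on PATH"
    fi
}
# Usage:
#   collect_worker_info /path/to/cryosparc_worker > worker_info.txt
```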

Zy90 · December 17, 2022, 11:51pm

This change works for us too!

After the failing install-3dflex installation message, you need to run ./bin/cryosparcw forcedeps to roll back; otherwise it will still give the same error.
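
The roll-back-then-retry sequence can be sketched as below. retry_3dflex_install is a hypothetical helper name; it uses only the two cryosparcw subcommands mentioned in this thread and stops at the first one that fails.

```shell
#!/bin/sh
# Hypothetical helper: roll back with forcedeps, then retry the 3DFlex install.
# Runs in a subshell so the caller's working directory does not change.
retry_3dflex_install() {
    ( cd "$1" && ./bin/cryosparcw forcedeps && ./bin/cryosparcw install-3dflex )
}
# Usage:
#   retry_3dflex_install /path/to/cryosparc_worker
```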

Ubuntu 18.04.1
Cuda 11.8

wilnart · December 20, 2022, 3:04am

@Bassem @qitsweauca I got the same error as you. The simple fix is to connect the libraries as follows

export LD_LIBRARY_PATH=/home/xxxxx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/…/…/nvidia/cublas/lib/

There could be a more sophisticated solution. But this worked for me
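
One plausible reading of the error (an assumption, not confirmed in this thread) is that libcublas.so.11 is picking up a libcublasLt.so.11 from a different CUDA version. The sketch below lists every libcublasLt copy found in the LD_LIBRARY_PATH directories, in search order; list_cublaslt_on_ld_path is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical diagnostic: list every libcublasLt.so* found in the
# directories named by LD_LIBRARY_PATH, in search order.
list_cublaslt_on_ld_path() {
    old_ifs=$IFS
    IFS=:                          # split LD_LIBRARY_PATH on colons
    for dir in $LD_LIBRARY_PATH; do
        [ -d "$dir" ] || continue  # skip empty or missing entries
        for lib in "$dir"/libcublasLt.so*; do
            [ -e "$lib" ] && echo "$lib"
        done
    done
    IFS=$old_ifs
}
# Usage:
#   list_cublaslt_on_ld_path
```

If more than one CUDA installation shows up in the output, the first entry listed is the one the dynamic loader tries first.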

hwangab · December 20, 2022, 1:00pm

I got the same error as @Bassem and @qitsweauca when running a training job after installing 3D Flex. Can you elaborate on your fix? What I did is

export LD_LIBRARY_PATH=/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/nvidia/cublas/lib/

and then ran the training job, but still got the same error.

Should I add the export LD_LIBRARY_PATH line above to my .bashrc file, or run it before running ./bin/cryosparcw install-3dflex?

Also if it’s relevant I’m using CentOS and the original cryosparc was installed using cuda 11.1. Shall I reinstall cryosparc using cuda 11.7?

wtempel · December 20, 2022, 4:50pm

We have just released CryoSPARC v4.1.1 to address this issue.

Bassem · December 20, 2022, 5:48pm

I carried out the update and then ran install-3dflex.

Everything seems to be working, but I got this message at the end of install-3dflex:

Installing collected packages: torch
Successfully installed torch-1.13.1
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
PyTorch not installed correctly, or NVIDIA GPU not detected.

Not sure what the last line affects, given that other jobs are running OK so far?
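
To see what PyTorch itself reports, a check along these lines may help. check_torch_gpu is a hypothetical helper name; it relies only on torch.__version__, torch.version.cuda and torch.cuda.is_available(), and it assumes the worker environment has been activated first.

```shell
#!/bin/sh
# Hypothetical check: report the PyTorch build and whether it can see a GPU.
# The first argument is the Python interpreter to use (default: python).
check_torch_gpu() {
    py="${1:-python}"
    "$py" -c 'import torch; print("torch", torch.__version__, "built for CUDA", torch.version.cuda, "GPU visible:", torch.cuda.is_available())' \
        || echo "torch could not be imported with $py"
}
# Usage (from the cryosparc_worker directory):
#   eval $(bin/cryosparcw env)
#   check_torch_gpu
```

If the import fails, the Python error printed just before the fallback message usually names the library that could not be loaded.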

donghuachen · December 20, 2022, 8:13pm

I updated from v4.0.3 to v4.1.1, then ran ./bin/cryosparcw install-3dflex and got exactly the same error as above. Does anybody have suggestions? Thanks.

Installing collected packages: torch
Successfully installed torch-1.13.1
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
PyTorch not installed correctly, or NVIDIA GPU not detected.

olibclarke · December 20, 2022, 8:16pm

We had the same issue on our CentOS machine. It turned out CUDA was present in our PATH and LD_LIBRARY_PATH (even though we had removed it from our .bashrc), and I think that was somehow causing the issue. You can check by running export | grep cuda
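
If CUDA entries do turn up, a sketch like the following strips them for the current shell only. strip_cuda_entries is a hypothetical helper name, and it drops any colon-separated entry that mentions "cuda" (case-insensitive), which may be broader than you want.

```shell
#!/bin/sh
# Hypothetical helper: print a colon-separated path list with every entry
# that mentions "cuda" (case-insensitive) removed.
strip_cuda_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -vi 'cuda' | paste -sd: -
}
# Usage, affecting the current shell only:
#   PATH=$(strip_cuda_entries "$PATH")
#   LD_LIBRARY_PATH=$(strip_cuda_entries "$LD_LIBRARY_PATH")
```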

Bassem · December 20, 2022, 8:34pm

@olibclarke Should we rerun install-3dflex after removing it from PATH?