3D Flex pytorch issue on RTX4090

Hi,

I’m having trouble running 3D Flex Refinement on an RTX 4090.

After I installed the dependencies using cryosparcw, the job failed with “RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR”. I believe this is a CUDA-related issue specific to the RTX 4090 running pytorch-cuda=11.7, which is the version that cryosparcw installed. It’s the same error as in pytorch/pytorch issue #88038 on GitHub, “CUFFT_INTERNAL_ERROR on RTX 4090”.
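For anyone who wants to check whether the cuFFT failure reproduces outside of a CryoSPARC job, here is a minimal sketch. It assumes it is run with the worker environment's Python (for example through cryosparcw call python). The function name check_cufft is made up for illustration, and the 64x64 input size is arbitrary.

```python
def check_cufft():
    """Run one small FFT on the GPU and describe the outcome as a string."""
    try:
        import torch
    except (ImportError, OSError):
        return "torch could not be imported in this environment"
    if not torch.cuda.is_available():
        return "torch is installed, but no CUDA device is visible"
    try:
        x = torch.randn(64, 64, device="cuda")
        torch.fft.fft2(x)          # FFTs on CUDA tensors go through cuFFT
        torch.cuda.synchronize()   # surface asynchronous CUDA errors here
    except RuntimeError as exc:
        return f"FFT on GPU failed: {exc}"
    return f"FFT on GPU OK (torch {torch.__version__}, CUDA {torch.version.cuda})"


print(check_cufft())
```

If this prints a cuFFT error, the problem is likely in the pytorch/CUDA combination itself rather than in the 3D Flex job.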

To get around that, I used cryosparcw ipython to upgrade pytorch to pytorch-cuda=11.8. A simple pytorch test in ipython went fine, but after the upgrade the 3D Flex job failed with the following error:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 83, in cryosparc_compute.run.main
  File "/home/hongjiang/cryosparc/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 442, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/home/hongjiang/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1174, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 24, in init cryosparc_compute.jobs.flex_refine.flexmod
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

I think this error occurs because the pytorch-cuda=11.8 build no longer ships a separate libtorch_cuda_cu.so; the libraries are merged into a single libtorch_cuda.so file.
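One way to see which CUDA libraries a given pytorch install actually ships is to list the files in its lib directory. The sketch below assumes the usual layout where the shared libraries sit in a lib folder inside the torch package. The helper names torch_lib_dir and cuda_libs are made up for illustration.

```python
import importlib.util
from pathlib import Path


def torch_lib_dir():
    """Locate <torch package>/lib without importing torch; None if torch is absent."""
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "lib"


def cuda_libs(lib_dir):
    """Return the sorted names of libtorch_cuda* shared libraries in lib_dir."""
    return sorted(p.name for p in Path(lib_dir).glob("libtorch_cuda*.so*"))


if __name__ == "__main__":
    lib_dir = torch_lib_dir()
    if lib_dir is None or not lib_dir.is_dir():
        print("torch lib directory not found")
    else:
        print(lib_dir)
        for name in cuda_libs(lib_dir):
            print("  " + name)
```

If libtorch_cuda_cu.so is missing from the listing while libtorch_cuda.so is present, that would be consistent with the ImportError in the traceback.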

Any advice on how to get around this issue?

I am unsure about the state of the CryoSPARC cryosparc_worker installation at this point. You may try the following:

  1. “Reset” your cryosparc_worker installation:
    /home/hongjiang/cryosparc/cryosparc_worker/bin/cryosparcw forcedeps
  2. Then update your instance to version 4.2.1
  3. Then run
    /home/hongjiang/cryosparc/cryosparc_worker/bin/cryosparcw install-3dflex
    In CryoSPARC v4.2.1, this command should install pytorch with CUDA version 11.8, which should support your GPU.

Please update this forum topic with your progress or any errors you may encounter.

I’m running version 4.2.1, and the vanilla cryosparcw install-3dflex installed pytorch=1.13.1, which I believe is built against CUDA 11.7 only. The actual code in cryosparcw is here:

install-3dflex)
shift

# remove CUDA from PATH and LD_LIBRARY_PATH variables
pathremove "$CRYOSPARC_CUDA_PATH/bin"
pathremove "$CRYOSPARC_ROOT_DIR/deps/external/cudnn/lib" LD_LIBRARY_PATH
pathremove "$CRYOSPARC_CUDA_PATH/lib64" LD_LIBRARY_PATH

# install new dependencies
echo "Installing 3D Flex Refine dependencies..."

conda install -y cuda-nvcc=11.8 cuda-toolkit=11.8 -c nvidia/label/cuda-11.8.0
pip uninstall -y pycuda
pip install torch~=1.13.1 --no-dependencies
# recompile PyCUDA with the newly installed, self-contained NVCC and CUDA Toolkit
pycuda_wheel="$CRYOSPARC_ROOT_DIR/deps_bundle/python/python_packages/pip_packages/pycuda-2020.1-cp38-cp38-linux_x86_64.whl"
if [ -f "$pycuda_wheel" ]; then
    pycuda_location="$pycuda_wheel"
else
    pycuda_location="pycuda==2020.1"
fi
pip install "$pycuda_location" --no-cache-dir --no-dependencies

# ensure torch is working (requires an NVIDIA GPU)
if is_pytorch_available; then
    echo "3D Flex Refine dependencies installed successfully."
else
    2>&1 echo 'NOTE: Installation of 3D Flex dependencies succeeded, but PyTorch or NVIDIA GPU were not detected.'
    2>&1 echo 'This is expected at this point of the installation process.'
    2>&1 echo 'Please confirm the installation by running a 3D Flex job.'
fi
exit 0
;;
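To confirm which CUDA version the installed torch wheel was built against, one can look at torch.version.cuda, or at the +cuNNN suffix that pytorch wheels usually carry in their version string. The sketch below is an illustration only: the helper names are made up, and the suffix convention (last digit is the minor version) is an assumption that holds for tags such as cu117 and cu118.

```python
import re


def cuda_from_torch_version(version):
    """Return e.g. '11.7' for '1.13.1+cu117', or None if there is no +cuNNN suffix."""
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: the last digit is the minor version, the rest the major.
    return f"{digits[:-1]}.{digits[-1]}"


def installed_torch_cuda():
    """Return (torch version, CUDA version torch was built with), or None without torch."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    return torch.__version__, torch.version.cuda


if __name__ == "__main__":
    print(installed_torch_cuda())
```

Run with the worker environment's Python, a result such as ('1.13.1+cu117', '11.7') would confirm that the CUDA 11.7 build of torch was installed.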

The CUDA version may differ depending on the CryoSPARC version at the time one runs cryosparcw install-3dflex.

If one had run cryosparcw install-3dflex with an older version of CryoSPARC, one may end up with a pytorch installation that won’t run on a 4090 GPU.
If

/home/hongjiang/cryosparc/cryosparc_worker/bin/cryosparcw call which nvcc

points to a file under
/home/hongjiang/cryosparc/cryosparc_worker/
and

/home/hongjiang/cryosparc/cryosparc_worker/bin/cryosparcw call nvcc --version

shows a release lower than 11.8, you may try
  1. /home/hongjiang/cryosparc/cryosparc_worker/bin/cryosparcw forcedeps
  2. /home/hongjiang/cryosparc/cryosparc_worker/bin/cryosparcw install-3dflex
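The release check above can also be scripted. The following is a small sketch that parses the text printed by nvcc --version and compares it against 11.8. The function names are made up for illustration, and the parser assumes the usual "release X.Y" wording in the nvcc output.

```python
import re


def nvcc_release(version_output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_reinstall(version_output, minimum=(11, 8)):
    """True if a release was found and it is lower than `minimum`."""
    release = nvcc_release(version_output)
    return release is not None and release < minimum


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 11.7, V11.7.99"
    print(nvcc_release(sample), needs_reinstall(sample))
```

The sample string is only an example; in practice one would paste or pipe in the real output of cryosparcw call nvcc --version.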

Hi, I have recently found the same error, also on RTX 4090. I have confirmed that cryosparcw call nvcc --version returns release 11.8.

And also, even after running cryosparcw forcedeps and cryosparcw install-3dflex I still get the same error.

Any suggestions to fix this would be greatly appreciated!

@Rafa What is the output of these commands on the computer with the RTX 4090?

csw=/path/to/cryosparc_worker/bin/cryosparcw # edit this
$csw call which nvcc
$csw call python -c "import torch, pycuda.driver; print(f'pycuda version? {pycuda.driver.get_version()}\nTorch CUDA available? {torch.cuda.is_available()}')"
nvidia-smi
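As an optional complement to the commands above, torch itself can report the full device name and compute capability of each GPU it sees. This is a sketch only, meant to be run with the worker environment's Python; the function name list_gpus is made up for illustration. An RTX 4090 should report compute capability 8.9.

```python
def list_gpus():
    """Return (index, name, compute capability) for each GPU visible to torch."""
    try:
        import torch
    except (ImportError, OSError):
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        gpus.append((i, torch.cuda.get_device_name(i), f"{major}.{minor}"))
    return gpus


for gpu in list_gpus():
    print(gpu)
```

An empty result means either that torch could not be imported or that it does not see a CUDA device.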

@wtempel Sorry for the delayed reply!

Please find below the requested output:

$csw call which nvcc
/home/apps/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin/nvcc
$csw call python -c "import torch, pycuda.driver; print(f'pycuda version? {pycuda.driver.get_version()}\nTorch CUDA available? {torch.cuda.is_available()}')"
pycuda version? (11, 8, 0)
Torch CUDA available? True
nvidia-smi
Mon Jun 12 16:59:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:81:00.0 Off |                  Off |
| 30%   35C    P2    73W / 450W |    400MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  Off |
| 30%   27C    P8    22W / 450W |      8MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7106      G   /usr/libexec/Xorg                   4MiB |
|    0   N/A  N/A    639193      C   python                            392MiB |
|    1   N/A  N/A      7106      G   /usr/libexec/Xorg                   4MiB |
+-----------------------------------------------------------------------------+

Thanks a lot in advance!

Best wishes,

Rafa

Just to add on, @wtempel: we get the same error with RTX 4000 SFF GPUs, using driver 525 and CUDA 11.8.

To avoid any confusion, could you please post the error(s) you observed?

This error occurs immediately after Initializing torch..

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 171, in cryosparc_compute.jobs.flex_refine.run_train.run
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 387, in cryosparc_compute.jobs.flex_refine.flexmod.run_test_density_opt
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

I get the following error immediately after “Importing job module for job type flex_train…” :

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 83, in cryosparc_compute.run.main
  File "/home/apps/cryosparc/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 442, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/home/apps/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1174, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 24, in init cryosparc_compute.jobs.flex_refine.flexmod
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
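Since two different errors appear in this thread, it may help to tell them apart when reporting. The sketch below simply looks for the two messages in a pasted job log. The function name and the returned labels are made up for illustration and are not part of CryoSPARC.

```python
def classify_flex_error(log_text):
    """Label a pasted job log by which of the two errors in this thread it shows."""
    if "ImportError" in log_text and "libtorch_cuda_cu.so" in log_text:
        # The job module could not be imported: a torch shared library is missing.
        return "missing-torch-library"
    if "CUFFT_INTERNAL_ERROR" in log_text:
        # Import worked, but the first GPU FFT failed at run time.
        return "cufft-internal-error"
    return "unknown"


print(classify_flex_error("RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR"))
```

The first label corresponds to a failure while importing the job module, the second to a failure after torch has initialized.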

@alexwoo1221 @Rafa @UCBKurt We have confirmed problems running 3DFlex jobs on a Hopper GPU and are investigating a possible fix.

Hello @wtempel, I am wondering whether there is any way to get around this error. We have identified a case where 3D variability is giving interesting results and would like to give it a shot with 3D Flex as well :) Thanks a lot in advance!

I also encountered the same problem on an RTX 4090 machine with v4.2.1.
Can this issue be solved by upgrading to the latest v4.3.0?
If there is other possible approach, please let me know.
Thanks a lot!

Unfortunately, v4.3 does not include added support for running 3DFlex jobs on RTX 4090 devices, but we aim to add support in a future release.

Eager to use this powerful tool in the next release. Thanks for your prompt reply.

@alexwoo1221 @jpliu @Rafa @UCBKurt Today we released CryoSPARC v4.4, which should support 3DFlex jobs on cards like the RTX 4090 without the need to run cryosparcw install-3dflex.

Great job! I’ll give it a try. Thanks a lot!