No module named 'torch' after 4.1 upgrade

I see. mokca is on the right path.

After running:

cryosparcw install-3dflex

Now it seems to be working.

However, the install produces a very long gcc error, which perhaps should be investigated further.

Hey everyone,

Thanks for reporting.

If you’d like to run 3DFlex jobs, you will need to install the dependencies required via the install-3dflex command as mentioned here:

There seems to be an issue with the installation on some systems; we’re working on an update to fix this.

Thanks @stephan

We were able to install and start a run, but eventually get this error which seems related to GPU memory:

cryosparc_compute.jobs.flex_refine.flexmod.TetraSVFunction.forward torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 10.76 GiB total capacity; 7.15 GiB already allocated; 940.94 MiB free; 9.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there a place where we should set max_split_size_mb?
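
On the max_split_size_mb question: PyTorch reads allocator options from the PYTORCH_CUDA_ALLOC_CONF environment variable, in the form max_split_size_mb:<N>, and the variable has to be set before the process initializes CUDA. The sketch below pulls the allocated and reserved figures out of the message above and builds such a setting. The 128 MiB value is only an illustrative starting point, the helper names are made up for this example, and whether a CryoSPARC job inherits the variable from the worker's environment is an assumption, not something confirmed in this thread.

```python
import re

# The message reported above, trimmed to the part that carries the numbers.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 10.76 GiB total capacity; "
    "7.15 GiB already allocated; 940.94 MiB free; 9.57 GiB reserved in total by PyTorch)"
)


def reserved_minus_allocated_gib(message):
    """Return reserved minus allocated memory (GiB) parsed from a PyTorch OOM message."""
    allocated = re.search(r"([\d.]+) GiB already allocated", message)
    reserved = re.search(r"([\d.]+) GiB reserved", message)
    if allocated is None or reserved is None:
        raise ValueError("message does not contain GiB allocated/reserved figures")
    return float(reserved.group(1)) - float(allocated.group(1))


def alloc_conf(max_split_size_mb):
    """Build a value for the PYTORCH_CUDA_ALLOC_CONF environment variable."""
    return f"max_split_size_mb:{int(max_split_size_mb)}"


if __name__ == "__main__":
    gap = reserved_minus_allocated_gib(OOM_MESSAGE)
    print(f"reserved but not allocated: {gap:.2f} GiB")
    # 128 is an illustrative value, not a recommendation from this thread.
    print(f"export PYTORCH_CUDA_ALLOC_CONF={alloc_conf(128)}")
```

A large gap between reserved and allocated memory is the situation the error message itself associates with fragmentation; if the gap is small, the job most likely just needs more GPU memory than the card has.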

Also, will you post back here when you have a fix for the 3DFlex deps?

Many thanks!

Is it expected that the worker environment is using the system gcc/g++ instead of the conda version?

I am running into potential conflicts between /usr/bin/gcc and the cuda/nvcc from conda on Ubuntu 20.04. I have removed the possibly conflicting packages from Ubuntu (apt remove nvidia-cuda-toolkit…), but ./bin/cryosparcw install-3dflex keeps failing.

Then trying to revert with:
cryoem@myrdal:~/cryosparc2/cryosparc_worker$ ./bin/cryosparcw forcedeps
yields:

...
    gcc -pthread -B /home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -fwrapv -Wall -O3 -DNDEBUG -fPIC -DBOOST_ALL_NO_LIB=1 -DBOOST_THREAD_BUILD_DLL=1 -DBOOST_MULTI_INDEX_DISABLE_SERIALIZATION=1 -DBOOST_PYTHON_SOURCE=1 -Dboost=pycudaboost -DBOOST_THREAD_DONT_USE_CHRONO=1 -DPYGPU_PACKAGE=pycuda -DPYGPU_PYCUDA=1 -DHAVE_CURAND=1 -Isrc/cpp -Ibpl-subset/bpl_subset -I/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/core/include -I/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m -c src/cpp/cuda.cpp -o build/temp.linux-x86_64-3.7/src/cpp/cuda.o
    In file included from src/cpp/cuda.cpp:4:
    src/cpp/cuda.hpp:14:10: fatal error: cuda.h: No such file or directory
       14 | #include <cuda.h>
          |          ^~~~~~~~
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin/python3.7 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-3334l4mw/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-3334l4mw/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8lgu7bn9/install-record.txt --single-version-externally-managed --compile --install-headers /home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m/pycuda Check the logs for full command output.
check_install_deps.sh: 59: ERROR: installing python failed.

I had to re-add nvidia-cuda-toolkit and the system-provided CUDA 10.1.

This may be related to the thread "3DFlex Dependencies; Building pycuda" (#10 by qitsweauca).
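
The fatal error above means the compiler could not find cuda.h on any include path while building pycuda. A quick way to see which toolkit, if any, provides the header is to look in the usual places. The sketch below is only a diagnostic aid: the candidate directories (CUDA_HOME, CUDA_PATH, /usr/local/cuda, /usr) are common conventions assumed here, not paths taken from the CryoSPARC installer, and the function names are hypothetical.

```python
import os
from pathlib import Path


def candidate_include_dirs(environ=None):
    """List include directories where a CUDA toolkit commonly keeps cuda.h."""
    environ = os.environ if environ is None else environ
    dirs = []
    # Assumed conventions: toolkits often export one of these variables.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if environ.get(var):
            dirs.append(Path(environ[var]) / "include")
    # Assumed defaults for a system-wide install.
    dirs += [Path("/usr/local/cuda/include"), Path("/usr/include")]
    return dirs


def find_cuda_header(dirs):
    """Return the first directory in dirs that contains cuda.h, or None."""
    for directory in dirs:
        if (Path(directory) / "cuda.h").is_file():
            return Path(directory)
    return None


if __name__ == "__main__":
    found = find_cuda_header(candidate_include_dirs())
    if found is None:
        print("cuda.h not found in the usual locations")
    else:
        print(f"cuda.h found in {found}")
```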

tru@myrdal:~$ dpkg -l nvidia-cuda-toolkit
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                Version      Architecture Description
+++-===================-============-============-=================================
ii  nvidia-cuda-toolkit 10.1.243-3   amd64        NVIDIA CUDA development toolkit
tru@myrdal:~$ dpkg -l gcc g++
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version          Architecture Description
+++-==============-================-============-=================================
ii  g++            4:9.3.0-1ubuntu2 amd64        GNU C++ compiler
ii  gcc            4:9.3.0-1ubuntu2 amd64        GNU C compiler

Now I can revert to the previous setup:

cryoem@myrdal:~/cryosparc2/cryosparc_worker$ ./bin/cryosparcw forcedeps
Checking dependencies...
Forcing dependencies to be reinstalled...
  ------------------------------------------------------------------------
  Installing anaconda python...
  ------------------------------------------------------------------------
PREFIX=/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda
Unpacking payload ...
...
  Extracting all conda packages...
  ------------------------------------------------------------------------
...................................................................................................................................................................................
  ------------------------------------------------------------------------
    Done.
    conda packages installation successful.  
  ------------------------------------------------------------------------
  Preparing to install all pip packages...   
  ------------------------------------------------------------------------
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1.tar.gz
  Preparing metadata (setup.py) ... done
Skipping wheel build for pycuda, due to binaries being disabled for it.
Installing collected packages: pycuda
    Running setup.py install for pycuda ... done
Successfully installed pycuda-2020.1
  ------------------------------------------------------------------------
    Done.
    pip packages installation successful.
  ------------------------------------------------------------------------
  Main dependency installation completed. Continuing...
  ------------------------------------------------------------------------
Completed.
Currently checking hash for ctffind
Forcing reinstall for dependency ctffind...  
  ------------------------------------------------------------------------
  ctffind 4.1.10 installation successful.
  ------------------------------------------------------------------------
Completed.
Currently checking hash for cudnn
Forcing reinstall for dependency cudnn...
  ------------------------------------------------------------------------
  cudnn 8.1.0.77 for CUDA 11 installation successful.
  ------------------------------------------------------------------------
Completed.
Currently checking hash for gctf
Forcing reinstall for dependency gctf...
  ------------------------------------------------------------------------
  Gctf v1.06 installation successful.
  ------------------------------------------------------------------------
Completed.
Completed dependency check.
Generating '/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/libtiff/tiff_h_4_4_0.py' from '/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/../include/tiff.h'

The reverted system yields a non-functional CryoSPARC, with errors such as:

I used the fix provided in "3DFlex Dependencies; Building pycuda" (#15 by scaiola),

replacing line 457 of cryosparc_worker/bin/cryosparcw:

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia

by:

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia/label/cuda-11.7.0

Running cryosparc_worker/bin/cryosparcw install-3dflex afterwards seems to have fixed everything and activated the 3DFlex functionality.

This fix appears to work for us as well on a CentOS system.

I did have to first revert the system:

./bin/cryosparcw forcedeps

Then edit cryosparcw and change from:

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia

To:

conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia/label/cuda-11.7.0

Then finally run the 3dflex installer:

./bin/cryosparcw install-3dflex

No more lengthy errors, and users report that jobs appear ok so far.
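
The manual edit described above amounts to a one-line substitution. As a sketch only, it could be expressed as below. The function matches on the text of the line rather than on line 457, on the assumption that the line number may differ between releases, and pin_cuda_channel is a hypothetical helper, not part of CryoSPARC.

```python
OLD = "conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia"
NEW = "conda install -y cuda-nvcc=11.7 cuda-toolkit=11.7 -c nvidia/label/cuda-11.7.0"


def pin_cuda_channel(script_text):
    """Return script_text with the unpinned nvidia channel pinned to cuda-11.7.0.

    Only lines that end with the unpinned command are changed, so lines that
    are already pinned are left alone and running this twice is harmless.
    """
    patched = []
    for line in script_text.splitlines(keepends=True):
        if line.rstrip().endswith(OLD):
            line = line.replace(OLD, NEW)
        patched.append(line)
    return "".join(patched)


if __name__ == "__main__":
    # Demonstration on a sample line; indentation is preserved.
    sample = "    " + OLD + "\n"
    print(pin_cuda_channel(sample), end="")
```

To apply it to the real script, one would read cryosparc_worker/bin/cryosparcw, keep a backup copy, and write the patched text back, then run install-3dflex as described above.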

Hi

I have no idea whether this is related to the previous fix, but here is an error reported by our users when using Topaz:

UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors.
This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor.
You may want to copy the array to protect its data or make it writeable before converting it to a tensor.
This type of warning will be suppressed for the rest of this program.
(Triggered internally at   /opt/conda/conda-bld/pytorch_1607370156314/work/torch/csrc/utils/tensor_numpy.cpp:141.)

and another one during extraction:

We have just released CryoSPARC v4.1.1 to address this issue.

Thanks for the update!

Hey everyone,

I see that v4.1.1 addressed this, but I have run into the same ‘no torch’ issue in CryoSPARC v4.1.2 on CentOS 7 when starting a 3DFlex Train job (Data Prep and Mesh Prep ran fine).

The precise wording in the log is: ModuleNotFoundError: No module named 'torch'

Before I attempt any of the above solutions, is there one that is currently recommended for v4.1.2?

Thanks!
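
Before trying any fix, it can help to confirm what the worker's Python environment actually sees. A small sketch follows, assuming the script is run with the worker environment's interpreter (for instance through cryosparcw call, which is used for pip further down this thread); torch_status is a hypothetical helper.

```python
import importlib.util


def torch_status():
    """Describe whether torch is importable and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch: not installed in this environment"
    try:
        import torch
    except Exception as exc:  # e.g. a broken or mismatched binary install
        return f"torch: installed but failed to import ({exc})"
    return f"torch: {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


if __name__ == "__main__":
    print(torch_status())
```

"Not installed" points at the install-3dflex step not having run in this environment, while "CUDA available: False" points at a driver or GPU visibility problem instead.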

Please try the following, without editing cryosparcw:

/path/to/cryosparc_worker/bin/cryosparcw install-3dflex 2>&1 | tee install_3dflex.log

Does this work?

The command runs with the following output:


Preparing transaction: …working… done
Verifying transaction: …working… done
Executing transaction: …working… done

==> WARNING: A newer version of conda exists. <==
current version: 4.12.0
latest version: 23.1.0

Please update conda by running

$ conda update -n base -c defaults conda

Found existing installation: pycuda 2020.1
Uninstalling pycuda-2020.1:
Successfully uninstalled pycuda-2020.1
Collecting torch
Downloading torch-1.13.1-cp38-cp38-manylinux1_x86_64.whl (887.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.4/887.4 MB 3.3 MB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-1.13.1
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
PyTorch not installed correctly, or NVIDIA GPU not detected.


Running 3DFlex Train fails upon initializing torch

This is using a Quadro P6000 with CUDA Version 10.1.243 (not sure what “found version 10020” maps to from the above error). Seems like the workstation needs an updated CUDA driver to properly install PyTorch?
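
On the "10020" question: CUDA encodes its version as a single integer, 1000 × major + 10 × minor, so 10020 would correspond to CUDA 10.2, assuming the number in that message uses this standard encoding. A short sketch of the decoding (decode_cuda_version is a name chosen for this example):

```python
def decode_cuda_version(encoded):
    """Split a CUDA integer version (1000*major + 10*minor) into (major, minor)."""
    major, remainder = divmod(int(encoded), 1000)
    return major, remainder // 10


if __name__ == "__main__":
    for value in (10020, 10010, 11070):
        major, minor = decode_cuda_version(value)
        print(f"{value} -> CUDA {major}.{minor}")
```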

EDIT: @wtempel seems like all jobs fail now on this machine, not just 3DFlex Train. See below for a Non-Uniform Refinement, and similar error seen for Ab Initio. Suggestion?

I had also run the command on a second workstation set up as a worker for the above workstation. The command ran with the following output:

Preparing transaction: …working… done
Verifying transaction: …working… done
Executing transaction: …working… done

==> WARNING: A newer version of conda exists. <==
current version: 4.12.0
latest version: 23.1.0

Please update conda by running

$ conda update -n base -c defaults conda

Found existing installation: pycuda 2020.1
Uninstalling pycuda-2020.1:
Successfully uninstalled pycuda-2020.1
Collecting torch
Downloading torch-1.13.1-cp38-cp38-manylinux1_x86_64.whl (887.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.4/887.4 MB 4.6 MB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-1.13.1
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
PyTorch not installed correctly, or NVIDIA GPU not detected.

I am able to run 3D Flex Train and other jobs without crashing (at least, thus far). The worker runs CentOS 7 with a GeForce RTX 2080 Ti and CUDA Version 10.2.89.

Please ensure your NVIDIA driver version is at least 460 (see guide).
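
To check a machine against that minimum, the driver version can be read from nvidia-smi. The sketch below assumes nvidia-smi is on the PATH; the 460 threshold is the one quoted above, and the helper names are illustrative.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 460  # minimum driver version quoted in this thread


def driver_major(version):
    """Return the major component of a driver version such as '460.32.03'."""
    return int(version.strip().split(".")[0])


def installed_driver_versions():
    """Ask nvidia-smi for each GPU's driver version; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    versions = installed_driver_versions()
    if not versions:
        print("nvidia-smi not found, or it reported no GPU")
    for version in versions:
        verdict = "OK" if driver_major(version) >= MIN_DRIVER_MAJOR else "too old"
        print(f"driver {version}: {verdict}")
```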

Hello, @wtempel. What commands are run during "install-3dflex"? Currently I am getting a Conda HTTP error due to the company firewall settings. Usually, when trying to create Conda environments, I need to add "--insecure" to bypass this issue. Is it possible to do so in this case?

@andreym you should be able to do this by adding pypi.org as a trusted host in pip and disabling SSL verification in conda. It should be something like this:

cryosparcw call pip config set global.trusted-host pypi.org
SSL_NO_VERIFY=1 cryosparcw install-3dflex

Let me know how that goes

@nfrasser Thank you for the suggestion. I have tried it; unfortunately, I am still running into the error:

Installing 3D Flex Refine dependencies…

Collecting package metadata (current_repodata.json): failed

CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/nvidia/label/cuda-11.8.0/linux-64/current_repodata.json

Elapsed: -

An HTTP error occurred when trying to retrieve this URL.

HTTP errors are often intermittent, and a simple retry will get you on your way.

Linux 64 :: Anaconda.org

It is not your proxy, as https://conda.anaconda.org/nvidia/label/cuda-11.8.0/linux-64/current_repodata.json lands at "The page you are looking for does not exist."

@andreym could you post the full output of the following command?

SSL_NO_VERIFY=1 cryosparcw call conda install -y cuda-nvcc=11.8 cuda-toolkit=11.8 -c nvidia/label/cuda-11.8.0 -v