Same here. Torch error during 3D-flex training after 4.1 upgrade:
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 80, in cryosparc_compute.run.main
File "/lmb/home/mjones/soft/171122/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 443, in get_run_function
runmod = importlib.import_module(".."+modname, __name__)
File "/lmb/home/mjones/soft/171122/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 19, in init cryosparc_compute.jobs.flex_refine.flexmod
ModuleNotFoundError: No module named 'torch'
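For anyone hitting the same traceback, a quick way to confirm whether torch is really missing from the worker environment is to ask that environment's own interpreter. This is only a minimal sketch: `module_available` is a helper name made up for illustration, and it assumes you run the script with the python binary inside cryosparc_worker_env rather than the system python.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be located
    by the interpreter that is running this script."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Shows which interpreter is in use, so you can confirm it is the worker env.
    print("interpreter:", sys.executable)
    print("torch importable:", module_available("torch"))
```

If this prints False under the worker interpreter, the ModuleNotFoundError is simply reporting that the 3DFlex dependencies were never installed into that environment.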
ANOTHER EDIT: That page says you don’t need CUDA 11.8; install-3dflex will install it itself. There’s another post in the Install, Configure, and Update forum about the error message you get with install-3dflex.
We were able to install and start a run, but eventually get this error which seems related to GPU memory:
cryosparc_compute.jobs.flex_refine.flexmod.TetraSVFunction.forward torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 10.76 GiB total capacity; 7.15 GiB already allocated; 940.94 MiB free; 9.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there a place where we should set max_split_size_mb?
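As far as PyTorch itself is concerned, max_split_size_mb is read from the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before CUDA is initialised in the process. The sketch below only shows the format of that variable; `set_max_split_size` is a hypothetical helper, 512 is an arbitrary value, and where CryoSPARC would need the variable exported (for example in the worker's config.sh) is an assumption I have not verified.

```python
import os


def set_max_split_size(mb: int) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF so that PyTorch's caching allocator
    does not split blocks larger than `mb` MiB.
    Must be called before the process initialises CUDA."""
    if mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    value = f"max_split_size_mb:{mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# 512 is only an illustrative value, not a recommendation.
set_max_split_size(512)
```

Note that in the message above reserved memory (9.57 GiB) is not much larger than allocated memory (7.15 GiB), so fragmentation may not be the main problem here; the job may simply need more than the 10.76 GiB this GPU has.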
Also, will you post back here when you have a fix for the 3D flex dips?
Is it expected that the worker environment is using the system gcc/g++ instead of the conda version?
I am running into potential conflicts between /usr/bin/gcc and the cuda/nvcc from conda on Ubuntu 20.04. I have removed the possibly conflicting Ubuntu packages (apt remove nvidia-cuda-toolkit…), but ./bin/cryosparcw install-3dflex keeps failing.
Then trying to revert with: cryoem@myrdal:~/cryosparc2/cryosparc_worker$ ./bin/cryosparcw forcedeps
yields:
...
gcc -pthread -B /home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -fwrapv -Wall -O3 -DNDEBUG -fPIC -DBOOST_ALL_NO_LIB=1 -DBOOST_THREAD_BUILD_DLL=1 -DBOOST_MULTI_INDEX_DISABLE_SERIALIZATION=1 -DBOOST_PYTHON_SOURCE=1 -Dboost=pycudaboost -DBOOST_THREAD_DONT_USE_CHRONO=1 -DPYGPU_PACKAGE=pycuda -DPYGPU_PYCUDA=1 -DHAVE_CURAND=1 -Isrc/cpp -Ibpl-subset/bpl_subset -I/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/core/include -I/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m -c src/cpp/cuda.cpp -o build/temp.linux-x86_64-3.7/src/cpp/cuda.o
In file included from src/cpp/cuda.cpp:4:
src/cpp/cuda.hpp:14:10: fatal error: cuda.h: No such file or directory
14 | #include <cuda.h>
| ^~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin/python3.7 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-3334l4mw/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-3334l4mw/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8lgu7bn9/install-record.txt --single-version-externally-managed --compile --install-headers /home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m/pycuda Check the logs for full command output.
check_install_deps.sh: 59: ERROR: installing python failed.
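The pycuda build fails because gcc cannot find cuda.h on its include path. A small check like the following can show which compiler and which CUDA headers are actually visible. It is a sketch under assumptions: the helper names are made up, the list of environment variables and directories is my guess at likely locations (CRYOSPARC_CUDA_PATH is assumed to be the variable set in the worker's config.sh), and it only looks for <root>/include/cuda.h.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def find_cuda_header(roots: Iterable[str]) -> List[Path]:
    """Return every <root>/include/cuda.h that exists among `roots`."""
    hits = []
    for root in roots:
        header = Path(root) / "include" / "cuda.h"
        if header.is_file():
            hits.append(header)
    return hits


def candidate_roots() -> List[str]:
    """Collect CUDA roots from common environment variables (assumed names)
    plus two conventional system locations."""
    roots = []
    for var in ("CUDA_HOME", "CUDA_PATH", "CRYOSPARC_CUDA_PATH"):
        if os.environ.get(var):
            roots.append(os.environ[var])
    roots += ["/usr/local/cuda", "/usr"]
    return roots


# Which compiler and nvcc come first on PATH, and which cuda.h files exist.
print("gcc :", shutil.which("gcc"))
print("nvcc:", shutil.which("nvcc"))
print("cuda.h:", find_cuda_header(candidate_roots()))
```

If the only cuda.h found is the one under /usr/include, that would be consistent with the build breaking once the Ubuntu nvidia-cuda-toolkit package is removed.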
I had to re-add nvidia-cuda-toolkit and the system-provided CUDA 10.1:
tru@myrdal:~$ dpkg -l nvidia-cuda-toolkit
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===================-============-============-=================================
ii nvidia-cuda-toolkit 10.1.243-3 amd64 NVIDIA CUDA development toolkit
tru@myrdal:~$ dpkg -l gcc g++
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-================-============-=================================
ii g++ 4:9.3.0-1ubuntu2 amd64 GNU C++ compiler
ii gcc 4:9.3.0-1ubuntu2 amd64 GNU C compiler
Now I can revert to the previous setup:
cryoem@myrdal:~/cryosparc2/cryosparc_worker$ ./bin/cryosparcw forcedeps
Checking dependencies...
Forcing dependencies to be reinstalled...
------------------------------------------------------------------------
Installing anaconda python...
------------------------------------------------------------------------
PREFIX=/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda
Unpacking payload ...
...
Extracting all conda packages...
------------------------------------------------------------------------
...................................................................................................................................................................................
------------------------------------------------------------------------
Done.
conda packages installation successful.
------------------------------------------------------------------------
Preparing to install all pip packages...
------------------------------------------------------------------------
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1.tar.gz
Preparing metadata (setup.py) ... done
Skipping wheel build for pycuda, due to binaries being disabled for it.
Installing collected packages: pycuda
Running setup.py install for pycuda ... done
Successfully installed pycuda-2020.1
------------------------------------------------------------------------
Done.
pip packages installation successful.
------------------------------------------------------------------------
Main dependency installation completed. Continuing...
------------------------------------------------------------------------
Completed.
Currently checking hash for ctffind
Forcing reinstall for dependency ctffind...
------------------------------------------------------------------------
ctffind 4.1.10 installation successful.
------------------------------------------------------------------------
Completed.
Currently checking hash for cudnn
Forcing reinstall for dependency cudnn...
------------------------------------------------------------------------
cudnn 8.1.0.77 for CUDA 11 installation successful.
------------------------------------------------------------------------
Completed.
Currently checking hash for gctf
Forcing reinstall for dependency gctf...
------------------------------------------------------------------------
Gctf v1.06 installation successful.
------------------------------------------------------------------------
Completed.
Completed dependency check.
Generating '/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/libtiff/tiff_h_4_4_0.py' from '/home/cryoem/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/../include/tiff.h'
No idea if this is related to the previous fix, but here is an error reported by our users when using Topaz:
UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors.
This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor.
You may want to copy the array to protect its data or make it writeable before converting it to a tensor.
This type of warning will be suppressed for the rest of this program.
(Triggered internally at /opt/conda/conda-bld/pytorch_1607370156314/work/torch/csrc/utils/tensor_numpy.cpp:141.)
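Note that this message is a UserWarning rather than an exception, so on its own it should not stop a job. It is emitted when a read-only NumPy array is converted to a tensor, and the usual remedy is the one the message suggests: copy the array first. The sketch below shows that idea with NumPy only; `as_writeable` is a hypothetical helper, and whether Topaz or CryoSPARC can be patched this way is not something I know.

```python
import numpy as np


def as_writeable(arr: np.ndarray) -> np.ndarray:
    """Return `arr` itself if it is writeable, otherwise a writeable copy."""
    return arr if arr.flags.writeable else arr.copy()


# Simulate a read-only array, e.g. one backed by a read-only buffer.
readonly = np.arange(4, dtype=np.float32)
readonly.flags.writeable = False

safe = as_writeable(readonly)
safe[0] = 1.0  # allowed; the read-only original is left untouched
```

An array prepared like this can then be handed to torch.from_numpy without triggering the warning.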
I see that v4.1.1 addressed this, but I have run into the same "no torch" issue in CryoSPARC v4.1.2 on CentOS 7 when starting a 3DFlex Train job (Data Prep and Mesh Prep ran fine).
The precise wording in the log is: ModuleNotFoundError: No module named 'torch'
Before I attempt any of the above solutions, is there one that is currently recommended for v4.1.2?
This is using a Quadro P6000 with CUDA Version 10.1.243 (not sure what “found version 10020” maps to from the above error). Seems like the workstation needs an updated CUDA driver to properly install PyTorch?
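On the "10020" question: CUDA reports versions as a single integer, 1000 × major + 10 × minor, so 10020 corresponds to CUDA 10.2. A minimal sketch of the decoding follows; `decode_cuda_version` is a made-up helper name, and whether the number in that error refers to the driver's supported version or to a toolkit is something I cannot tell from the message alone.

```python
from typing import Tuple


def decode_cuda_version(code: int) -> Tuple[int, int]:
    """Decode CUDA's integer version (1000 * major + 10 * minor)
    into a (major, minor) pair."""
    major, rest = divmod(code, 1000)
    minor = rest // 10
    return major, minor


major, minor = decode_cuda_version(10020)
print(f"CUDA {major}.{minor}")
```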
EDIT: @wtempel, it seems all jobs now fail on this machine, not just 3DFlex Train. See below for a Non-Uniform Refinement; a similar error is seen for Ab Initio. Any suggestions?
==> WARNING: A newer version of conda exists. <==
current version: 4.12.0
latest version: 23.1.0
Please update conda by running
$ conda update -n base -c defaults conda
Found existing installation: pycuda 2020.1
Uninstalling pycuda-2020.1:
Successfully uninstalled pycuda-2020.1
Collecting torch
Downloading torch-1.13.1-cp38-cp38-manylinux1_x86_64.whl (887.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.4/887.4 MB 4.6 MB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-1.13.1
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
PyTorch not installed correctly, or NVIDIA GPU not detected.
I am able to run 3D Flex Train and other jobs without crashing (at least, thus far). The worker is CentOS 7, GeForce RTX 2080 Ti with CUDA Version 10.2.89.
Hello, @wtempel. What commands are run during "install-3dflex"? Currently I am getting a Conda HTTP error due to our company firewall settings. Usually, when creating Conda environments, I need to add "--insecure" to bypass this issue. Is it possible to do so in this case?