Torch not found error in 4.3.1

I support a team that is running CryoSPARC on a single server with 4 GPUs. They recently asked me to get the 3DFlex modules working.

I tried installing the modules using the './bin/cryosparcw install-3dflex' command, which was unsuccessful at first, even after multiple forcedeps runs. I went ahead and upgraded from 4.2.1 to 4.3.1; that went better, and the module now appears to be properly installed, but we still get the following messages from the job the team is trying to run:

================= CRYOSPARCW ======= 2023-09-26 14:41:04.911479 =========
Project P9 Job J1868
Master kriosgpu.emsl.pnl.gov Port 39002

========= monitor process now starting main process at 2023-09-26 14:41:04.911532
MAINPROCESS PID 134394
MAIN PID 134394
flex_refine.run_train cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
**** handle exception rc
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 82, in cryosparc_compute.run.main
File "/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/jobregister.py", line 448, in get_run_function
runmod = importlib.import_module("…"+modname, name)
File "/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 1174, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 19, in init cryosparc_compute.jobs.flex_refine.flexmod
ModuleNotFoundError: No module named 'torch'
set status to failed
========= main process now complete at 2023-09-26 14:41:09.042473.
========= monitor process now complete at 2023-09-26 14:41:09.049811.

From this stack trace, it looks to me like the flexmod.py code is running in the server environment rather than in the worker environment, where I believe it should be running. But the stack trace also references parts of the worker environment, so I'm a bit confused about what to do next.
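For what it's worth, one sanity check I can run (the site-packages path below is just taken from the traceback above) is to look for torch in the worker environment's packages:

ls /cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/ | grep -i torch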

Can you help me debug the situation, and get this working for them?

Please note that cryosparcw forcedeps is expected to uninstall 3DFlex dependencies, so install-3dflex needs to be re-run after it.

The error

ModuleNotFoundError: No module named 'torch'

suggests that 3DFlex dependencies are either missing or broken.
Please can you confirm by running

nvidia-smi --query-gpu=name,driver_version --format=csv

that

  1. your GPU model is displayed correctly
  2. the driver version is 460 or higher (guide)

and then try again

/cryosparc/cryosparc2_worker/bin/cryosparcw forcedeps
/cryosparc/cryosparc2_worker/bin/cryosparcw install-3dflex 2>&1 | tee install_3dflex_20230927.log
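It may also be worth capturing the forcedeps output the same way, in case that is the step where things go wrong (same tee pattern as above; the log file name is just a suggestion):

/cryosparc/cryosparc2_worker/bin/cryosparcw forcedeps 2>&1 | tee forcedeps_20230927.log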

If the installation proceeds without error, but 3DFlex jobs still do not run, please post the output of the commands

csw=/cryosparc/cryosparc2_worker/bin/cryosparcw
$csw call /usr/bin/env | grep PATH
$csw call python -c "import torch, pycuda.driver; print(f'pycuda version? {pycuda.driver.get_version()}\nTorch version {torch.__version__}\nTorch CUDA available? {torch.cuda.is_available()}')"
$csw call which nvcc
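If the torch import succeeds, one more optional check along the same lines shows where torch is actually being loaded from:

$csw call python -c "import torch; print(torch.__file__)"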

Driver versions:
name, driver_version
Tesla V100-PCIE-32GB, 515.65.01
Tesla V100-PCIE-32GB, 515.65.01
Tesla V100-PCIE-32GB, 515.65.01
Tesla V100-PCIE-32GB, 515.65.01

They all seem to be over 460 to me…

At this point, the forcedeps stage produces this error:

In file included from src/cpp/cuda.cpp:4:0:
      src/cpp/cuda.hpp:14:18: fatal error: cuda.h: No such file or directory
       #include <cuda.h>
                        ^
      compilation terminated.

But the install of 3dflex seems to work. See attached file.
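Since the pycuda build complains about a missing cuda.h, I figure a quick way to check whether the toolkit headers exist at the configured CRYOSPARC_CUDA_PATH (its value is shown in the env output below) is something like:

ls /usr/local/cuda-11.8/include/cuda.h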

The test still fails:

cryosparcm test workers P9 --test gpu --test-pytorch
Using project P9
Specifying gpu test
Enabling PyTorch test
Running worker tests...
2023-09-27 11:12:12,911 log                  CRITICAL | Worker test results
2023-09-27 11:12:12,911 log                  CRITICAL | kriosgpu.emsl.pnl.gov
2023-09-27 11:12:12,911 log                  CRITICAL |   ✕ GPU
2023-09-27 11:12:12,912 log                  CRITICAL |     Error: No module named 'torch'
2023-09-27 11:12:12,912 log                  CRITICAL |     See P9 J1880 for more information

Paths:

MANPATH=/usr/share/lmod/lmod/share/man:
MODULEPATH_ROOT=/usr/share/modulefiles
LD_LIBRARY_PATH=/home/svc-emslkrios/cryosparc/cryosparc_worker/deps/external/cudnn/lib
PATH=/home/svc-emslkrios/cryosparc/cryosparc_worker/bin:/home/svc-emslkrios/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/svc-emslkrios/cryosparc/cryosparc_worker/deps/anaconda/condabin:/home/svc-emslkrios/.pyenv/bin:/cryosparc/cryosparc2_master/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/apps/bin:/msc/krios/bin:/home/svc-emslkrios/.local/bin:/home/svc-emslkrios/bin
MODULEPATH=/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core
CRYOSPARC_PATH=/home/svc-emslkrios/cryosparc/cryosparc_worker/bin
PYTHONPATH=/home/svc-emslkrios/cryosparc/cryosparc_worker
STEPPATH=/etc/step/
CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.8

CUDA check:

pycuda version? (11, 8, 0)
Torch version 1.13.1+cu117
Torch CUDA available? True

NVCC:
~/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin/nvcc

I wonder whether

/cryosparc/cryosparc2_worker/

(which seems to be used by CryoSPARC) and

/home/svc-emslkrios/cryosparc/cryosparc_worker/

(which may have been used in some of the updates and tests) are independent directories?
What are the outputs of these commands:

ls -al /cryosparc/cryosparc2_worker/
ls -al /home/svc-emslkrios/cryosparc/cryosparc_worker/
cryosparcm cli "get_scheduler_targets()"
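In particular, the worker_bin_path field in the get_scheduler_targets() output shows which cryosparcw the scheduler actually invokes for jobs. If the full output is unwieldy, something like this (just a convenience) pulls that field out:

cryosparcm cli "get_scheduler_targets()" | grep -o "'worker_bin_path': '[^']*'"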

Woot! The output of get_scheduler_targets() showed me that I had been upgrading the wrong worker directory the whole time:

[{'cache_path': '/gpustorage/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None,
  'gpus': [{'id': 0, 'mem': 34089926656, 'name': 'Tesla V100-PCIE-32GB'},
           {'id': 1, 'mem': 34089926656, 'name': 'Tesla V100-PCIE-32GB'},
           {'id': 2, 'mem': 34089926656, 'name': 'Tesla V100-PCIE-32GB'},
           {'id': 3, 'mem': 34089926656, 'name': 'Tesla V100-PCIE-32GB'}],
  'hostname': 'kriosgpu.emsl.pnl.gov', 'lane': 'default', 'monitor_port': None,
  'name': 'kriosgpu.emsl.pnl.gov', 'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
                             22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
                             42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63],
                     'GPU': [0, 1, 2, 3],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
                             24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]},
  'ssh_str': 'svc-emslkrios@kriosgpu.emsl.pnl.gov', 'title': 'Worker node kriosgpu.emsl.pnl.gov',
  'type': 'node', 'worker_bin_path': '/cryosparc/cryosparc2_worker/bin/cryosparcw'}]

Thank you for pointing that out.

After upgrading the correct worker path, the tests work fine.
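For anyone who lands here later: the gist of the fix, assuming a layout like ours, is to run the dependency and 3DFlex install steps against the directory named in worker_bin_path rather than whichever worker copy happens to be on your shell PATH, e.g.:

/cryosparc/cryosparc2_worker/bin/cryosparcw forcedeps
/cryosparc/cryosparc2_worker/bin/cryosparcw install-3dflex
cryosparcm test workers P9 --test gpu --test-pytorch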
