I support a team that runs CryoSPARC on a single server with 4 GPUs. They recently asked me to get the 3DFlex modules working.
I tried installing the modules using the './bin/cryosparcw install-3dflex' command, which was unsuccessful at first, even with multiple forcedeps
commands afterward. I then upgraded from 4.2.1 to 4.3.1, after which the install went through and it is now properly installed, but we still get the following messages from the job the team is trying to run:
================= CRYOSPARCW ======= 2023-09-26 14:41:04.911479 =========
Project P9 Job J1868
Master kriosgpu.emsl.pnl.gov Port 39002
========= monitor process now starting main process at 2023-09-26 14:41:04.911532
MAINPROCESS PID 134394
MAIN PID 134394
flex_refine.run_train cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
**** handle exception rc
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 82, in cryosparc_compute.run.main
File "/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/jobregister.py", line 448, in get_run_function
runmod = importlib.import_module("…"+modname, name)
File "/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 1174, in exec_module
File "", line 219, in _call_with_frames_removed
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 12, in init cryosparc_compute.jobs.flex_refine.run_train
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 19, in init cryosparc_compute.jobs.flex_refine.flexmod
ModuleNotFoundError: No module named 'torch'
set status to failed
========= main process now complete at 2023-09-26 14:41:09.042473.
========= monitor process now complete at 2023-09-26 14:41:09.049811.
From this stack trace, it looks to me like the flexmod.py code is running in the server (cryosparc_master) environment and not in the worker environment, where I believe it should be running. But the stack trace also contains paths from the worker environment (cryosparc_worker_env), so I'm a bit confused about what to do next.
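To narrow down which environment is actually missing torch, a check along these lines could be run with each environment's Python interpreter. This is only a minimal sketch: the helper name `describe_module` is hypothetical, it uses only the Python standard library, and it assumes you can start the interpreter that belongs to the environment you want to inspect.

```python
import importlib.util
import sys


def describe_module(name):
    """Return (found, origin) for a top-level module without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return False, None
    return True, spec.origin


if __name__ == "__main__":
    # Show which interpreter and environment prefix this script runs under.
    print("interpreter:", sys.executable)
    print("prefix:", sys.prefix)
    # Report whether torch is visible from this interpreter.
    found, origin = describe_module("torch")
    print(f"torch: found={found} origin={origin}")
```

If the interpreter path points into cryosparc_worker_env and torch is still reported as not found, that would suggest the package is missing from the worker environment itself rather than the job running in the wrong environment.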
Can you help me debug the situation and get this working for them?