Cluster worker update error - pycuda

cryosparcm updated successfully, but I get the following error when updating cryosparcw. I tried an --override reinstall of cryosparcw, but got the same result.

Attempted jobs fail with “No module named ‘pycuda’” as expected based on the above errors, so we currently can’t run any jobs on this cluster.

----------------------------------------------------------

 Preparing to install all pip packages...
  ------------------------------------------------------------------------
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1.tar.gz
  Preparing metadata (setup.py) ... done
Skipping wheel build for pycuda, due to binaries being disabled for it.
Installing collected packages: pycuda
    Running setup.py install for pycuda ... error
    ERROR: Command errored out with exit status 1:
     command: /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/bin/python3.7 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-i9cmz_t8/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-i9cmz_t8/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-xvun8a2w/install-record.txt --single-version-externally-managed --compile --install-headers /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m/pycuda
         cwd: /tmp/pip-req-build-i9cmz_t8/
    Complete output (114 lines):
    *************************************************************
    *** I have detected that you have not run configure.py.
    *************************************************************
    *** Additionally, no global config files were found.
    *** I will go ahead with the default configuration.
    *** In all likelihood, this will not work out.
    ***
    *** See README_SETUP.txt for more information.
    ***
    *** If the build does fail, just re-run configure.py with the
    *** correct arguments, and then retry. Good luck!
    *************************************************************
    *** HIT Ctrl-C NOW IF THIS IS NOT WHAT YOU WANT
    *************************************************************
    Continuing in 10 seconds...
    Continuing in 9 seconds...
    Continuing in 8 seconds...
    Continuing in 7 seconds...
    Continuing in 6 seconds...
    Continuing in 5 seconds...
    Continuing in 4 seconds...
    Continuing in 3 seconds...
    Continuing in 2 seconds...
    Continuing in 1 seconds...
    /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/distutils/dist.py:274: UserWarning: Unknown distribution option: 'test_requires'
      warnings.warn(msg)
    running install
    /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      setuptools.SetuptoolsDeprecationWarning,
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.7
    creating build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/__init__.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/_cluda.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/_mymako.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/autoinit.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/characterize.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/compiler.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/cumath.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/curandom.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/debug.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/driver.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/elementwise.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/gpuarray.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/reduction.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/scan.py -> build/lib.linux-x86_64-3.7/pycuda
    copying pycuda/tools.py -> build/lib.linux-x86_64-3.7/pycuda
    creating build/lib.linux-x86_64-3.7/pycuda/gl
    copying pycuda/gl/__init__.py -> build/lib.linux-x86_64-3.7/pycuda/gl
    copying pycuda/gl/autoinit.py -> build/lib.linux-x86_64-3.7/pycuda/gl
    creating build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/__init__.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/cg.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/coordinate.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/inner.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/operator.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/packeted.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    copying pycuda/sparse/pkt_build.py -> build/lib.linux-x86_64-3.7/pycuda/sparse
    creating build/lib.linux-x86_64-3.7/pycuda/compyte
    copying pycuda/compyte/__init__.py -> build/lib.linux-x86_64-3.7/pycuda/compyte
    copying pycuda/compyte/array.py -> build/lib.linux-x86_64-3.7/pycuda/compyte
    copying pycuda/compyte/dtypes.py -> build/lib.linux-x86_64-3.7/pycuda/compyte
    running egg_info
    writing pycuda.egg-info/PKG-INFO
    writing dependency_links to pycuda.egg-info/dependency_links.txt
    writing requirements to pycuda.egg-info/requires.txt
    writing top-level names to pycuda.egg-info/top_level.txt
    reading manifest file 'pycuda.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no files found matching 'doc/source/_static/*.css'
    warning: no files found matching 'doc/source/_templates/*.html'
    warning: no files found matching '*.cpp' under directory 'bpl-subset/bpl_subset/boost'
    warning: no files found matching '*.html' under directory 'bpl-subset/bpl_subset/boost'
    warning: no files found matching '*.inl' under directory 'bpl-subset/bpl_subset/boost'
    warning: no files found matching '*.txt' under directory 'bpl-subset/bpl_subset/boost'
    warning: no files found matching '*.h' under directory 'bpl-subset/bpl_subset/libs'
    warning: no files found matching '*.ipp' under directory 'bpl-subset/bpl_subset/libs'
    warning: no files found matching '*.pl' under directory 'bpl-subset/bpl_subset/libs'
    adding license file 'LICENSE'
    writing manifest file 'pycuda.egg-info/SOURCES.txt'
    creating build/lib.linux-x86_64-3.7/pycuda/cuda
    copying pycuda/cuda/pycuda-complex-impl.hpp -> build/lib.linux-x86_64-3.7/pycuda/cuda
    copying pycuda/cuda/pycuda-complex.hpp -> build/lib.linux-x86_64-3.7/pycuda/cuda
    copying pycuda/cuda/pycuda-helpers.hpp -> build/lib.linux-x86_64-3.7/pycuda/cuda
    copying pycuda/sparse/pkt_build_cython.pyx -> build/lib.linux-x86_64-3.7/pycuda/sparse
    running build_ext
    building '_driver' extension
    creating build/temp.linux-x86_64-3.7
    creating build/temp.linux-x86_64-3.7/src
    creating build/temp.linux-x86_64-3.7/src/cpp
    creating build/temp.linux-x86_64-3.7/src/wrapper
    creating build/temp.linux-x86_64-3.7/bpl-subset
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/python
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/python/src
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/python/src/converter
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/python/src/object
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/smart_ptr
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/smart_ptr/src
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/system
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/system/src
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/thread
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/thread/src
    creating build/temp.linux-x86_64-3.7/bpl-subset/bpl_subset/libs/thread/src/pthread
    gcc -pthread -B /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -fwrapv -Wall -O3 -DNDEBUG -fPIC -DBOOST_ALL_NO_LIB=1 -DBOOST_THREAD_BUILD_DLL=1 -DBOOST_MULTI_INDEX_DISABLE_SERIALIZATION=1 -DBOOST_PYTHON_SOURCE=1 -Dboost=pycudaboost -DBOOST_THREAD_DONT_USE_CHRONO=1 -DPYGPU_PACKAGE=pycuda -DPYGPU_PYCUDA=1 -DHAVE_CURAND=1 -Isrc/cpp -Ibpl-subset/bpl_subset -I/cm/shared/apps/cuda11.8/toolkit/11.8.0/include -I/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/core/include -I/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m -c src/cpp/cuda.cpp -o build/temp.linux-x86_64-3.7/src/cpp/cuda.o
    In file included from src/cpp/cuda.cpp:4:
    src/cpp/cuda.hpp:23:10: fatal error: cudaProfiler.h: No such file or directory
     #include <cudaProfiler.h>
              ^~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/bin/python3.7 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-i9cmz_t8/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-i9cmz_t8/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-xvun8a2w/install-record.txt --single-version-externally-managed --compile --install-headers /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/include/python3.7m/pycuda Check the logs for full command output.
  ------------------------------------------------------------------------
    Done.
    pip packages installation successful.
  ------------------------------------------------------------------------
  Main dependency installation completed. Continuing...
  ------------------------------------------------------------------------
Completed.
Currently checking hash for ctffind
Forcing reinstall for dependency ctffind...
  ------------------------------------------------------------------------
  ctffind 4.1.10 installation successful.
  ------------------------------------------------------------------------
Completed.
Currently checking hash for cudnn
Forcing reinstall for dependency cudnn...
  ------------------------------------------------------------------------
  cudnn 8.1.0.77 for CUDA 11 installation successful.
  ------------------------------------------------------------------------
Completed.
Currently checking hash for gctf
Forcing reinstall for dependency gctf...
  ------------------------------------------------------------------------
  Gctf v1.06 installation successful.
  ------------------------------------------------------------------------
Completed.
Completed dependency check.
Generating '/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/libtiff/tiff_h_4_4_0.py' from '/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/../include/tiff.h'

Successfully updated.

Update:

I rolled back to 3.4.0 and this appears to be installing fine without any errors.

I had a user submit a test job, but it seems to be getting hung up here:

Launching job on lane vision target vision ...

Launching job on cluster vision

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J201 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J201/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 24.0 - the amount of RAM needed in GB
## /tank/colemanlab/jcoleman/cryosparc/P5/J201 - absolute path to the job directory
## /tank/colemanlab/jcoleman/cryosparc/P5 - absolute path to the project dir
## /tank/colemanlab/jcoleman/cryosparc/P5/J201/job.log - absolute path to the log file for the job
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P120 --job J201 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P120 - uid of the project
## J201 - uid of the job
## coleman - name of the user that created the job (may contain spaces)
## coleman1@pitt.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P120_J201
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p defq
#SBATCH --mem=24000MB
#SBATCH -o /tank/colemanlab/jcoleman/cryosparc/P5/J201/out.txt
#SBATCH -e /tank/colemanlab/jcoleman/cryosparc/P5/J201/err.txt
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J201 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J201/job.log 2>&1
==========================================================================
==========================================================================

-------- Submission command: sbatch /tank/colemanlab/jcoleman/cryosparc/P5/J201/queue_sub_script.sh

-------- Cluster Job ID: 23834

-------- Queued on cluster at 2022-11-02 12:07:21.426008

-------- Job status at 2022-11-02 12:07:21.476359
           JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           23834      defq cryospar cryospar PD       0:00      1 (None)

[CPU: 68.5 MB] Project P120 Job J201 Started

[CPU: 68.5 MB] Master running v3.4.0, worker running v3.4.0

[CPU: 68.8 MB] Working in directory: /tank/colemanlab/jcoleman/cryosparc/P5/J201

[CPU: 68.8 MB] Running on lane vision

[CPU: 68.8 MB] Resources allocated:

[CPU: 68.8 MB] Worker: vision

[CPU: 68.8 MB] CPU : [0, 1, 2, 3]

[CPU: 68.8 MB] GPU : [0]

[CPU: 68.8 MB] RAM : [0, 1, 2]

[CPU: 68.8 MB] SSD : True

[CPU: 68.8 MB] --------------------------------------------------------------

[CPU: 68.8 MB] Importing job module for job type nonuniform_refine_new...

[CPU: 229.5 MB] Job ready to run

[CPU: 229.5 MB] ***************************************************************

[CPU: 485.2 MB] Using random seed of 1607127199

[CPU: 485.2 MB] Loading a ParticleStack with 154060 items...

Any information in:
P5/J201/out.txt
or
P5/J201/err.txt?

Greetings.

The files are as follows:

P5/J201/err.txt

/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
slurmstepd: error: *** JOB 23828 ON node03 CANCELLED AT 2022-11-02T11:59:03 ***

P5/J201/out.txt

File exists but is empty.


So, we tried restarting the cryosparc master process and rebuilding the worker process (with --override), and now the job gets stuck here without entering the queue. I can confirm that the SLURM queue is working fine for other jobs though.


License is valid.

Launching job on lane vision target vision …

Launching job on cluster vision

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 24.0 - the amount of RAM needed in GB
## /tank/colemanlab/jcoleman/cryosparc/P5/J203 - absolute path to the job directory
## /tank/colemanlab/jcoleman/cryosparc/P5 - absolute path to the project dir
## /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log - absolute path to the log file for the job
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P120 - uid of the project
## J203 - uid of the job
## coleman - name of the user that created the job (may contain spaces)
## coleman1@pitt.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P120_J203
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p defq
#SBATCH --mem=24000MB
#SBATCH -o /tank/colemanlab/jcoleman/cryosparc/P5/J203/out.txt
#SBATCH -e /tank/colemanlab/jcoleman/cryosparc/P5/J203/err.txt
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log 2>&1
==========================================================================
==========================================================================

-------- Submission command: sbatch /tank/colemanlab/jcoleman/cryosparc/P5/J203/queue_sub_script.sh

-------- Cluster Job ID: 23837

-------- Queued on cluster at 2022-11-03 09:42:33.132614

-------- Job status at 2022-11-03 09:42:33.271697
           JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           23837      defq cryospar cryospar CF       0:00      1 node02

[CPU: 69.6 MB] Project P120 Job J203 Started

Additionally, for this last job, we get the same outputs from err.txt:


/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found
/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found

Your sbatch script includes a call to nvidia-smi. Are you sure that nvidia-smi is installed and included in the $PATH on the compute node?

Hmmm, we haven’t changed the location of nvidia-smi or any path variables.

If I ssh to a compute node, I need to do:

[root@vision cryosparc2_worker]# module load cuda10.2/toolkit/10.2.89
[root@vision cryosparc2_worker]# which nvidia-smi
/cm/local/apps/cuda/libs/current/bin/nvidia-smi

I’ve never had to explicitly set the path on the compute node.

Is it possible that the worker process needs to be pointed explicitly to the nvidia-smi location? Is there a way to do this?

Thanks.

I wonder whether the cluster job’s failure and the earlier update error are related.
You reported earlier that cluster jobs were running (albeit with other problems; see Discrete GPU usage - #3 by yodamoppet).
Was there another nvidia-smi executable present (in your path) then, or has the environment initialization changed since then?

Hi.

We actually haven't changed anything in the environment, other than the attempt to upgrade to v4 described above, which resulted in rolling back to v3.4 and our current state of not being able to run jobs. Prior to this, we were running jobs fine, albeit with occasional discrete GPU issues. Some jobs with large caches were also having issues; you suggested clearing/resetting the cache, which we have done.

We have been testing cgroups on other systems but haven't implemented them on this main cluster yet. That should solve the GPU issue once implemented on this system. In the meantime, it affects only some jobs; many run fine.

Our cluster stack is built on Bright Cluster Manager, so the location of nvidia-smi and environment variables have not changed.

How can we further diagnose the current issue? And, if it’s just an nvidia-smi issue in the slurm script created by cryosparc, how can we set the location of nvidia-smi that it uses? I’d really like to get our users up and running again.

CryoSPARC creates cluster scripts based on the cluster_script.sh template and cluster_info.json definitions that were uploaded to the CryoSPARC database with
cryosparcm cluster connect during configuration of the CryoSPARC instance.
It is likely that the bundled examples require significant customizations before upload.
For the sake of troubleshooting, one might assume for a moment that this is just an nvidia-smi issue (it may well not be) and:

  1. keep records and backups of any configuration files in case you can later identify the true, underlying cause of the problems and want to roll back any unhelpful configuration changes
  2. as cryosparcuser (or whatever Linux account owns the CryoSPARC instance), use srun to get an interactive shell on a GPU node
  3. try (a sketch of steps 2-3 is shown after this list)
    /cm/local/apps/cuda/libs/current/bin/nvidia-smi or
    module load cuda10.2/toolkit/10.2.89 && nvidia-smi
  4. The outcome of these tests may suggest a modification of the (carefully backed up) cluster_script.sh file. Instead of overwriting the configuration of the existing cluster lane, one may temporarily connect an additional (testing) target under a different "name" in cluster_info.json.
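
A minimal sketch of steps 2-3, assuming the defq partition and the module name quoted earlier in this thread:

    # run as the Linux account that owns the CryoSPARC instance
    srun -p defq --gres=gpu:1 --pty bash
    # then, on the allocated GPU node:
    /cm/local/apps/cuda/libs/current/bin/nvidia-smi
    module load cuda10.2/toolkit/10.2.89 && nvidia-smi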

I’ve had the same issue on one system. It does not appear related to nvidia-smi at all*, but in fact is PyCUDA failing to build, which cascades down through everything else. I’ve not yet found a fix or workaround for it. The painful thing is, the install/update process continues, and actually reports success at the end, so unless you’re watching like a hawk it’s easy to miss.

It doesn’t happen on Ubuntu 20.04 or 22.04 for me, but does happen on Arch. I’ve been very slow to update other systems to 4.0.2 as a result. I should test out our CentOS 7 and CentOS Stream boxes.

I think it’s a g++ issue.

*Because pycuda builds and works correctly in a non-cryoSPARC Python environment, nvidia-smi works correctly in the shell, and RELION and other GPU applications run without issue (and compile fresh without problems as well…)

I’ll tinker some more over the weekend if I get time.

Greetings.

Thanks for the troubleshooting info.

As cryosparc_user, simply running nvidia-smi doesn’t work:

[cryosparc_user@node03 bin]$ /cm/local/apps/cuda/libs/current/bin/nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

However, loading the module and running it works fine:

[cryosparc_user@node03 bin]$ module load cuda10.2/toolkit/10.2.89
[cryosparc_user@node03 bin]$ nvidia-smi
Fri Nov 4 08:11:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     | etc…


Our cluster_script.sh template hasn’t had problems before, but I suppose I could add the "module load … " statement there to provide access to nvidia-smi.

Do you concur?

If so, how can I update the database after adding this?

rbs_sci mentioned PyCUDA build failure. We had this problem with 4.x, which is why we rolled back to 3.4. I believe pycuda built correctly with 3.4, but is there a way to verify this?

Thanks so much.

Yes. On its own line, just above the nvidia-smi test.
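
In the cluster_script.sh template, the placement would look something like this (a sketch; {{ job_dir_abs }} is the variable name used in the stock SLURM template and may differ in a customized one):

    #SBATCH -o {{ job_dir_abs }}/out.txt
    #SBATCH -e {{ job_dir_abs }}/err.txt

    module load cuda10.2/toolkit/10.2.89   # added line

    available_devs=""
    # ... the nvidia-smi device-selection loop continues unchanged ...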

cryosparcm cluster connect
I recommend adding an additional test lane (instead of replacing the existing one; see the guide) and would like to reiterate the earlier advice about keeping backups of the configuration files.
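
A minimal sketch of setting up such a test lane, assuming cryosparcm cluster dump and cryosparcm cluster connect read and write cluster_info.json and cluster_script.sh in the current working directory:

    # run as the CryoSPARC Linux account on the master
    mkdir vision-testing && cd vision-testing
    cryosparcm cluster dump vision      # copies the existing lane's cluster_info.json and cluster_script.sh here
    # edit cluster_info.json: change "name" (and "title") to "vision-testing"
    # edit cluster_script.sh: add the module load line above the nvidia-smi loop
    cryosparcm cluster connect          # registers the edited files as a new lane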

May I ask:

  1. What was the full command used for installation or update to version “4.x”?
  2. On which host (worker or master) was that command executed?
  3. What distribution and version of Linux is running on that machine?

There is. Please see instructions for running a test workflow.
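
Besides the full test workflow, a quicker smoke test is possible on a GPU node (a sketch; paths as elsewhere in this thread, and both commands assume the standard worker CLI):

    /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw gpulist
    # or, to surface the raw import error, if any:
    /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw call python -c "import pycuda.driver as drv; drv.init(); print(drv.Device.count(), 'GPU(s) visible')"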

I added the testing lane (vision-testing) and included the “module load” statement for nvidia-smi in the slurm script template for that lane.

I then asked the same user to run a test job. This time err.txt and out.txt are empty, but the job still appears to hang:

Here is the user provided output from the job:

License is valid.

Launching job on lane vision-testing target vision-testing ...

Launching job on cluster vision-testing

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J204 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 24.0 - the amount of RAM needed in GB
## /tank/colemanlab/jcoleman/cryosparc/P5/J204 - absolute path to the job directory
## /tank/colemanlab/jcoleman/cryosparc/P5 - absolute path to the project dir
## /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log - absolute path to the log file for the job
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P120 --job J204 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P120 - uid of the project
## J204 - uid of the job
## coleman - name of the user that created the job (may contain spaces)
## coleman1@pitt.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P120_J204
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p defq
#SBATCH --mem=24000MB
#SBATCH -o /tank/colemanlab/jcoleman/cryosparc/P5/J204/out.txt
#SBATCH -e /tank/colemanlab/jcoleman/cryosparc/P5/J204/err.txt
module load cuda10.2/toolkit/10.2.89
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J204 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log 2>&1
==========================================================================
==========================================================================

-------- Submission command: sbatch /tank/colemanlab/jcoleman/cryosparc/P5/J204/queue_sub_script.sh

-------- Cluster Job ID: 23853

-------- Queued on cluster at 2022-11-04 10:07:41.330035

-------- Job status at 2022-11-04 10:07:41.350174
           JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           23853      defq cryospar cryospar PD       0:00      1 (None)

[CPU: 68.5 MB] Project P120 Job J204 Started

[CPU: 68.5 MB] Master running v3.4.0, worker running v3.4.0

[CPU: 68.8 MB] Working in directory: /tank/colemanlab/jcoleman/cryosparc/P5/J204

[CPU: 68.8 MB] Running on lane vision-testing

[CPU: 68.8 MB] Resources allocated:

[CPU: 68.8 MB] Worker: vision-testing

[CPU: 68.8 MB] CPU : [0, 1, 2, 3]

[CPU: 68.8 MB] GPU : [0]

[CPU: 68.8 MB] RAM : [0, 1, 2]

[CPU: 68.8 MB] SSD : True

[CPU: 68.8 MB] --------------------------------------------------------------

[CPU: 68.8 MB] Importing job module for job type nonuniform_refine_new...

How can we troubleshoot this further?

Please can you post the content of
/tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log.

Sure thing. Here it is:

[root@vision ~]# cat /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log


================= CRYOSPARCW =======  2022-11-04 10:07:43.084647  =========
Project P120 Job J204
Master vision.structbio.pitt.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 350310
MAIN PID 350310
refine.newrun cryosparc_compute.jobs.jobregister
/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py:66: UserWarning: Failed to import the CUDA driver interface, with an error message indicating that the version of your CUDA header does not match the version of your CUDA driver.
  warn("Failed to import the CUDA driver interface, with an error "
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_worker/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1961, in get_gpu_info
    import pycuda.driver as cudrv
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/_driver.cpython-37m-x86_64-linux-gnu.so: undefined symbol: cuDevicePrimaryCtxRelease_v2
libgcc_s.so.1 must be installed for pthread_cancel to work
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw: line 120: 350309 Aborted                 python -c "import cryosparc_compute.run as run; run.run()" "$@"

After a bit of tinkering, I was able to solve this by forcing pycuda to rebuild. We are now up and running again.
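
For anyone hitting the same thing, a rebuild can be forced along one of these lines (a sketch; the CUDA path is inferred from the cuda10.2 module name used earlier and should be replaced with the toolkit path set in your worker's config.sh):

    # re-point the worker at a CUDA toolkit; this recompiles pycuda against it
    /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw newcuda /cm/shared/apps/cuda10.2/toolkit/10.2.89
    # or force-reinstall all worker dependencies, pycuda included
    /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw forcedeps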

I had previously forced a reinstallation of the worker version 3.4.0, and this showed pycuda as building cleanly with no errors, so I’m not quite sure why this was necessary.

We appreciate the guidance from wtempel and rbs_sci. All of this discussion pointed us in the right direction for solving this problem.


Update from me also. With the 4.0.3 update, I wondered whether there was a minor change that might make v4 install. Nope, it failed at the same point again: pycuda 2020.1.

The Master and Worker installed fine except for pycuda. I tried different versions of CUDA and different versions of gcc, and all failed the same way. I eval'd into the worker environment and installed pycuda 2022, connected the worker node, and the T20S workflow goes all the way through without issue (except that every job is submitted as requiring an SSD, even though --nossd was set during install). An apoferritin test set also runs through to 1.97 Angstrom without issue (which is where it tends to stop in cryoSPARC). Currently running a lower-symmetry example from a different 'scope through.
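
For reference, the workaround amounts to something like this (a sketch; the worker path is a placeholder and the version shown is just one of the 2022 pycuda releases):

    # drop into the worker's conda environment, then install a newer pycuda than the bundled 2020.1
    eval $(/path/to/cryosparc_worker/bin/cryosparcw env)
    pip install pycuda==2022.1
    # note: cryoSPARC bundles and expects pycuda 2020.1, so treat this as unsupported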

The errors thrown by pycuda are not very helpful, particularly since 2022 builds fine but the 2020 release bundled with cryoSPARC doesn't.

That said, I’m not really happy with running a version of pycuda which cryoSPARC wasn’t designed for… even though everything appears to work. That system was only running Arch because it was some brand-new hardware which at the time would not boot on Ubuntu or CentOS. It’s time I reinstalled it.

Thank you for the update and the suggestion about T20S workflow particle caching.
Please can you confirm that you observed the pycuda installation error on Arch, but not on Ubuntu or CentOS?

Greetings.

I think this question was directed at rbs_sci, but for the sake of thread completeness, our system is CentOS 7.