Cluster worker update error - pycuda

Greetings.

The files are as follows:

P5/J201/err.txt

/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found

(the line above is repeated 16 times in err.txt)

slurmstepd: error: *** JOB 23828 ON node03 CANCELLED AT 2022-11-02T11:59:03 ***

P5/J201/out.txt

File exists but is empty.


So, we tried restarting the CryoSPARC master process and rebuilding the worker installation (with --override), and now the job gets stuck at the point shown below. I can confirm that the SLURM queue is working fine for other jobs, though.


License is valid.

Launching job on lane vision target vision …

Launching job on cluster vision

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
##     Note: the code will use this many GPUs starting from dev id 0
##     the cluster scheduler or this script have the responsibility
##     of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##     using the correct cluster-allocated GPUs.
## 24.0 - the amount of RAM needed in GB
## /tank/colemanlab/jcoleman/cryosparc/P5/J203 - absolute path to the job directory
## /tank/colemanlab/jcoleman/cryosparc/P5 - absolute path to the project dir
## /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log - absolute path to the log file for the job
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P120 - uid of the project
## J203 - uid of the job
## coleman - name of the user that created the job (may contain spaces)
## coleman1@pitt.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P120_J203
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p defq
#SBATCH --mem=24000MB
#SBATCH -o /tank/colemanlab/jcoleman/cryosparc/P5/J203/out.txt
#SBATCH -e /tank/colemanlab/jcoleman/cryosparc/P5/J203/err.txt
available_devs=""
for devidx in $(seq 0 15); do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log 2>&1
==========================================================================
==========================================================================

-------- Submission command: sbatch /tank/colemanlab/jcoleman/cryosparc/P5/J203/queue_sub_script.sh

-------- Cluster Job ID: 23837

-------- Queued on cluster at 2022-11-03 09:42:33.132614

-------- Job status at 2022-11-03 09:42:33.271697
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             23837      defq cryospar cryospar CF       0:00      1 node02

[CPU: 69.6 MB] Project P120 Job J203 Started

Additionally, for this last job, we get the same output in err.txt:


/cm/local/apps/slurm/var/spool/job23837/slurm_script: line 35: nvidia-smi: command not found

(the line above is repeated 16 times)

Your sbatch script includes a call to nvidia-smi. Are you sure that nvidia-smi is installed and included in the $PATH on the compute node?
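If it helps, a quick way to confirm this on a compute node (a sketch; the partition name is taken from your script and may need adjusting):

    # interactive shell on a GPU node, as the Linux user that owns the CryoSPARC instance
    srun -p defq --gres=gpu:1 --pty bash
    # then, on the node:
    which nvidia-smi || echo "nvidia-smi is not in PATH"
    echo "$PATH"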

Hmmm, we haven’t changed the location of nvidia-smi or any path variables.

If I ssh to a compute node, I need to do:

[root@vision cryosparc2_worker]# module load cuda10.2/toolkit/10.2.89
[root@vision cryosparc2_worker]# which nvidia-smi
/cm/local/apps/cuda/libs/current/bin/nvidia-smi

I’ve never had to explicitly set the path on the compute node.

Is it possible that the worker process needs to be pointed explicitly to the nvidia-smi location? Is there a way to do this?

Thanks.

I wonder whether the cluster job’s failure and the earlier update error are related.
You reported earlier that cluster jobs were running, albeit with other problems (Discrete GPU usage - #3 by yodamoppet).
Was there another nvidia-smi executable present in your path then, or has the environment initialization changed since?

Hi.

We actually haven’t changed anything in the environment, other than our attempt to upgrade to v4 described above, which resulted in rolling back to v3.4 and our current state of not being able to run jobs. Prior to this, we were running jobs fine, albeit with occasional discrete GPU issues. Some jobs with a large cache were also having issues, for which you suggested clearing/resetting the cache, which we have done.

We have been testing cgroups on other systems but haven’t implemented them on this main cluster yet. That should solve the GPU issue once implemented on this system. In the meantime, it only affects some jobs; many run fine.

Our cluster stack is built on Bright Cluster Manager, so the location of nvidia-smi and environment variables have not changed.

How can we further diagnose the current issue? And, if it’s just an nvidia-smi issue in the slurm script created by cryosparc, how can we set the location of nvidia-smi that it uses? I’d really like to get our users up and running again.

CryoSPARC creates cluster scripts based on the cluster_script.sh template and cluster_info.json definitions that were uploaded to the CryoSPARC database with
cryosparcm cluster connect during configuration of the CryoSPARC instance.
It is likely that the bundled examples require significant customizations before upload.
For the sake of troubleshooting, one might assume for a moment that the missing nvidia-smi is the actual cause of the failures (it may well not be) and:

  1. keep records and backups of any configuration files in case you can later identify the true, underlying cause of the problems and want to roll back any unhelpful configuration changes
  2. as cryosparcuser (or whatever Linux account owns the CryoSPARC instance), use srun to get an interactive shell on a GPU node
  3. try
    /cm/local/apps/cuda/libs/current/bin/nvidia-smi or
    module load cuda10.2/toolkit/10.2.89 && nvidia-smi
  4. The outcome of these tests may suggest a modification of the (carefully backed up) cluster_script.sh file. Instead of overwriting the configuration of the existing cluster lane, one may temporarily connect an additional (testing) target under a different cluster_info.json "name" (see the sketch below).
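As a sketch only (the directory and lane name below are placeholders; the file names follow the CryoSPARC cluster integration guide), connecting such a testing target might look like:

    # run as the Linux user that owns the CryoSPARC master, in a directory holding
    # the edited copies of cluster_info.json and cluster_script.sh
    cd /home/cryosparcuser/cluster_config_test        # placeholder path
    # give the test target its own "name" in cluster_info.json, e.g. "vision-testing"
    cryosparcm cluster connect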

I’ve had the same issue on one system. It does not appear to be related to nvidia-smi at all*; it is in fact PyCUDA failing to build, which then cascades down through everything else. I’ve not yet found a fix or workaround for it. The painful thing is that the install/update process continues and actually reports success at the end, so unless you’re watching like a hawk it’s easy to miss.

It doesn’t happen on Ubuntu 20.04 or 22.04 for me, but does happen on Arch. I’ve been very slow to update other systems to 4.0.2 as a result. I should test out our CentOS 7 and CentOS Stream boxes.

I think it’s a g++ issue.

*Because PyCUDA builds and works correctly in a non-cryoSPARC Python environment, nvidia-smi works correctly in the shell, and RELION and other GPU applications are working without issue (and compile fresh without problems as well…)

I’ll tinker some more over the weekend if I get time.

Greetings.

Thanks for the troubleshooting info.

As cryosparc_user, simply running nvidia-smi doesn’t work:

[cryosparc_user@node03 bin]$ /cm/local/apps/cuda/libs/current/bin/nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

However, loading the module and running it works fine:

[cryosparc_user@node03 bin]$ module load cuda10.2/toolkit/10.2.89
[cryosparc_user@node03 bin]$ nvidia-smi
Fri Nov 4 08:11:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |etc…


Our cluster_script.sh template hasn’t had problems before, but I suppose I could add the "module load … " statement there to provide access to nvidia-smi.

Do you concur?

If so, how can I update the database after adding this?

rbs_sci mentioned PyCUDA build failure. We had this problem with 4.x, which is why we have rolled back to 3.4. I believe pycuda built correctly with 3.4, but is there a way to verify this?

Thanks so much.

Yes. On its own line, just above the nvidia-smi test. For example:
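    module load cuda10.2/toolkit/10.2.89   # makes nvidia-smi available inside the job

    available_devs=""
    for devidx in $(seq 0 15); do
        if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
            if [[ -z "$available_devs" ]] ; then
                available_devs=$devidx
            else
                available_devs=$available_devs,$devidx
            fi
        fi
    done
    export CUDA_VISIBLE_DEVICES=$available_devs

(the loop is unchanged from your current template; only the module load line is new)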

You can update the database by re-running cryosparcm cluster connect with the edited files.
I recommend adding an additional test lane (instead of replacing the existing one; see the guide) and would like to reiterate the earlier suggestion to keep backups of your configuration files.

May I ask:

  1. What was the full command used for the installation of or update to version “4.x”?
  2. On which host (worker or master) was that command executed?
  3. What distribution and version of Linux is running on that machine?

There is. Please see instructions for running a test workflow.
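A more direct check of the worker's pycuda build (a sketch; this assumes cryosparcw env prints the worker environment, as in recent CryoSPARC versions, and that the worker path is as on your system) would be:

    eval $(/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw env)
    python -c "import pycuda.driver as drv; drv.init(); print('driver', drv.get_version(), 'GPUs', drv.Device.count())"

If the module imports, initializes, and reports your GPUs, pycuda was built and linked correctly.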

I added the testing lane (vision-testing) and included the “module load” statement for nvidia-smi in the slurm script template for that lane.

I then asked the same user to run a test job. This time err.txt and out.txt are empty, but the job still appears to hang:

Here is the user provided output from the job:

License is valid.

Launching job on lane vision-testing target vision-testing ...

Launching job on cluster vision-testing

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J204 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
##     Note: the code will use this many GPUs starting from dev id 0
##     the cluster scheduler or this script have the responsibility
##     of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##     using the correct cluster-allocated GPUs.
## 24.0 - the amount of RAM needed in GB
## /tank/colemanlab/jcoleman/cryosparc/P5/J204 - absolute path to the job directory
## /tank/colemanlab/jcoleman/cryosparc/P5 - absolute path to the project dir
## /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log - absolute path to the log file for the job
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P120 --job J204 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P120 - uid of the project
## J204 - uid of the job
## coleman - name of the user that created the job (may contain spaces)
## coleman1@pitt.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P120_J204
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p defq
#SBATCH --mem=24000MB
#SBATCH -o /tank/colemanlab/jcoleman/cryosparc/P5/J204/out.txt
#SBATCH -e /tank/colemanlab/jcoleman/cryosparc/P5/J204/err.txt
module load cuda10.2/toolkit/10.2.89
available_devs=""
for devidx in $(seq 0 15); do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J204 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log 2>&1
==========================================================================
==========================================================================

-------- Submission command: sbatch /tank/colemanlab/jcoleman/cryosparc/P5/J204/queue_sub_script.sh

-------- Cluster Job ID: 23853

-------- Queued on cluster at 2022-11-04 10:07:41.330035

-------- Job status at 2022-11-04 10:07:41.350174
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             23853      defq cryospar cryospar PD       0:00      1 (None)

[CPU: 68.5 MB] Project P120 Job J204 Started

[CPU: 68.5 MB] Master running v3.4.0, worker running v3.4.0

[CPU: 68.8 MB] Working in directory: /tank/colemanlab/jcoleman/cryosparc/P5/J204

[CPU: 68.8 MB] Running on lane vision-testing

[CPU: 68.8 MB] Resources allocated:

[CPU: 68.8 MB] Worker: vision-testing

[CPU: 68.8 MB] CPU : [0, 1, 2, 3]

[CPU: 68.8 MB] GPU : [0]

[CPU: 68.8 MB] RAM : [0, 1, 2]

[CPU: 68.8 MB] SSD : True

[CPU: 68.8 MB] --------------------------------------------------------------

[CPU: 68.8 MB] Importing job module for job type nonuniform_refine_new...

How can we troubleshoot this further?

Please can you post the content of
/tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log.

Sure thing. Here it is:

[root@vision ~]# cat /tank/colemanlab/jcoleman/cryosparc/P5/J204/job.log


================= CRYOSPARCW =======  2022-11-04 10:07:43.084647  =========
Project P120 Job J204
Master vision.structbio.pitt.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 350310
MAIN PID 350310
refine.newrun cryosparc_compute.jobs.jobregister
/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py:66: UserWarning: Failed to import the CUDA driver interface, with an error message indicating that the version of your CUDA header does not match the version of your CUDA driver.
  warn("Failed to import the CUDA driver interface, with an error "
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_worker/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1961, in get_gpu_info
    import pycuda.driver as cudrv
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: /opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/_driver.cpython-37m-x86_64-linux-gnu.so: undefined symbol: cuDevicePrimaryCtxRelease_v2
libgcc_s.so.1 must be installed for pthread_cancel to work
/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw: line 120: 350309 Aborted                 python -c "import cryosparc_compute.run as run; run.run()" "$@"

After a bit of tinkering, I was able to solve this by forcing pycuda to rebuild. We are now up and running again.

I had previously forced a reinstallation of the worker version 3.4.0, and this showed pycuda as building cleanly with no errors, so I’m not quite sure why this was necessary.
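For anyone who hits this later, the rebuild was along these lines (a sketch, not an exact transcript; the CUDA path and pycuda version are placeholders, and cryosparcw newcuda should recompile pycuda against the given toolkit):

    # Option A: re-point the worker at the CUDA toolkit, which recompiles pycuda
    /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw newcuda /path/to/cuda-10.2

    # Option B: rebuild pycuda by hand inside the worker's Python environment
    eval $(/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw env)
    pip uninstall -y pycuda
    pip install --no-cache-dir --no-binary :all: pycuda==2020.1   # example version; match what your worker bundles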

We appreciate the guidance from wtempel and rbs_sci. All of this discussion pointed us in the right direction for solving this problem.


Update from me also. With the 4.0.3 update, I wondered whether there was a minor change that might allow version 4 to install. Nope, it failed at the same point again: pycuda 2020.

The Master and Worker installed fine except for pycuda. I tried different versions of CUDA and different versions of gcc, and all failed the same way. I eval’d into the worker environment and installed pycuda 2022, connected the worker node, and the T20S workflow goes all the way through without issue (except for submitting every job as requiring an SSD, even though --nossd was set during install), and an apoferritin test set also runs through to 1.97 Angstrom without issue (which is where it tends to stop in cryoSPARC). I’m currently running a lower-symmetry example from a different ’scope through.
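Roughly, the pycuda 2022 workaround was (a sketch; the version and path are examples, and this is not an officially supported configuration):

    eval $(/path/to/cryosparc_worker/bin/cryosparcw env)   # enter the worker environment
    pip install --no-cache-dir "pycuda>=2022.1"            # newer than the bundled 2020 release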

The errors thrown by pycuda are not very helpful, particularly since the 2022 release builds fine but the 2020 release bundled with cryoSPARC doesn’t.

That said, I’m not really happy with running a version of pycuda which cryoSPARC wasn’t designed for… even though everything appears to work. That system was only running Arch because it was some brand-new hardware which at the time would not boot on Ubuntu or CentOS. It’s time I reinstalled it.

Thank you for the update and the suggestion about T20S workflow particle caching.
Please can you confirm that you observed the pycuda installation error on Arch, but not on Ubuntu or CentOS?

Greetings.

I think this question was directed at rbs_sci, but for the sake of thread completeness, our system is CentOS 7.

Yes, I confirm a pycuda install error on Arch but not on Ubuntu 20.04 or 22.04.

I can copy the cryoSPARC installs somewhere safe and run through a full install again for both Ubuntu 22.04 and the current Arch box and save the logs if that would help?

Thank you for your kind offer. In our tests, installation on Ubuntu-22.04 worked, and we would be interested in learning about errors experienced by users on Ubuntu-{18,20,22}.04 (as well as RHEL family distributions).
In case you are planning a new installation or re-installation on Arch Linux anyway, a complete log of the installation commands, output and errors might be useful to us. We ask for your understanding, however, that support of our software on Arch Linux is not a priority for us right now.

@wtempel,

Aye, I know Arch isn’t officially supported.

When I get a chance, I’ll bundle up all install input and output on Arch.