Greetings.
The files are as follows:
P5/J201/err.txt
/cm/local/apps/slurm/var/spool/job23828/slurm_script: line 35: nvidia-smi: command not found
(the line above repeats 16 times in err.txt, once for each device index 0-15 in the GPU-selection loop)
slurmstepd: error: *** JOB 23828 ON node03 CANCELLED AT 2022-11-02T11:59:03 ***
P5/J201/out.txt
File exists but is empty.
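The error comes from the nvidia-smi probe inside the GPU-selection loop (line 35 of the generated script, quoted further down), which is why it repeats once per index of the 0-15 loop. And since the command substitution comes back empty when nvidia-smi fails, every index looks "free" and CUDA_VISIBLE_DEVICES ends up set to all sixteen ids regardless of what the node actually has. A guard along these lines in our submission script template would at least make the failure explicit (just a sketch; "module load cuda" is a guess for our Bright-managed nodes, not something cryoSPARC prescribes):

# Abort (or load the GPU driver environment) before the device-selection loop
# if nvidia-smi is not on PATH on the compute node.
if ! command -v nvidia-smi >/dev/null 2>&1 ; then
    module load cuda 2>/dev/null || true    # hypothetical module name for our site
fi
if ! command -v nvidia-smi >/dev/null 2>&1 ; then
    echo "nvidia-smi not found on $(hostname); refusing to guess GPU ids" >&2
    exit 1
fi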
So, we tried restarting the cryoSPARC master process and reconnecting the worker (with --override), and now the job gets stuck here without entering the queue. I can confirm that the SLURM queue is working fine for other jobs, though.
License is valid.
Launching job on lane vision target vision …
Launching job on cluster vision
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log 2>&1   - the complete command string to run the job
## 4          - the number of CPUs needed
## 1          - the number of GPUs needed.
##              Note: the code will use this many GPUs starting from dev id 0;
##              the cluster scheduler or this script have the responsibility
##              of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##              using the correct cluster-allocated GPUs.
## 24.0       - the amount of RAM needed in GB
## /tank/colemanlab/jcoleman/cryosparc/P5/J203          - absolute path to the job directory
## /tank/colemanlab/jcoleman/cryosparc/P5               - absolute path to the project dir
## /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log  - absolute path to the log file for the job
## /opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002   - arguments to be passed to cryosparcw run
## P120       - uid of the project
## J203       - uid of the job
## coleman    - name of the user that created the job (may contain spaces)
## coleman1@pitt.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_P120_J203
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p defq
#SBATCH --mem=24000MB
#SBATCH -o /tank/colemanlab/jcoleman/cryosparc/P5/J203/out.txt
#SBATCH -e /tank/colemanlab/jcoleman/cryosparc/P5/J203/err.txt

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P120 --job J203 --master_hostname vision.structbio.pitt.edu --master_command_core_port 39002 > /tank/colemanlab/jcoleman/cryosparc/P5/J203/job.log 2>&1

==========================================================================
==========================================================================
-------- Submission command: sbatch /tank/colemanlab/jcoleman/cryosparc/P5/J203/queue_sub_script.sh
-------- Cluster Job ID: 23837
-------- Queued on cluster at 2022-11-03 09:42:33.132614
-------- Job status at 2022-11-03 09:42:33.271697
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             23837      defq cryospar cryospar CF       0:00      1 node02
[CPU: 69.6 MB] Project P120 Job J203 Started
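In case it helps: the job sits in CF (configuring) on node02. The sort of checks we can run from the head node look like this (partition and node names taken from the log above; only a sketch of the diagnostics, not output we already have):

# Why does SLURM think the job is still configuring/pending?
scontrol show job 23837

# Does nvidia-smi resolve inside a job step on the same partition and node?
srun -p defq --gres=gpu:1 -w node02 bash -lc 'command -v nvidia-smi && nvidia-smi -L'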