Slurm_load_jobs error: Invalid job id specified for version 4.2.1

Hi,
I am a PhD student trying to get CryoSPARC 4.2.1 running on our cluster. However, when I try to submit a job to the cluster, I receive the error: slurm_load_jobs error: Invalid job id specified
We have already updated the GPU drivers on both the master and the worker node.
I would be grateful for any advice on what I should do and where this error arises from. I have also included my Slurm submission script below; maybe I did something wrong there (I removed the exact paths, since my institute was a bit worried about cluster security). I don't know how to proceed with troubleshooting and am happy for any advice.
Thanks,
Lukas

GPU Master:
NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4

GPU Worker:

NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4

====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name cryosparc_P1_J10
#SBATCH -n 6
#SBATCH --gres=gpu:1
#####SBATCH --mem=128000MB
#SBATCH --mem-per-cpu=11G
#SBATCH -o //output_P1_J10.txt
#SBATCH -e //cryosparc_slurm_outputs/error_P1_J10.txt
#Define the "gpu" partition for GPU-accelerated jobs
#SBATCH --partition=gpu
#
#Define the GPU architecture (GTX980 in the example, other options are GTX1080Ti, K40)
######SBATCH --constraint=GTX1080Ti
######SBATCH --constraint=buster
#SBATCH --time=96:00:00
module load cuda/11.2.2
module load tensorflow
nvidia-smi
mkdir -p /ssdpool/

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
#I commented out the line below on advice 
#export CUDA_VISIBLE_DEVICES=$available_devs
echo $available_devs
echo $CUDA_HOME
echo "$(hostname)"
echo $SLURM_TMPDIR

/usr/bin/nvidia-smi

module list

export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"

/servergrp/cryodata/cryosparc_v4.0.0_gpuMaster_gpuWorker_PORT_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw run --project P1 --job J10 --master_hostname  --master_command_core_port  > /servergrp/cryodata/cryosparc/CS-jesse-training-proteasome/J10/job.log 2>&1 

-------- Submission command: sbatch /servergrp/cryodata/cryosparc/CS-jesse-training-proteasome/J10/queue_sub_script.sh
-------- Cluster Job ID: 4309216
-------- Queued on cluster at 2023-06-20 17:08:33.848970
Cluster job status update for P1 J10 failed with exit code 1 (6463 retries)
slurm_load_jobs error: Invalid job id specified

Welcome to the forum @LUKASinScience.

The error could be caused by either:

  1. CryoSPARC incorrectly extracted 4309216 as the Slurm job id (by parsing the sbatch output), or
  2. 4309216 was the correct Slurm job id, but the configured time window during which Slurm would provide information on job 4309216 has expired.

You may want to confirm the actual cause of the "slurm_load_jobs error: Invalid job id specified" message in this specific case with your cluster support team.
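To narrow down which of the two it is, a few standard Slurm queries can be run from a login node. This is only a sketch, using the cluster job id 4309216 reported above; whether sacct has anything to report depends on your site's accounting setup.

scontrol show job 4309216        # fails with "Invalid job id specified" once the job has aged out of slurmctld memory
sacct -j 4309216 --format=JobID,JobName,State,ExitCode,Elapsed   # still works after completion if accounting is enabled
scontrol show config | grep -i MinJobAge    # how long completed jobs remain queryable via scontrol/squeue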

The slurm_load_jobs error may conceal underlying problems with the script template. Some potentially problematic items in the script (a few quick checks are sketched after the list):

#SBATCH -o //output_P1_J10.txt # directory may not be shared with the GPU node, or not writeable there
#SBATCH -e //cryosparc_slurm_outputs/error_P1_J10.txt # directory may not be shared with the GPU node, or not writeable there
module load cuda/11.2.2 # may conflict with $CRYOSPARC_CUDA_PATH. Confirm that $CRYOSPARC_CUDA_PATH is readable on GPU nodes.
module load tensorflow # may conflict with 3DFlex dependencies
mkdir -p /ssdpool/ # may fail with a permission error
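
For example, a quick interactive test on a GPU node (e.g. via srun --partition=gpu --gres=gpu:1 --pty bash, if your site permits interactive jobs) could confirm these points. The paths below are just the redacted placeholders from the script above, not real locations:

touch //output_P1_J10.txt && echo "stdout directory writeable"
touch //cryosparc_slurm_outputs/error_P1_J10.txt && echo "stderr directory writeable"
ls "$CRYOSPARC_CUDA_PATH" >/dev/null && echo "CUDA path readable"   # value as set in cryosparc_worker/config.sh
mkdir -p /ssdpool/ && echo "scratch directory writeable"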

There are good reasons for

#I commented out the line below on advice 
#export CUDA_VISIBLE_DEVICES=$available_devs

in that export CUDA_VISIBLE_DEVICES=$available_devs does not, in and of itself, robustly prevent oversubscription of GPU resources. You may want to confirm with the admins of your cluster that it is configured to restrict jobs from accessing GPUs that are in use by other jobs (one way to check this is sketched below).
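On Slurm clusters that isolation is usually enforced through cgroups. A hedged way to check (config file locations vary between sites, so the path below is an assumption):

scontrol show config | grep -i TaskPlugin          # look for task/cgroup
grep -i ConstrainDevices /etc/slurm/cgroup.conf    # ConstrainDevices=yes confines a job to its allocated GPUs
# inside a job submitted with --gres=gpu:1, Slurm typically exports CUDA_VISIBLE_DEVICES on its own
echo "$CUDA_VISIBLE_DEVICES"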
With

#export CUDA_VISIBLE_DEVICES=$available_devs

commented out, the lines

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done

may be obsolete.
Please take this advice with a grain of salt: I do not know the configuration of your specific cluster and, ultimately, cluster_info.json and cluster_script.sh must be compatible with that cluster configuration.
You may want to forward a link to our guide to your cluster support team and seek their input on suitable cluster_info.json and cluster_script.sh specs.
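For orientation only, a minimal Slurm cluster_script.sh along the lines of the example in the guide might look like the sketch below. The {{ ... }} fields are the standard template variables CryoSPARC fills in at submission time; the partition name and time limit are assumptions your support team would need to adjust for your cluster:

#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH --partition=gpu
#SBATCH --time=96:00:00
#SBATCH -o {{ job_dir_abs }}/slurm-%j.out
#SBATCH -e {{ job_dir_abs }}/slurm-%j.err

{{ run_cmd }}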


Thank you @wtempel / CryoSPARC team, we were able to solve the issue thanks to your advice!