Hi,
I am a PhD student trying to get CryoSPARC 4.2.1 running on our cluster. However, when I submit a job to the cluster, I receive the error: Invalid job id specified
We have already updated the GPU drivers on both the master and the worker node.
I would be grateful for any advice on what I should do and where this error comes from. I have included the Slurm submission script below, in case I did something wrong there (I removed the exact paths, since my institute was a bit worried about cluster security). I don't know how to proceed with troubleshooting. Thanks, Lukas
GPU Master:
NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4
GPU Worker:
NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4
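
Since driver 470.182.03 supports CUDA up to 11.4, the cuda/11.2.2 module loaded in the script below should be within range. As a minimal sanity check (assuming nvcc comes from the loaded module), I can run this on a node:

module load cuda/11.2.2
nvcc --version    # toolkit version provided by the module
nvidia-smi --query-gpu=index,name,driver_version --format=csv    # driver as seen on the node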
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name cryosparc_P1_J10
#SBATCH -n 6
#SBATCH --gres=gpu:1
#####SBATCH --mem=128000MB
#SBATCH --mem-per-cpu=11G
#SBATCH -o //output_P1_J10.txt
#SBATCH -e //cryosparc_slurm_outputs/error_P1_J10.txt
#Define the "gpu" partition for GPU-accelerated jobs
#SBATCH --partition=gpu
#
#Define the GPU architecture (GTX980 in the example, other options are GTX1080Ti, K40)
######SBATCH --constraint=GTX1080Ti
######SBATCH --constraint=buster
#SBATCH --time=96:00:00
module load cuda/11.2.2
module load tensorflow
nvidia-smi
mkdir -p /ssdpool/
# Build a comma-separated list of GPU indices that currently have no
# compute processes running on them (e.g. "0,2,3")
available_devs=""
for devidx in $(seq 0 15); do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]]; then
        if [[ -z "$available_devs" ]]; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
# I commented out the line below on advice
#export CUDA_VISIBLE_DEVICES=$available_devs
echo $available_devs
echo $CUDA_HOME
echo "$(hostname)"
echo $SLURM_TMPDIR
/usr/bin/nvidia-smi
module list
export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"
/servergrp/cryodata/cryosparc_v4.0.0_gpuMaster_gpuWorker_PORT_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw run --project P1 --job J10 --master_hostname --master_command_core_port > /servergrp/cryodata/cryosparc/CS-jesse-training-proteasome/J10/job.log 2>&1
-------- Submission command: sbatch /servergrp/cryodata/cryosparc/CS-jesse-training-proteasome/J10/queue_sub_script.sh
-------- Cluster Job ID: 4309216
-------- Queued on cluster at 2023-06-20 17:08:33.848970
Cluster job status update for P1 J10 failed with exit code 1 (6463 retries)
slurm_load_jobs error: Invalid job id specified
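
For what it's worth, I can reproduce what I assume is the same status query by hand (I am guessing CryoSPARC polls Slurm for the job state; the job ID is the one from the log above, and by the time I query it the job is no longer in Slurm's active queue):

# Querying a job ID that Slurm no longer tracks returns the same error as above
squeue -j 4309216
# slurm_load_jobs error: Invalid job id specified

sacct -j 4309216 (if accounting is enabled on the cluster) can still show the finished job, so I wonder whether the status check fails simply because the job has already finished or died and been purged from squeue, rather than because of how it was submitted.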