Stuck in queue forever

Sure, it’s a set of scripts, but I hope this is helpful:

#!/bin/bash
#SBATCH -c 4
#SBATCH --mem=32g
#SBATCH --constraint=x2695
#SBATCH --partition=norm
#SBATCH --job-name=cryosparc_master
#SBATCH --signal=B:TERM@1200
#SBATCH --output=/home/%u/cryosparc_master.%j.out
#SBATCH --comment="cryosparc_master"
#SBATCH --time=10-00:00:00

cleanup() {
    ${CRYOSPARC_HOME}/libexec/shutdown.sh >> ${CRYOSPARC_BASE}/sbatch.log
    # if all goes well and smoothly, the job will exit with EC=0 and will be marked as COMPLETED
    exit 0
}
# Note: SIGKILL (9) cannot be trapped, so it is omitted here
trap cleanup 1 2 3 4 5 6 7 8 10 SIGINT SIGTERM

# Set up the environment
source ${CRYOSPARC_HOME}/libexec/enable_proxy.sh
source ${CRYOSPARC_HOME}/libexec/enable_waits.sh
source ${CRYOSPARC_BASE}/config.cnf

# Is cryosparc installed?
[[ -f ${CRYOSPARC_BASE}/cryosparc_master/bin/cryosparcm ]] || { echo "CryoSPARC not installed in ${CRYOSPARC_BASE}"; exit 1; }
[[ -f ${CRYOSPARC_BASE}/cryosparc_master/config.sh ]] || { echo "CryoSPARC not installed in ${CRYOSPARC_BASE}"; exit 1; }
[[ -w ${CRYOSPARC_BASE}/cryosparc_master/config.sh ]] || { echo "CryoSPARC config.sh not writable ${CRYOSPARC_BASE}"; exit 1; }

# Is cryosparc already running?
[[ -f ${CRYOSPARC_BASE}/cryosparc_master/run/supervisord.pid ]] && ps -p $(cat ${CRYOSPARC_BASE}/cryosparc_master/run/supervisord.pid) >& /dev/null && { echo "CryoSPARC already running!"; exit 1; }

# Start up the server
${CRYOSPARC_HOME}/libexec/startup.sh >> ${CRYOSPARC_BASE}/sbatch.log

# Establish tunnel from biowulf to this node
/usr/local/slurm/libexec/tunnelcommand_tcp.sh --cryosparc

# Wait until done
wait_until_done ${SLURM_JOB_ID}
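For context, `wait_until_done` comes from the site-specific `enable_waits.sh` sourced above, so its real implementation may differ. As a minimal sketch of what such a helper could look like (the use of `squeue` and the 60-second polling interval are assumptions, not the actual Biowulf code):

```shell
#!/bin/bash
# Sketch of a wait_until_done helper: block until the given Slurm job
# is no longer listed in the queue. The real helper may instead use
# scontrol, sacct, or an internal wait on a background process.
wait_until_done() {
    local jobid="$1"
    # squeue -h suppresses the header line, so empty output means the
    # job has left the queue (completed, failed, or cancelled).
    while [[ -n "$(squeue -h -j "${jobid}" 2>/dev/null)" ]]; do
        sleep 60
    done
}
```

Because the script traps TERM (sent 1200 s before the time limit via `--signal=B:TERM@1200`), the shell runs `cleanup` once this wait is interrupted, giving CryoSPARC a graceful shutdown window.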

@morganyang422 Please can you email us a compressed copy of the file

/gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/run/command_core.log

I will send you a direct message with the relevant email address.

@morganyang422 Please can you check if requesting a larger number of CPUs (8 or more) can help with this issue?

@wtempel Thanks for the suggestion. I tested with 16 CPUs and 32g memory, but it still didn’t solve the problem. The maximum CPU usage reported by our job monitoring tools is 2. Thanks for any further suggestions or comments.

Thanks for trying that. Another thing to try: what is the output of the command

cat /sys/kernel/mm/transparent_hugepage/enabled

? If the output shows

[always] madvise never

you may want to test whether disabling transparent_hugepage (applying the never setting) resolves the problem.
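For reference, the kernel marks the active transparent_hugepage mode with brackets in that file, so the current setting can be extracted with a small helper. The `thp_active` function name is just for illustration; the `/sys` path and the `echo never` approach are the standard kernel interface, but writing to it requires root:

```shell
# Extract the bracketed (active) value from a THP status line such as
# "[always] madvise never".
thp_active() {
    sed 's/.*\[\(.*\)\].*/\1/' <<< "$1"
}

# Usage on a live system:
#   thp_active "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
#
# To disable THP until the next reboot (requires root):
#   echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

Note that this change does not persist across reboots; making it permanent typically involves a boot parameter (`transparent_hugepage=never`) or an init-time script, which on a managed cluster is a question for the system administrators.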