Hello, I am trying to get my test jobs working on a PBS cluster. When I run a test job, it launches but stays in the "launched" state. In the event log, the job is first queued on the cluster and then seems to run, before eventually failing with the same exit code (35) every time:
[2025-06-23 15:52:59.98]
License is valid.
[2025-06-23 15:52:59.98]
Launching job on lane Polaris target Polaris ...
[2025-06-23 15:53:00.00]
Launching job on cluster Polaris
[2025-06-23 15:53:00.00]
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd /lus/eagle/projects/FoundEpidem/aravi
qsub connect_workers.pbs
==========================================================================
==========================================================================
[2025-06-23 15:53:00.00]
-------- Submission command:
qsub /lus/eagle/projects/FoundEpidem/aravi/connect_workers.pbs
[2025-06-23 15:53:00.32]
-------- Cluster Job ID:
5236869.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
[2025-06-23 15:53:00.33]
-------- Queued on cluster at 2025-06-23 20:53:00.330924
[2025-06-23 15:53:01.29]
Cluster job status update for P1 J49 failed with exit code 35 (63 status update request retries)
qstat: 5236869.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov Job has finished, use -x or -H to obtain historical job information
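For comparison, my understanding of the CryoSPARC cluster documentation is that `qsub_cmd_tpl` in `cluster_info.json` is normally `qsub {{ script_path_abs }}`, so that the rendered template itself is submitted, and that the template ends by running `{{ run_cmd }}`. My setup instead submits `connect_workers.pbs`, and the template itself calls `qsub` again, so I suspect CryoSPARC ends up polling a job that is not the one running J49. Below is an untested sketch of the template I am considering. It keeps my existing Polaris directives and module line and only replaces the nested `qsub` with `{{ run_cmd }}`. The `{{ ... }}` placeholders are CryoSPARC template variables; everything else is an assumption, not a known-good configuration:

```shell
#!/bin/bash
# Sketch of a CryoSPARC cluster_script.sh template for Polaris (untested).
# CryoSPARC renders the {{ ... }} placeholders before submission.
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug

# Same environment setup as in my current template
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0

# Run the CryoSPARC job itself instead of submitting another PBS script,
# so the job ID returned by qsub is the one CryoSPARC keeps polling.
{{ run_cmd }}
```

If this is the right direction, `qsub_cmd_tpl` would also need to submit `{{ script_path_abs }}` rather than `connect_workers.pbs`.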
Any help resolving this error would be much appreciated. Please let me know if I need to provide more information.