CryoSPARC Test Jobs Stuck in Launch State

CryoSPARC instance information

CryoSPARC is installed on a PBS cluster.

Output of cryosparcm status:

We are aware of the license status issue; it seems that verification fails only while CryoSPARC is running.

Output of uname -a && free -g:

nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:A3:00.0 Off |                   On |
| N/A   38C    P0              47W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Issue:

The issue is with running test jobs on a PBS cluster: when we start a test job, it gets stuck in the launch state, and the event log shows that it retries indefinitely. We suspect that the jobs are being sent to the login node’s hostname rather than the correct main cluster hostname. Is there a way to change the value of variables such as Submit_Host, server, and/or PBS_O_HOST?
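A related check we could run is whether the master address passed to the worker (the --master_hostname and --master_command_core_port values in the submission script in the event log) accepts a TCP connection from the node where the job would run. The following is only a minimal Python sketch; the helper name can_reach is our own, and whether compute nodes are allowed to reach that address is an assumption to verify, not something the log confirms.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not part of CryoSPARC or PBS.
    """
    try:
        # create_connection resolves the hostname and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Values copied from the submission script; run this on a compute node.
    print(can_reach("polaris.alcf.anl.gov", 39002))
```

A False result from a compute node would only show that this host and port could not be reached from there; it would not by itself explain why the job stays in the launch state.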

Here is the event log output:



License is valid.

Launching job on lane Polaris target Polaris ...

Launching job on cluster Polaris


====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug
 
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd /lus/eagle/projects/FoundEpidem/aravi/CS-testing/J3
/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J3 --master_hostname polaris.alcf.anl.gov --master_command_core_port 39002 > /lus/eagle/projects/FoundEpidem/aravi/CS-testing/J3/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
qsub /lus/eagle/projects/FoundEpidem/aravi/CS-testing/J3/queue_sub_script.sh

-------- Cluster Job ID: 
5131771.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov

-------- Queued on cluster at 2025-06-18 16:25:46.632431

-------- Cluster job status at 2025-06-18 16:25:47.234604 (0 retries)
Job Id: 5131771.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
    Job_Name = cryosparc_job
    Job_Owner = aravi@polaris-login-01.hsn.cm.polaris.alcf.anl.gov
    job_state = Q
    queue = debug
    server = polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
    Account_Name = FoundEpidem
    Checkpoint = u
    ctime = Wed Jun 18 16:25:46 2025
    Error_Path = polaris-login-01.hsn.cm.polaris.alcf.anl.gov:/lus/eagle/projec
	ts/FoundEpidem/aravi/cryosparc/cryosparc_master/cryosparc_job.e5131771
    Hold_Types = n
    Join_Path = n
    Keep_Files = doe
    Mail_Points = a
    Mail_Users = aravi@anl.gov
    mtime = Wed Jun 18 16:25:46 2025
    Output_Path = polaris-login-01.hsn.cm.polaris.alcf.anl.gov:/lus/eagle/proje
	cts/FoundEpidem/aravi/cryosparc/cryosparc_master/cryosparc_job.o5131771
	
    Priority = 0
    qtime = Wed Jun 18 16:25:46 2025
    Rerunable = False
    Resource_List.allow_account_check_failure = True
    Resource_List.allow_negative_allocation = True
    Resource_List.award_category = INCITE
    Resource_List.award_type = INCITE-2025
    Resource_List.backfill_factor = 84600
    Resource_List.backfill_max = 50
    Resource_List.base_score = 51
    Resource_List.burn_ratio = 0.2610
    Resource_List.current_allocation = 11173063680
    Resource_List.eagle_fs = True
    Resource_List.enable_backfill = 0
    Resource_List.enable_fifo = 1
    Resource_List.enable_wfp = 0
    Resource_List.fifo_factor = 1800
    Resource_List.filesystems = home:eagle
    Resource_List.home_fs = True
    Resource_List.mig_avail = True
    Resource_List.ncpus = 64
    Resource_List.ni_resource = polaris
    Resource_List.nodect = 1
    Resource_List.overburn = False
    Resource_List.place = free
    Resource_List.preempt_targets = NONE
    Resource_List.project_priority = 25
    Resource_List.route_backfill = False
    Resource_List.score_boost = 0
    Resource_List.select = 1:system=polaris
    Resource_List.start_xserver = False
    Resource_List.total_allocation = 15120000000
    Resource_List.total_cpus = 560
    Resource_List.walltime = 01:00:00
    Resource_List.wfp_factor = 100000
    substate = 10
    Variable_List = PBS_O_HOME=/home/aravi,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=aravi,
	PBS_O_PATH=/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_m
	aster/deps/external/mongodb/bin:/lus/eagle/projects/FoundEpidem/aravi/c
	ryosparc/cryosparc_master/bin:/lus/eagle/projects/FoundEpidem/aravi/cry
	osparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin:/lu
	s/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_master/deps/anac
	onda/condabin:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc
	_master/deps/external/mongodb/bin:/lus/eagle/projects/FoundEpidem/aravi
	/nodejs/bin:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_m
	aster/deps/external/mongodb/bin:/lus/eagle/projects/FoundEpidem/aravi/c
	ryosparc/cryosparc_worker/bin:/lus/eagle/projects/FoundEpidem/aravi/cry
	osparc/cryosparc_master/bin:/home/aravi/anaconda3/bin:/lus/eagle/projec
	ts/FoundEpidem/aravi/cryosparc/cryosparc_master/deps/external/mongodb/b
	in:/lus/eagle/projects/FoundEpidem/aravi/nodejs/bin:/lus/eagle/projects
	/FoundEpidem/aravi/cryosparc/cryosparc_master/deps/external/mongodb/bin
	:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin:/
	lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_master/bin:/ho
	me/aravi/anaconda3/bin:/soft/xalt/3.0.2-202408282050/bin:/soft/perftool
	s/darshan/darshan-3.4.4/bin:/opt/cray/pe/perftools/23.12.0/bin:/opt/cra
	y/pe/papi/7.0.1.2/bin:/opt/cray/libfabric/1.15.2.0/bin:/opt/cray/pals/1
	.3.4/bin:/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/bin:/opt/cray/pe/mpi
	ch/8.1.28/bin:/opt/cray/pe/craype/2.7.30/bin:/opt/nvidia/hpc_sdk/Linux_
	x86_64/23.9/compilers/extras/qd/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/23
	.9/compilers/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/bin:/opt/cl
	mgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/local/bin:/usr/
	bin:/bin:/opt/c3/bin:/dbhome/db2cat/sqllib/bin:/dbhome/db2cat/sqllib/ad
	m:/dbhome/db2cat/sqllib/misc:/dbhome/db2cat/sqllib/gskit/bin:/usr/lib/m
	it/bin:/usr/lib/mit/sbin:/opt/pbs/bin:/sbin:/opt/cray/pe/bin:/home/arav
	i/.local/bin:/home/aravi/bin,PBS_O_MAIL=/var/spool/mail/aravi,
	PBS_O_SHELL=/bin/bash,PBS_O_INTERACTIVE_AUTH_METHOD=resvport,
	PBS_O_HOST=polaris-login-01.hsn.cm.polaris.alcf.anl.gov,
	PBS_O_WORKDIR=/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryospar
	c_master,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=debug
    etime = Wed Jun 18 16:25:46 2025
    umask = 22
    eligible_time = 00:00:01
    Submit_arguments = /lus/eagle/projects/FoundEpidem/aravi/CS-testing/J3/queu
	e_sub_script.sh
    project = FoundEpidem
    Submit_Host = polaris-login-01.hsn.cm.polaris.alcf.anl.gov
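
To make the host-related fields in the status output above easier to compare, here is a minimal Python sketch that unfolds the tab-indented continuation lines of qstat -f style output and extracts job_state, server, Submit_Host, and PBS_O_HOST. The function names are our own, and the sample is an abridged copy of the log above. The parser assumes each attribute is written as "name = value" and that commas inside Variable_List values are not escaped, which may not hold for every job.

```python
def parse_qstat_full(text):
    """Parse `qstat -f` style output for a single job into a dict."""
    attrs = {}
    key = None
    for raw in text.splitlines():
        if raw.startswith("\t") and key is not None:
            # Continuation line: long values are folded and indented with a tab.
            attrs[key] += raw.strip()
            continue
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            key = "Job Id"
            attrs[key] = line.split(":", 1)[1].strip()
        elif " = " in line:
            key, value = line.split(" = ", 1)
            attrs[key] = value
    return attrs


def parse_variable_list(value):
    """Split a Variable_List value into a dict (assumes no escaped commas)."""
    pairs = {}
    for item in value.split(","):
        if "=" in item:
            name, val = item.split("=", 1)
            pairs[name] = val
    return pairs


# Abridged excerpt of the status output shown above.
SAMPLE = (
    "Job Id: 5131771.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov\n"
    "    job_state = Q\n"
    "    server = polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov\n"
    "    Variable_List = PBS_O_HOME=/home/aravi,PBS_O_LANG=en_US.UTF-8,\n"
    "\tPBS_O_HOST=polaris-login-01.hsn.cm.polaris.alcf.anl.gov,\n"
    "\tPBS_O_WORKDIR=/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryospar\n"
    "\tc_master,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=debug\n"
    "    Submit_Host = polaris-login-01.hsn.cm.polaris.alcf.anl.gov\n"
)

if __name__ == "__main__":
    attrs = parse_qstat_full(SAMPLE)
    env = parse_variable_list(attrs["Variable_List"])
    for name in ("job_state", "server", "Submit_Host"):
        print(name, "=", attrs[name])
    print("PBS_O_HOST =", env["PBS_O_HOST"])
```

In this snapshot, Submit_Host and PBS_O_HOST both name the login node the job was submitted from, server names the PBS server, and job_state is Q, which PBS reports for a job that is queued and has not yet started running.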