Topaz extract issue

[guodongxie@beagle3-login4 master]$ cryosparcm eventlog P1 J605 | tail -n 50
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=7:00:00
#SBATCH --mem=48G
#SBATCH --exclude=beagle3-0028
export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"
srun /software/cryosparc_worker/bin/cryosparcw run --project P1 --job J605 --master_hostname beagle3-login3.rcc.local --master_command_core_port 39322 > /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/job.log 2>&1 
==========================================================================
==========================================================================
[Mon, 30 Jun 2025 19:09:28 GMT]  -------- Submission command: 
sbatch /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/queue_sub_script.sh
[Mon, 30 Jun 2025 19:09:28 GMT]  -------- Cluster Job ID: 
32428230
[Mon, 30 Jun 2025 19:09:28 GMT]  -------- Queued on cluster at 2025-06-30 14:09:28.687181
[Mon, 30 Jun 2025 19:09:28 GMT]  -------- Cluster job status at 2025-06-30 14:09:59.015101 (3 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          32428230   beagle3 cryospar guodongx  R       0:17      1 beagle3-0039
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] Job J605 Started
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] Master running v4.6.0, worker running v4.6.0
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] Working in directory: /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] Running on lane beagle3-exclude-0028
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] Resources allocated:
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB]   Worker:  beagle3-exclude-0028
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB]   CPU   :  [0, 1, 2, 3, 4, 5, 6, 7]
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB]   GPU   :  [0]
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB]   RAM   :  [0]
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB]   SSD   :  False
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] --------------------------------------------------------------
[Mon, 30 Jun 2025 19:10:00 GMT] [CPU RAM used: 87 MB] Importing job module for job type topaz_extract...
[Mon, 30 Jun 2025 19:10:13 GMT] [CPU RAM used: 226 MB] Job ready to run
[Mon, 30 Jun 2025 19:10:13 GMT] [CPU RAM used: 226 MB] ***************************************************************
[Mon, 30 Jun 2025 19:10:13 GMT] [CPU RAM used: 226 MB] Topaz is a particle detection tool created by Tristan Bepler and Alex J. Noble.
Citations:
- Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153-1160 (2019) doi:10.1038/s41592-019-0575-8
- Bepler, T., Noble, A.J., Berger, B. Topaz-Denoise: general deep denoising models for cryoEM. bioRxiv 838920 (2019) doi: https://doi.org/10.1101/838920

Structura Biotechnology Inc. and cryoSPARC do not license Topaz nor distribute Topaz binaries. Please ensure you have your own copy of Topaz licensed and installed under the terms of its GNU General Public License v3.0, available for review at: https://github.com/tbepler/topaz/blob/master/LICENSE.
***************************************************************
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] Starting Topaz process using version 0.2.5a...
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] Using preprocessed micrographs from  J603/preprocessed
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] Found 627 processed micrograph(s) in /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J603/preprocessed
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] An additional 796 micrograph(s) require preprocessing. Results will be saved to /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J603/preprocessed
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] --------------------------------------------------------------
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] Starting preprocessing...
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] Starting micrograph preprocessing by running command /project2/eozkan/sjmachera/anaconda/topaz/bin/topaz preprocess --scale 7 --niters 200 --num-workers 4 -o /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J603/preprocessed [796 MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]
[Mon, 30 Jun 2025 19:10:42 GMT] [CPU RAM used: 224 MB] Preprocessing over 2 processes...
[Tue, 01 Jul 2025 02:12:40 GMT]  **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
[Tue, 01 Jul 2025 02:12:40 GMT]  Job is unresponsive - no heartbeat received in 180 seconds.
[guodongxie@beagle3-login4 master]$ 
[guodongxie@beagle3-login4 master]$ cryosparcm cli "get_job('P1', 'J605', 'version', 'type', 'params_spec', 'started_at')"
{'_id': '6862e0cf53dbe65baa84dc46', 'params_spec': {'exec_path': {'value': '/project2/eozkan/sjmachera/anaconda/topaz/bin/topaz'}, 'par_diam': {'value': 150}}, 'project_uid': 'P1', 'started_at': 'Mon, 30 Jun 2025 19:10:00 GMT', 'type': 'topaz_extract', 'uid': 'J605', 'version': 'v4.6.0'}
[guodongxie@beagle3-login4 master]$

@guodong Please can you post the outputs of these commands:

cat /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/queue_sub_script.sh
cryosparcm joblog P1 J605 | tail -n 40

[guodongxie@beagle3-login3 master]$ cat /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/queue_sub_script.sh

#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J605
#SBATCH --partition=beagle3
#SBATCH --constraint=a40
#SBATCH --account=pi-eozkan
#SBATCH --output=/beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/job.log
#SBATCH --error=/beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/job.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=7:00:00
#SBATCH --mem=48G
#SBATCH --exclude=beagle3-0028
export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"
srun /software/cryosparc_worker/bin/cryosparcw run --project P1 --job J605 --master_hostname beagle3-login3.rcc.local --master_command_core_port 39322 > /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J605/job.log 2>&1

[guodongxie@beagle3-login3 master]$ cryosparcm joblog P1 J605 | tail -n 40

/beagle3/eozkan/cryosparc_guodongxie/master/cryosparc_tools/cryosparc/command.py:135: UserWarning: *** CommandClient: (http://beagle3-login3.rcc.local:39322/api) URL Error [Errno 111] Connection refused, attempt 1 of 3. Retrying in 30 seconds
  system = self._get_callable("system.describe")()

This error indicates a disruption of communication between the GPU node and the CryoSPARC master node.
Is a firewall blocking access to the relevant ports on the CryoSPARC master server from the compute node(s)?
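
For example, connectivity to the command_core port (39322, per the log above) could be checked directly from a compute node, assuming curl and nc are available there:

curl http://beagle3-login3.rcc.local:39322
nc -zv beagle3-login3.rcc.local 39322

A refused or timed-out connection from the GPU node would point at the network layer rather than at the job itself.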

A side note: the #SBATCH --output and #SBATCH --error settings above point at the same job.log file that the srun command already redirects to, which might cause slurm output (often useful for troubleshooting) to be overwritten when the job starts. Consider replacing these lines in the slurm script template with the options

#SBATCH --output={{ job_dir_abs }}/slurm-%j.out
#SBATCH --error={{ job_dir_abs }}/slurm-%j.err
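
If the lane was set up with cryosparcm cluster connect, one way to apply this change (a sketch; the lane name below is taken from the event log, and exact file locations may differ on this system) would be to dump the cluster configuration, edit the script template, and re-connect the lane:

cryosparcm cluster dump beagle3-exclude-0028   # writes cluster_info.json and cluster_script.sh to the current directory
# edit cluster_script.sh, replacing the --output/--error lines with the slurm-%j versions above
cryosparcm cluster connect                     # re-registers the lane from the edited files

The slurm-%j file names keep one output file per Slurm job ID, so nothing is clobbered by the srun redirect to job.log.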

I am the HPC support here. We do not have a firewall between GPU nodes and login nodes (where the master runs). If that were the case, wouldn’t all jobs have failed?

I ran a clone of the same job, and it worked fine without any changes to the system. Another user at our HPC has reported that jobs have been failing randomly. I am currently investigating that, and if I find a pattern, I can add that information here.
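
For example, to check whether the failures cluster on particular nodes or times, an sacct query along these lines might help (a sketch; the user name and start date are placeholders to adjust):

sacct -S 2025-06-25 -u guodongxie --state=FAILED,TIMEOUT,NODE_FAIL -o JobID,JobName%20,Partition,NodeList,State,ExitCode

If the unresponsive CryoSPARC jobs and the other user's failures keep landing on the same nodes, that would suggest a node-level rather than a CryoSPARC-level problem.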