Jobs Fail with Exit Code 35

aravi · June 23, 2025, 9:07pm

Hello, I am trying to get my test jobs working on a PBS cluster system. When I run a test job, it launches but stays in the launched state. In the event log, the job is first queued to the cluster, then appears to run before eventually failing with a consistent exit code:

[2025-06-23 15:52:59.98]

License is valid.
[2025-06-23 15:52:59.98]

Launching job on lane Polaris target Polaris ...
[2025-06-23 15:53:00.00]

Launching job on cluster Polaris
[2025-06-23 15:53:00.00]


====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug
 
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd /lus/eagle/projects/FoundEpidem/aravi
qsub connect_workers.pbs
==========================================================================
==========================================================================
[2025-06-23 15:53:00.00]

-------- Submission command: 
qsub /lus/eagle/projects/FoundEpidem/aravi/connect_workers.pbs
[2025-06-23 15:53:00.32]

-------- Cluster Job ID: 
5236869.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
[2025-06-23 15:53:00.33]

-------- Queued on cluster at 2025-06-23 20:53:00.330924
[2025-06-23 15:53:01.29]

Cluster job status update for P1 J49 failed with exit code 35 (63 status update request retries)
qstat: 5236869.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov Job has finished, use -x or -H to obtain historical job information
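
The qstat message suggests that the scheduler no longer treats the job as active, so the reported exit code 35 may come from the status query rather than from the job itself (this is an inference from the log, not something the log states). As a minimal sketch, assuming a PBS version where qstat -x -f prints an Exit_status field for finished jobs, the hypothetical helper pbs_exit_status below extracts that field from qstat output read on standard input:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the Exit_status field found in
# `qstat -f` style output supplied on stdin. Returns non-zero if the field
# is absent (for example, while the job is still queued or running).
pbs_exit_status() {
    awk -F' = ' '
        /^[[:space:]]*Exit_status = / { print $2; found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example (run on the cluster; the job ID is the one from the log above):
#   qstat -x -f 5236869.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov | pbs_exit_status
```

If job history is still available, this should print the exit status that PBS recorded for the job itself, which can then be compared with the code shown in the event log.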

Any help to resolve this error would be much appreciated, please let me know if I need to provide more information.

wtempel · June 23, 2025, 9:50pm

Could you please run these commands and post their outputs:

cd $(mktemp -d)
cryosparcm cluster dump Polaris
cat cluster_info.json
cat cluster_script.sh

aravi · June 24, 2025, 3:36pm

Thank you for the quick reply! Here are the outputs:

aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi> cd $(mktemp -d)
aravi@polaris-login-02:/tmp/tmp.DXd0DbbGoj> cryosparcm cluster dump Polaris
Polaris
Dumping configuration and script for cluster Polaris
Done.
aravi@polaris-login-02:/tmp/tmp.DXd0DbbGoj> cat cluster_info.json
{
    "name": "Polaris",
    "title": "Polaris",
    "worker_bin_path": "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "qsub /lus/eagle/projects/FoundEpidem/aravi/connect_workers.pbs",
    "qstat_cmd_tpl": "qstat -f {{ cluster_job_id }}",
    "qdel_cmd_tpl": "qdel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "qstat -B",
    "cache_path": "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_ssd",
    "cache_quota_mb": 100000,
    "cache_reserve_mb": 10000
}
aravi@polaris-login-02:/tmp/tmp.DXd0DbbGoj> cat cluster_script.sh
#!/bin/bash
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug
 
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd /lus/eagle/projects/FoundEpidem/aravi
qsub connect_workers.pbs

I’ll also include my connect_workers.pbs script for you in case it is needed:

aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi> cat connect_workers.pbs
#!/bin/bash
#PBS -N cryosparc_connect
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug
 
#TASKS
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0

cd /lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker

./bin/cryosparcw connect --worker $(hostname -f) --master polaris.alcf.anl.gov --port 18080 --ssdpath /lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_ssd

wtempel · June 24, 2025, 6:05pm

Typically (there are exceptions), the submission of CryoSPARC jobs via the CryoSPARC web app to a separate workload manager, like PBS, does not involve the cryosparcw connect command. If you were to follow the typical pattern, the line inside cluster_script.sh

qsub connect_workers.pbs

would be replaced with

{{ run_cmd }}

and qsub_cmd_tpl inside cluster_info.json would be defined as

"qsub_cmd_tpl": "qsub {{ script_path_abs }}",

(guide), where the run_cmd template variable will be automatically rendered as the CryoSPARC command running the CryoSPARC job and script_path_abs corresponds to the fully rendered, automatically generated PBS submission script (guide).
An exception would be a workflow where a single workstation-type (aka standalone) CryoSPARC instance runs on a cluster node as a cluster job that runs CryoSPARC jobs on that cluster node only (example).
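
A quick sanity check for the typical pattern described above can be sketched in shell. It assumes that the two files written by cryosparcm cluster dump (cluster_info.json and cluster_script.sh) are in the given directory; the function name check_cluster_templates is hypothetical and not part of CryoSPARC.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that dumped cluster config files use the
# template variables described above. Takes the directory that holds
# cluster_info.json and cluster_script.sh (default: current directory).
check_cluster_templates() {
    local dir="${1:-.}"
    local status=0

    # The submission script should run the job via an uncommented
    # {{ run_cmd }} line (lines starting with ## are ignored).
    if ! grep -Eq '^[[:space:]]*\{\{ *run_cmd *\}\}' "$dir/cluster_script.sh"; then
        echo "cluster_script.sh: no uncommented {{ run_cmd }} line found"
        status=1
    fi

    # The submit command should point at the generated script.
    if ! grep -Eq '"qsub_cmd_tpl".*\{\{ *script_path_abs *\}\}' "$dir/cluster_info.json"; then
        echo "cluster_info.json: qsub_cmd_tpl does not use {{ script_path_abs }}"
        status=1
    fi

    return "$status"
}

# Example (after running `cryosparcm cluster dump Polaris` in a temp dir):
#   check_cluster_templates .
```

The check only looks for the two template variables; it does not validate the PBS directives or any other part of the configuration.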

aravi · June 24, 2025, 6:55pm

Understood. I have changed the files back to what they should be, but the test job shows the same behavior, along with the same exit code.

wtempel · June 24, 2025, 8:18pm

Have you inspected the cluster job’s stdout and stderr files?

aravi · June 24, 2025, 9:06pm

Yes, the output of the command cryosparcm joblog PX JY shows:

[Tue, 24 Jun 2025 20:51:09 GMT]  License is valid.
[Tue, 24 Jun 2025 20:51:09 GMT]  Launching job on lane polaris target polaris ...
[Tue, 24 Jun 2025 20:51:09 GMT]  Launching job on cluster polaris
[Tue, 24 Jun 2025 20:51:09 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for PBS
## Available variables:
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J2 --master_hostname polaris.alcf.anl.gov --master_command_core_port 18002 > /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J2/job.log 2>&1             - the complete command string to run the job
## 1            - the number of CPUs needed
## 0            - the number of GPUs needed. 
##                            Note: The code will use this many GPUs starting from dev id 0.
##                                  The cluster scheduler has the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES or otherwise enuring that the
##                                  job uses the correct cluster-allocated GPUs.
## 8.0             - the amount of RAM needed in GB
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J2        - absolute path to the job directory
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test    - absolute path to the project dir
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J2/job.log   - absolute path to the log file for the job
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## --project P1 --job J2 --master_hostname polaris.alcf.anl.gov --master_command_core_port 18002           - arguments to be passed to cryosparcw run
## P1        - uid of the project
## J2            - uid of the job
## aravi        - name of the user that created the job (may contain spaces)
## aravi@anl.gov - cryosparc username of the user that created the job (usually an email)
##

## What follows is a simple PBS script:
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -A FoundEpidem
#PBS -q debug
 
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J2
/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J2 --master_hostname polaris.alcf.anl.gov --master_command_core_port 18002 > /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J2/job.log 2>&1 

==========================================================================
==========================================================================
[Tue, 24 Jun 2025 20:51:09 GMT]  -------- Submission command: 
qsub /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J2/queue_sub_script.sh
[Tue, 24 Jun 2025 20:51:09 GMT]  -------- Cluster Job ID: 
5238127.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
[Tue, 24 Jun 2025 20:51:09 GMT]  -------- Queued on cluster at 2025-06-24 20:51:09.479893
[Tue, 24 Jun 2025 20:51:09 GMT]  Cluster job status update for P1 J2 failed with exit code 35 (72 status update request retries)
qstat: 5238127.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov Job has finished, use -x or -H to obtain historical job information
aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_master>

It seems to be the same as what is displayed on the web interface.

Checking the job.json manually shows this:

{
    "project_uid": "P1",
    "uid": "J2",
    "PID_main": null,
    "PID_monitor": null,
    "PID_workers": [],
    "bench": {},
    "children": [],
    "cloned_from": null,
    "cluster_job_custom_vars": {},
    "cluster_job_id": null,
    "cluster_job_monitor_event_id": null,
    "cluster_job_monitor_last_run_at": null,
    "cluster_job_monitor_retries": 0,
    "cluster_job_status": null,
    "cluster_job_status_code": null,
    "cluster_job_submission_script": null,
    "completed_at": null,
    "completed_count": 0,
    "created_at": {
        "$date": "2025-06-24T20:51:05.494Z"
    },
    "created_by_job_uid": null,
    "created_by_user_id": "685b0ddb3a17018b6d14f59d",
    "deleted": true,
    "description": "Enter a description.",
    "enable_bench": false,
    "errors_build_inputs": {},
    "errors_build_params": {},
    "errors_run": [],
    "experiment_worker_path": null,
    "failed_at": null,
    "generate_intermediate_results": false,
    "has_error": false,
    "has_warning": false,
    "heartbeat_at": null,
    "input_slot_groups": [],
    "instance_information": {},
    "interactive": false,
    "interactive_hostname": "",
    "interactive_port": null,
    "intermediate_results_size_bytes": 0,
    "intermediate_results_size_last_updated": {
        "$date": "2025-06-24T21:15:19.633Z"
    },
    "is_ancestor_of_final_result": false,
    "is_experiment": false,
    "is_final_result": false,
    "job_dir": "J2",
    "job_dir_size": 0,
    "job_dir_size_last_updated": {
        "$date": "2025-06-24T21:15:19.633Z"
    },
    "job_type": "instance_launch_test",
    "killed_at": null,
    "last_accessed": {
        "name": "aravi",
        "accessed_at": {
            "$date": "2025-06-24T21:13:45.349Z"
        }
    },
    "last_intermediate_data_cleared_amount": 0,
    "last_intermediate_data_cleared_at": null,
    "last_scheduled_at": null,
    "last_updated": {
        "$date": "2025-06-24T21:15:19.644Z"
    },
    "launched_at": null,
    "output_group_images": {},
    "output_result_groups": [],
    "output_results": [],
    "params_base": {
        "use_all_gpus": {
            "type": "boolean",
            "value": true,
            "title": "Benchmark all available GPUs",
            "desc": "If enabled, benchmark all available GPUs on the target. This option may not work when submitting to a cluster resource manager.",
            "order": 0,
            "section": "resource_settings",
            "advanced": false,
            "hidden": true
        },
        "gpu_num_gpus": {
            "type": "number",
            "value": 0,
            "title": "Number of GPUs to benchmark",
            "desc": "The number of GPUs to request from the scheduler.",
            "order": 1,
            "section": "resource_settings",
            "advanced": false,
            "hidden": true
        },
        "use_ssd": {
            "type": "boolean",
            "value": false,
            "title": "Use SSD for Tests",
            "desc": "Whether or not to use the SSD on the worker for the tests.",
            "order": 2,
            "section": "resource_settings",
            "advanced": false,
            "hidden": true
        }
    },
    "params_secs": {
        "resource_settings": {
            "title": "Resource Settings",
            "desc": "",
            "order": 0
        }
    },
    "params_spec": {},
    "parents": [],
    "priority": 0,
    "project_uid_num": 1,
    "queue_index": null,
    "queue_message": null,
    "queue_status": null,
    "queued_at": null,
    "queued_job_hash": null,
    "queued_to_lane": "",
    "resources_allocated": {},
    "resources_needed": {
        "slots": {
            "CPU": 1,
            "GPU": 0,
            "RAM": 1
        },
        "fixed": {
            "SSD": false
        }
    },
    "run_as_user": null,
    "running_at": null,
    "started_at": null,
    "status": "killed",
    "title": "New Job J2",
    "tokens_acquired_at": null,
    "tokens_requested_at": null,
    "type": "instance_launch_test",
    "ui_tile_height": 1,
    "ui_tile_images": [],
    "ui_tile_width": 1,
    "uid_num": 2,
    "version": "v4.7.1",
    "waiting_at": null,
    "workspace_uids": [
        "W1"
    ],
    "ui_layouts": {},
    "last_exported": {
        "$date": "2025-06-24T21:15:19.644Z"
    },
    "no_check_inputs_ready": false,
    "queued_to_gpu": null,
    "queued_to_hostname": null,
    "num_tokens": null,
    "job_sig": null,
    "status_num": 45,
    "progress": [],
    "deleted_at": {
        "$date": "2025-06-24T21:15:19.633Z"
    }

aravi · June 24, 2025, 9:26pm

In running another job, this time it seems to have created a job.log file with a traceback error:



================= CRYOSPARCW =======  2025-06-24 21:20:54.659810  =========
Project P1 Job J3
Master polaris.alcf.anl.gov Port 18002
===========================================================================
MAIN PROCESS PID 1211599
========= now starting main process at 2025-06-24 21:20:54.660961
Traceback (most recent call last):
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 201, in cryosparc_master.cryosparc_compute.run.run
  File "cryosparc_master/cryosparc_compute/run.py", line 255, in cryosparc_master.cryosparc_compute.run.run
  File "cryosparc_master/cryosparc_compute/run.py", line 50, in cryosparc_master.cryosparc_compute.run.main
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 141, in connect
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 141, in connect
    db = usedb if usedb is not None else database_management.get_pymongo_client('meteor')['meteor']
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/database_management.py", line 221, in get_pymongo_client
    db = usedb if usedb is not None else database_management.get_pymongo_client('meteor')['meteor']
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/database_management.py", line 221, in get_pymongo_client
    assert client[database_name].list_collection_names() is not None
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1154, in list_collection_names
    assert client[database_name].list_collection_names() is not None
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1154, in list_collection_names
    return [result["name"] for result in self.list_collections(session=session, **kwargs)]
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1105, in list_collections
    return [result["name"] for result in self.list_collections(session=session, **kwargs)]
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1105, in list_collections
    return self.__client._retryable_read(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1540, in _retryable_read
    return self.__client._retryable_read(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1540, in _retryable_read
    return self._retry_internal(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/_csot.py", line 108, in csot_wrapper
    return self._retry_internal(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/_csot.py", line 108, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1507, in _retry_internal
    return func(self, *args, **kwargs)
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1507, in _retry_internal
    ).run()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2353, in run
    ).run()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2353, in run
    return self._read() if self._is_read else self._write()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2483, in _read
    return self._read() if self._is_read else self._write()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2483, in _read
    self._server = self._get_server()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2439, in _get_server
    self._server = self._get_server()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2439, in _get_server
    return self._client._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1322, in _select_server
    server = topology.select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 368, in select_server
    return self._client._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1322, in _select_server
    server = topology.select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 368, in select_server
    server = self._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 346, in _select_server
    server = self._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 346, in _select_server
    servers = self.select_servers(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 253, in select_servers
    servers = self.select_servers(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 253, in select_servers
    server_descriptions = self._select_servers_loop(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 303, in _select_servers_loop
    server_descriptions = self._select_servers_loop(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 303, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: polaris.alcf.anl.gov:18001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30.0s, Topology Description: <TopologyDescription id: 685b16c35c399377def0d5cb, topology_type: Single, servers: [<ServerDescription ('polaris.alcf.anl.gov', 18001) server_type: Unknown, rtt: None, error=AutoReconnect('polaris.alcf.anl.gov:18001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: polaris.alcf.anl.gov:18001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30.0s, Topology Description: <TopologyDescription id: 685b16c32ff1f2ea97413170, topology_type: Single, servers: [<ServerDescription ('polaris.alcf.anl.gov', 18001) server_type: Unknown, rtt: None, error=AutoReconnect('polaris.alcf.anl.gov:18001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

Entering the command curl polaris.alcf.anl.gov:18001 into the terminal shows the expected message: It looks like you are trying to access MongoDB over HTTP on the native driver port. I’m not sure what could cause this connection issue, as passwordless ssh is set up across all nodes and the hostnames are configured correctly.

wtempel · June 24, 2025, 9:40pm

Could you please confirm that the compute node on which the job ran can access port 18001 (as well as ports 18002, 18003, and 18005) on the CryoSPARC master server?

Also, PBS jobs’ stdout and stderr files (which may currently be saved at some default path) may hold useful information.
Custom paths for these files, such as inside the CryoSPARC job directory, can be specified with the #PBS -e and #PBS -o options (example) inside cluster_script.sh.
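
The port check can be sketched as a small loop run on the compute node, for example from an interactive PBS session. This is only a sketch: it assumes bash with /dev/tcp support and the coreutils timeout command, port_open and check_master_ports are hypothetical helper names, and the host and ports are the ones mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 5 seconds. Uses bash's /dev/tcp pseudo-device.
port_open() {
    local host="$1" port="$2"
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print one line per port saying whether the master host accepts
# connections on it. First argument is the host, the rest are ports.
check_master_ports() {
    local host="$1"; shift
    local port
    for port in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
}

# Example (run on the compute node):
#   check_master_ports polaris.alcf.anl.gov 18001 18002 18003 18005
```

A successful TCP connection only shows that something is listening on that port; it does not confirm which CryoSPARC instance is answering.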

aravi · June 25, 2025, 3:49pm

Compute node can reach the master ports:

aravi@x3004c0s25b0n0:~> curl polaris.alcf.anl.gov:18001
It looks like you are trying to access MongoDB over HTTP on the native driver port.
aravi@x3004c0s25b0n0:~> curl polaris.alcf.anl.gov:18002
Hello World from cryosparc command core.
aravi@x3004c0s25b0n0:~> curl polaris.alcf.anl.gov:18003
Hello World from cryosparc command vis.
aravi@x3004c0s25b0n0:~> curl polaris.alcf.anl.gov:18005
Hello World from cryosparc real-time processing manager.

I have added stdout and stderr paths to the cluster script; however, both files were empty when generated.

aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_master> cat cluster_script.sh
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for PBS
## Available variables:
## {{ run_cmd }}            - the complete command string to run the job
## {{ num_cpu }}            - the number of CPUs needed
## {{ num_gpu }}            - the number of GPUs needed. 
##                            Note: The code will use this many GPUs starting from dev id 0.
##                                  The cluster scheduler has the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES or otherwise enuring that the
##                                  job uses the correct cluster-allocated GPUs.
## {{ ram_gb }}             - the amount of RAM needed in GB
## {{ job_dir_abs }}        - absolute path to the job directory
## {{ project_dir_abs }}    - absolute path to the project dir
## {{ job_log_path_abs }}   - absolute path to the log file for the job
## {{ worker_bin_path }}    - absolute path to the cryosparc worker command
## {{ run_args }}           - arguments to be passed to cryosparcw run
## {{ project_uid }}        - uid of the project
## {{ job_uid }}            - uid of the job
## {{ job_creator }}        - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##

## What follows is a simple PBS script:
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -o {{ job_dir_abs }}/cluster.out
#PBS -e {{ job_dir_abs }}/cluster.err
#PBS -A FoundEpidem
#PBS -q debug
 
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd {{ job_dir_abs }}
{{ run_cmd }}


aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_master> cd ..
aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc> cd CS-test/J4
aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4> cat cluster.out
aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4> cat cluster.err
aravi@polaris-login-02:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4>

wtempel · June 25, 2025, 9:14pm

Thanks @aravi for trying that. What are the outputs of the commands

ls -la /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4/
cat /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4/job.log

aravi · June 25, 2025, 9:36pm

Output:

aravi@polaris-login-04:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4> ls -la /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4/
total 24
drwxrwsr-x 3 aravi FoundEpidem 4096 Jun 25 15:49 .
drwxrwsr-x 9 aravi FoundEpidem 4096 Jun 25 16:10 ..
-rw-rw-r-- 1 aravi FoundEpidem   18 Jun 25 15:49 events.bson
drwxrwsr-x 2 aravi FoundEpidem 4096 Jun 25 15:49 gridfs_data
-rw-rw-r-- 1 aravi FoundEpidem 4582 Jun 25 15:49 job.json
aravi@polaris-login-04:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4> cat /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4/job.log
cat: /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4/job.log: No such file or directory
aravi@polaris-login-04:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J4>

Can the job logs get automatically deleted?

wtempel · June 26, 2025, 5:45pm

Because cluster.out and cluster.err are also no longer present, I suspect the job has been cleared. You may want to queue the job again and, after the job has run (or been rejected or terminated by PBS), post the outputs of the commands:

projectid="P5"
jobid="J4"
cryosparcm eventlog $projectid $jobid | tail -n 50
cryosparcm joblog $projectid $jobid | tail -n 50
cat $(cryosparcm cli "get_project_dir_abs('$projectid')")/${jobid}/cluster.err
cat $(cryosparcm cli "get_project_dir_abs('$projectid')")/${jobid}/cluster.out

aravi · June 27, 2025, 3:10pm

So I ran another job and now have a log file. However, there is an odd error: the traceback references paths in another user’s directory. I’m not sure how this happened, as I could not find anything configured to point to that directory.

aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12> ls
cluster.err  cluster.out  job.log  queue_sub_script.sh
aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12> cat job.log


================= CRYOSPARCW =======  2025-06-27 14:44:06.621834  =========
Project P1 Job J12
Master polaris.alcf.anl.gov Port 39002
===========================================================================
MAIN PROCESS PID 3071713
========= now starting main process at 2025-06-27 14:44:06.623051
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 255, in cryosparc_master.cryosparc_compute.run.run
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 201, in cryosparc_master.cryosparc_compute.run.run
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 133, in connect
  File "cryosparc_master/cryosparc_compute/run.py", line 50, in cryosparc_master.cryosparc_compute.run.main
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 133, in connect
    cli.test_authentication(project_uid, job_uid)
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 122, in func
    cli.test_authentication(project_uid, job_uid)
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 122, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://polaris.alcf.anl.gov:39002, code 400) Encountered ServerError from JSONRPC function "test_authentication" with params ('P1', 'J12'):
ServerError: P1 J12 does not exist.
Traceback (most recent call last):
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/commandcommon.py", line 156, in wrapper
    res = func(*args, **kwargs)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 670, in test_authentication
    job_status = get_job_status(project_uid, job_uid)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
    return func(*args, **kwargs)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 7702, in get_job_status
    return get_job(project_uid, job_uid, 'status')['status']
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
    return func(*args, **kwargs)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6132, in get_job
    raise ValueError(f"{project_uid} {job_uid} does not exist.")
ValueError: P1 J12 does not exist.

    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://polaris.alcf.anl.gov:39002, code 400) Encountered ServerError from JSONRPC function "test_authentication" with params ('P1', 'J12'):
ServerError: P1 J12 does not exist.
Traceback (most recent call last):
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/commandcommon.py", line 156, in wrapper
    res = func(*args, **kwargs)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 670, in test_authentication
    job_status = get_job_status(project_uid, job_uid)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
    return func(*args, **kwargs)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 7702, in get_job_status
    return get_job(project_uid, job_uid, 'status')['status']
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
    return func(*args, **kwargs)
  File "/lus/grand/projects/TwinHostPath/cryo_sparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6132, in get_job
    raise ValueError(f"{project_uid} {job_uid} does not exist.")
ValueError: P1 J12 does not exist.

aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12> cat cluster.err
/home/aravi/.bashrc: line 46: pyenv: command not found
aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12> cat cluster.out
aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12> cat queue_sub_script.sh
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for PBS
## Available variables:
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J12 --master_hostname polaris.alcf.anl.gov --master_command_core_port 39002 > /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12/job.log 2>&1             - the complete command string to run the job
## 1            - the number of CPUs needed
## 0            - the number of GPUs needed. 
##                            Note: The code will use this many GPUs starting from dev id 0.
##                                  The cluster scheduler has the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES or otherwise enuring that the
##                                  job uses the correct cluster-allocated GPUs.
## 8.0             - the amount of RAM needed in GB
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12        - absolute path to the job directory
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test    - absolute path to the project dir
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12/job.log   - absolute path to the log file for the job
## /lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## --project P1 --job J12 --master_hostname polaris.alcf.anl.gov --master_command_core_port 39002           - arguments to be passed to cryosparcw run
## P1        - uid of the project
## J12            - uid of the job
## aravi        - name of the user that created the job (may contain spaces)
## aravi@anl.gov - cryosparc username of the user that created the job (usually an email)
##

## What follows is a simple PBS script:
#PBS -N cryosparc_job
#PBS -l select=1:system=polaris,walltime=01:00:00
#PBS -l filesystems=home:eagle
#PBS -o /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12/cluster.out
#PBS -e /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12/cluster.err
#PBS -A FoundEpidem
#PBS -q debug
 
module load nvhpc/23.9 PrgEnv-nvhpc/8.5.0
cd /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12
/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J12 --master_hostname polaris.alcf.anl.gov --master_command_core_port 39002 > /lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12/job.log 2>&1 

aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J12>

wtempel · June 27, 2025, 3:36pm

Is it possible that the other user is operating another CryoSPARC master instance on the same server as your master instance, with an overlapping port range? If multiple CryoSPARC master instances run on the same server, their port ranges must not overlap.
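As a rough illustration of what "overlapping" means here, the sketch below compares two base ports. It assumes each master instance occupies a contiguous block of 10 ports starting at its base port; that width, and the helper names `port_range` and `ranges_overlap`, are assumptions for illustration only, so please confirm the actual range against the installation guide for your CryoSPARC version.

```python
def port_range(base_port: int, width: int = 10) -> range:
    """Ports occupied by one instance: base_port .. base_port + width - 1.

    width=10 is an assumption, not a value taken from this thread.
    """
    return range(base_port, base_port + width)


def ranges_overlap(base_a: int, base_b: int, width: int = 10) -> bool:
    """Return True if the two instances' port blocks share at least one port."""
    a = port_range(base_a, width)
    b = port_range(base_b, width)
    # Two half-open intervals [start, stop) intersect when each one
    # starts before the other one ends.
    return a.start < b.stop and b.start < a.stop


# Hypothetical example: two instances whose base ports are only 5 apart collide.
print(ranges_overlap(39000, 39005))
# Base ports that are far apart do not.
print(ranges_overlap(39000, 23000))
```

Under these assumptions, any two base ports closer together than the block width would conflict, so each user on a shared login node would need a base port at least that far from everyone else's.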

aravi · June 27, 2025, 3:44pm

That is very possible. There are multiple users logged into this cluster at once, and one of them may have been using the same port number as me. I have changed the port number and re-run a job. Here is the output of job.log:

aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J14> cat job.log


================= CRYOSPARCW =======  2025-06-27 15:42:30.546529  =========
Project P1 Job J14
Master polaris.alcf.anl.gov Port 23002
===========================================================================
MAIN PROCESS PID 4160106
========= now starting main process at 2025-06-27 15:42:30.547754
Traceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 201, in cryosparc_master.cryosparc_compute.run.run
  File "cryosparc_master/cryosparc_compute/run.py", line 255, in cryosparc_master.cryosparc_compute.run.run
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 141, in connect
  File "cryosparc_master/cryosparc_compute/run.py", line 50, in cryosparc_master.cryosparc_compute.run.main
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 141, in connect
    db = usedb if usedb is not None else database_management.get_pymongo_client('meteor')['meteor']
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/database_management.py", line 221, in get_pymongo_client
    db = usedb if usedb is not None else database_management.get_pymongo_client('meteor')['meteor']
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/cryosparc_compute/database_management.py", line 221, in get_pymongo_client
    assert client[database_name].list_collection_names() is not None
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1154, in list_collection_names
    assert client[database_name].list_collection_names() is not None
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1154, in list_collection_names
    return [result["name"] for result in self.list_collections(session=session, **kwargs)]
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1105, in list_collections
    return [result["name"] for result in self.list_collections(session=session, **kwargs)]
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/database.py", line 1105, in list_collections
    return self.__client._retryable_read(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1540, in _retryable_read
    return self.__client._retryable_read(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1540, in _retryable_read
    return self._retry_internal(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/_csot.py", line 108, in csot_wrapper
    return self._retry_internal(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/_csot.py", line 108, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1507, in _retry_internal
    return func(self, *args, **kwargs)
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1507, in _retry_internal
    ).run()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2353, in run
    ).run()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2353, in run
    return self._read() if self._is_read else self._write()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2483, in _read
    return self._read() if self._is_read else self._write()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2483, in _read
    self._server = self._get_server()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2439, in _get_server
    self._server = self._get_server()
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2439, in _get_server
    return self._client._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1322, in _select_server
    return self._client._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1322, in _select_server
    server = topology.select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 368, in select_server
    server = topology.select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 368, in select_server
    server = self._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 346, in _select_server
    server = self._select_server(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 346, in _select_server
    servers = self.select_servers(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 253, in select_servers
    servers = self.select_servers(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 253, in select_servers
    server_descriptions = self._select_servers_loop(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 303, in _select_servers_loop
    server_descriptions = self._select_servers_loop(
  File "/lus/eagle/projects/FoundEpidem/aravi/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/topology.py", line 303, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: polaris.alcf.anl.gov:23001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30.0s, Topology Description: <TopologyDescription id: 685ebbf7e2e403c7bab6955d, topology_type: Single, servers: [<ServerDescription ('polaris.alcf.anl.gov', 23001) server_type: Unknown, rtt: None, error=AutoReconnect('polaris.alcf.anl.gov:23001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: polaris.alcf.anl.gov:23001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30.0s, Topology Description: <TopologyDescription id: 685ebbf74dc488850b22b74b, topology_type: Single, servers: [<ServerDescription ('polaris.alcf.anl.gov', 23001) server_type: Unknown, rtt: None, error=AutoReconnect('polaris.alcf.anl.gov:23001: [Errno 101] Network is unreachable (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>
aravi@polaris-login-01:/lus/eagle/projects/FoundEpidem/aravi/cryosparc/CS-test/J14>

wtempel · June 27, 2025, 4:09pm

You may want to ensure that port usage between all users of the server is carefully coordinated.

This error is equivalent to the one aravi posted earlier in this thread (Jobs Fail with Exit Code 35, post #8). You may want to follow similar diagnostic procedures and, as needed, post their results.
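One generic check that may help narrow this down (not necessarily the same procedure as in the earlier post) is to test, from inside a compute-node job, whether a TCP connection to the master's ports can be opened at all. The sketch below uses only the Python standard library; the helper name `can_connect` is made up for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, "Network is unreachable", refused
        # connections, and timeouts.
        return False
```

For example, running `can_connect("polaris.alcf.anl.gov", 23001)` and `can_connect("polaris.alcf.anl.gov", 23002)` from a compute node (hostname and ports taken from the job.log above) would show whether the worker can reach the database and command_core ports. A `False` result only says that the connection failed; it does not say whether the cause is routing, a firewall, or the hostname resolving to an address that is not reachable from the compute nodes.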

aravi · June 27, 2025, 4:36pm

Does CryoSPARC have to use a TCP connection between the master and the worker? They are both installed on the same cluster, with the master on the login node and the worker on a compute node. They both have access to the same file system as well.

wtempel · June 27, 2025, 5:34pm

Please see the guide for the ports required.