cryoSPARC treats master worker as remote

Slogging through a series of issues after a cooler and motherboard failure, and the subsequent replacement, on our workstation. Previously we were running CentOS 7, but we couldn’t get a clean install to work, so we have moved to an Ubuntu install on an internal SSD. The previous drives are mounted and we’re able to see the file system, including rescuing the latest cryoSPARC database backup (from 3/21).

We have installed a fresh instance of cryoSPARC 4.2.1 on Ubuntu 20.04.6. I have successfully restored the backed-up database into the new instance. We regularly keep our data and project directories on a cluster, which we mount via sshfs in order to use this workstation’s GPUs. There was previously a remote worker attached to this station, though we haven’t gotten to the step of reconnecting it yet.
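In case the mount itself matters, it is along these lines (the user, host, and paths here are placeholders, not our actual ones):

```shell
# Hypothetical sshfs mount of the cluster project directory; substitute
# your own user, host, and paths. The reconnect/keepalive options help
# the mount survive brief network interruptions.
sshfs user@cluster.example.edu:/data/projects /mnt/cluster \
    -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3

# To unmount later:
# fusermount -u /mnt/cluster
```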

Starting a new project on the mounted cluster, I am able to import micrographs from a different directory on the mounted cluster and run exposure utilities to select a subset of 100 to continue troubleshooting with. Attempting to run CTF Estimation gives:

[2023-04-14 12:21:48.62] License is valid.
[2023-04-14 12:21:48.71] Launching job on lane [workstation_name] target [workstation_name].[network_name].edu …
[2023-04-14 12:21:49.43] Running job on remote worker node hostname [workstation_name].[network_name].edu
[2023-04-14 12:21:58.67] Failed to launch! 255 Host key verification failed.

I find it odd that cryoSPARC thinks it is running the job on a remote worker node. The two lanes available for queuing after importing the database are the original lanes I had before this fresh install (i.e., [workstation] and [remote_workstation_that’s_not_reconnected_yet]).

I wonder whether importing the old database somehow results in the old local worker being seen as a remote worker, and whether on top of that we have an SSH issue. ~/.ssh/known_hosts is empty.
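An empty ~/.ssh/known_hosts would at least explain the “Host key verification failed” message: a non-interactive job launch cannot accept a new host key. One way to test the connection and populate the file might be (hostname and user are placeholders):

```shell
# Test the password-less ssh connection cryoSPARC would use.
# Running it interactively once lets you accept and record the host key:
ssh cryosparcuser@workstation.example.edu true

# Alternatively, pre-populate known_hosts non-interactively:
ssh-keyscan workstation.example.edu >> ~/.ssh/known_hosts
```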

In a likely related issue, attempting to run any job in our previous workspaces gives the error:
“Job directory /path/to/cluster/project/job/ is not empty. found /path/to/cluster/project/job/job.log”

EDIT: restarting the workstation and cryoSPARC changes the behavior when attempting to run jobs in previous workspaces. The error for a NU Refinement is now also “Failed to launch! 255 Host key verification failed.”

Any thoughts or suggestions would be appreciated,


Given the various messages, which are likely related in some way, I cannot answer comprehensively, but maybe I can help step by step.

Part of cryosparcw connect is the creation of a scheduler target record in the database. The “that’s_not_reconnected_yet” part may therefore no longer be true if the restored database backup already included that worker.
A worker is considered remote if its hostname in the
cryosparcm cli "get_scheduler_targets()" output does not match $CRYOSPARC_MASTER_HOSTNAME as defined inside cryosparc_master/config.sh.
Have you tested (after the upgrade/repair) the password-less ssh connection from the master to that “remote” worker?
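As a minimal sketch of that local-vs-remote decision (the values below are hypothetical; substitute the ones from your own config.sh and get_scheduler_targets() output):

```shell
# Hypothetical values illustrating how cryoSPARC decides local vs. remote:
master_hostname="ahuramazda"            # $CRYOSPARC_MASTER_HOSTNAME in cryosparc_master/config.sh
worker_hostname="ahuramazda.ucsf.edu"   # 'hostname' field of the scheduler target record

# If the two strings do not match exactly, the worker is launched over ssh:
if [ "$worker_hostname" = "$master_hostname" ]; then
    echo "worker runs locally on the master"
else
    echo "worker is treated as remote and launched via ssh"
fi
```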
For more concrete suggestions, please can you post the outputs of

cryosparcm status | grep -v LICENSE
cryosparcm cli "get_scheduler_targets()"

output of cryosparcm status | grep -v LICENSE:

CryoSPARC System master node installed at
Current cryoSPARC version: v4.2.1

CryoSPARC process status:

app                              RUNNING   pid 4060, uptime 10:43:50
app_api                          RUNNING   pid 4078, uptime 10:43:48
app_api_dev                      STOPPED   Not started
app_legacy                       STOPPED   Not started
app_legacy_dev                   STOPPED   Not started
command_core                     RUNNING   pid 3950, uptime 10:44:02
command_rtp                      RUNNING   pid 4017, uptime 10:43:55
command_vis                      RUNNING   pid 4012, uptime 10:43:56
database                         RUNNING   pid 3584, uptime 10:44:06

License is valid

global config variables:
export CRYOSPARC_DB_PATH="/home/cryosparcuser/cryosparc/cryosparc_database"

output of cryosparcm cli "get_scheduler_targets()":

[{'cache_path': '/scratch/2021-cryosparc_cache',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 25634078720, 'name': 'Quadro P6000'},
           {'id': 1, 'mem': 25637224448, 'name': 'Quadro P6000'}],
  'hostname': 'ahuramazda.ucsf.edu',
  'lane': 'ahuramazda',
  'monitor_port': None,
  'name': 'ahuramazda.ucsf.edu',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7],
                     'GPU': [0, 1],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cryosparc@ahuramazda.ucsf.edu',
  'title': 'Worker node ahuramazda.ucsf.edu',
  'type': 'node',
  'worker_bin_path': '/home/cryosparc/software/cryosparc/cryosparc2_worker/bin/cryosparcw'},
 {'cache_path': '/media/cryosparcuser/2021-cryosparc/',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 25637355520, 'name': 'Quadro P6000'},
           {'id': 1, 'mem': 25637355520, 'name': 'Quadro P6000'}],
  'hostname': 'ahuramazda',
  'lane': 'ahuramiata',
  'monitor_port': None,
  'name': 'ahuramazda',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                     'GPU': [0, 1],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cryosparcuser@ahuramazda',
  'title': 'Worker node ahuramazda',
  'type': 'node',
  'worker_bin_path': '/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]

For what it’s worth, since the original post I went ahead and removed the lane pointing to [remote_workstation_that’s_not_reconnected_yet] and added the worker on the master workstation as a new lane, the same way one would add a remote worker. That has allowed me to run on the new lane, with what looks to be the correct initiation sequence:

[2023-04-14 14:40:30.38] License is valid.
[2023-04-14 14:40:30.38] Launching job on lane [lane_name] target [workstation_name] ...
[2023-04-14 14:40:30.42] Running job on master node hostname [workstation_name]
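For anyone hitting the same thing, the steps were roughly the following (lane name, hostnames, and paths are placeholders; check cryosparcw connect --help on your version for the exact options):

```shell
# Remove the stale lane left over from the restored database
# (the lane name here is a placeholder):
cryosparcm cli "remove_scheduler_lane('old_lane_name')"

# Re-connect the master's own worker, the same way one would add a
# remote worker (run from the cryosparc_worker directory; hostnames
# and the SSD cache path are placeholders):
bin/cryosparcw connect \
    --worker workstation.example.edu \
    --master workstation.example.edu \
    --port 39000 \
    --ssdpath /path/to/ssd/cache
```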

EDIT: I realize it would be clearer if I specified: the workstation is ‘ahuramazda’. For whatever reason, the hostname is now just ahuramazda, whereas before the database restore and the rebuild of the machine it was ahuramazda.ucsf.edu. The old lane that has persisted is called ‘ahuramazda’, and I cannot run on that. The new lane created above is ‘ahuramiata’.

Do you consider your question resolved? If not, please can you restate it to account for any modifications that you have made to your CryoSPARC instance.

I would consider it to be resolved, yes, thank you for your timely and continued support of this wonderful tool.
