CryoSPARC timing out when connecting to worker

We are setting up a dedicated cloud instance to use as a CryoSPARC worker. Unfortunately, CryoSPARC keeps timing out when it tries to launch a job. The ssh config is set up correctly with pubkey auth, and there are no issues with the worker install.

The CryoSPARC master just refuses to connect, with no logs. Is there any way to troubleshoot this? I can ssh into the cloud server from the master server using the exact same public key as for the other workers.

Did you confirm that the cloud-based worker receives a response from the master when you run the following commands on the worker:

curl <master_hostname>:<port>

where <master_hostname> corresponds to $CRYOSPARC_MASTER_HOSTNAME and <port> to $((CRYOSPARC_BASE_PORT+1)), $((CRYOSPARC_BASE_PORT+2)), and $((CRYOSPARC_BASE_PORT+6))?
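As a minimal sketch, the three checks can be run in one small shell loop on the worker. The hostname and base port below are placeholders; substitute the values of CRYOSPARC_MASTER_HOSTNAME and CRYOSPARC_BASE_PORT from your master's configuration:

    # run on the cloud worker; MASTER and BASE are placeholders
    MASTER=cryosparc-master.example.org
    BASE=39000                      # CryoSPARC's default base port
    for offset in 1 2 6; do
        echo "--- ${MASTER}:$((BASE + offset)) ---"
        curl --max-time 5 "${MASTER}:$((BASE + offset))" || echo "no response"
    done

Each port should return some response rather than timing out; a timeout usually points at a firewall or security-group rule between the cloud instance and the on-premise master.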

Yes, I receive a response for each port.

Next:

  1. Are the project directories mounted under identical paths on master and worker?
  2. Do the uids for the Linux account(s) that run the instance match between master and worker?
  3. Can the relevant Linux account on the worker write to the job directory? Write access could be blocked by a uid mismatch or by NFS export and mount options. A quick test is sketched below.
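A quick write test for item 3, assuming a hypothetical job directory path such as /path/to/projects/P1/J42 that is visible on the worker:

    # run on the cloud worker as the account that executes jobs (here, exouser)
    JOBDIR=/path/to/projects/P1/J42     # placeholder path
    id                                  # note which uid/gid exouser maps to on this mount
    touch "${JOBDIR}/.write_test" && rm "${JOBDIR}/.write_test" \
        && echo "write OK" || echo "write FAILED"

If the write fails even though the directory is visible, typical causes are a uid mismatch combined with restrictive directory permissions, or NFS export options (for example root_squash/all_squash) remapping the writing account.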
  1. It’s a government cloud instance, so we’re limited to the user it allows us to log in as (in this case, exouser). However, the cryosparc user can ssh in with its public key, just like with the other workers. The ssh config file is set up to reflect the need to ssh in as “exouser” instead of “cryosparc”.

  2. For the same reason as above, no, the uids do not match up.

  3. Yes, exouser can write to the job directories.

When connecting the worker, cryosparcw says it will connect as exouser@, which is what we want, right?

Interesting. How did you ensure that, under the constraints of mismatching user ids, automatically generated job directories are writable by the other user?

I am not sure. Let’s assume for a moment that your master can connect to the cloud instance using the exoworker123.cloud hostname.
Under this assumption,

  1. does the output of
    cryosparcm cli "get_scheduler_targets()"
    
    include an element with
    "hostname": "exoworker123.cloud"
    and
    "ssh_str": "exouser@exoworker123.cloud"
  2. can user cryosparc connect from the master to the cloud worker with the following command?
    ssh exouser@exoworker123.cloud
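Since the scheduler connects non-interactively, it may also be worth repeating the check for item 2 without a terminal. A minimal sketch, run on the master as the cryosparc user:

    # fails fast instead of prompting for a password or host-key confirmation
    ssh -o BatchMode=yes -o ConnectTimeout=10 exouser@exoworker123.cloud 'hostname && echo connection OK'

If this prompts for anything, hangs, or fails, the scheduler's ssh connection will likely fail in the same way.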

There’s only a single user on the cloud instance, exouser. So the directories are being created by exouser each time.

  1. Yes, the hostname and ssh_str both show up, and the cryosparc user can connect using ssh exouser@<server.name>

I must have misunderstood your description of the infrastructure. Aren't job directories named J<number> created by the Linux user cryosparc on the on-premise master host and shared with the cloud worker host?

It was my understanding that CryoSPARC recreated the directories on the worker in the cryosparc_scratch directory.

CryoSPARC requires shared access to the project directories (details).
CryoSPARC job directories are created on the master node, and worker nodes must be able to write to the job directories when they are running the jobs.
In addition, particle stacks may optionally be cached to fast scratch storage. Caching does not eliminate the need for worker access to the job directory.
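As an illustration only, with hypothetical hostnames, paths, and options, sharing the project directories over NFS so that they resolve to the same absolute path on both sides might look like this:

    # /etc/exports on the on-premise storage host (placeholder path and network)
    /data/cryosparc_projects  10.0.0.0/16(rw,sync,no_subtree_check)

    # /etc/fstab on the cloud worker, mounted at the *same* path as on the master
    storage.onprem.example:/data/cryosparc_projects  /data/cryosparc_projects  nfs  rw,hard  0  0

The essential points are that a given project directory has the same absolute path on master and worker, and that the account running the job on the worker (here exouser) can write inside the J<number> job directories.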

Even after sharing the project directories, CryoSPARC is still stuck on the “launched” screen, and there are still no logs.

  1. Do the paths to each given project directory match between on-premise master and cloud worker?
  2. How did you ensure that exouser on the cloud worker can write to each relevant job directory that was created by cryosparc? Have you tested file creation by exouser inside such a job directory?
  3. Have you inspected the command_core log for relevant messages? Useful information may appear even in INFO messages, not just WARNING or ERROR messages. One way to check is sketched below.
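A minimal sketch of that log check, run on the master as the cryosparc user, assuming the usual cryosparcm log subcommand; the job ID below is a placeholder:

    # follow the scheduler log while (re)launching the job (Ctrl-C to stop)
    cryosparcm log command_core

    # or filter the live output for a specific job, e.g. a hypothetical J42
    cryosparcm log command_core | grep --line-buffered -i "J42"

If even the INFO lines show no attempt to contact the cloud worker when the job is launched, the scheduler may not be dispatching to that target at all.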