Jobs stuck in ‘launched’ mode after installation of v4.4.1

Hi,

I am attempting a fresh install of v4.4.1 on a new server connected to a separate worker node (both running the latest Ubuntu). I followed the installation instructions to install the master node only, then the worker node, and made sure the prerequisites were met. I then followed the connection instructions to connect the worker node. The worker node’s four GPUs are recognised when submitting a job, but the job remains stuck in ‘launched’ mode and nvidia-smi shows nothing running on any of the GPUs. The job log says something along the lines of ‘cannot find config.sh - are you sure cryosparc is installed?’ (I can get the exact wording of the job log if needed).

My guess is that either something has gone wrong with the connection (as in, it thinks it is submitting the job locally) or it is a permissions issue (ssh itself seems okay: after setting up an SSH key I can log into the worker node from the master node without a password). I’d be grateful for tips from anyone who has had similar experiences - thanks very much!


Is the project directory shared with the worker node under the same path as on the master node?

Hi Wolfram - thanks for your reply, and good point. I tried setting this up with our shared file system via an sshfs mount instead. However, cryosparc creates an empty project directory with the correct name but then does not recognise or write to it, and this error message pops up:

Unable to create project: ServerError: Error: new project directory not writable /home/cryosparc4/Mount2/Work/cryosparc4/tutorial/CS-tutorial

I made the container directory wide open, but there still seems to be a permissions issue somewhere. Do you know what may be causing this?

Unfortunately, I am unfamiliar with sharing CryoSPARC project directories via sshfs. In generic terms, I can see quite a few things going wrong even with “wide open” permissions (which I would discourage):

  • the directory might not be “served” in write-enabled mode
  • the directory might not be mounted in write-enabled mode on the storage client
  • write access on the storage client might additionally be restricted to specific Linux accounts
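
To narrow down which of these applies, one option is to test whether a file can actually be created in the parent of the project directory, rather than relying on the permission bits. The following is only a minimal sketch: the helper name is_writable is made up for illustration, and the path is taken from the error message above, so substitute your own. It would need to be run as the Linux account that owns the CryoSPARC installation, on both the master and the worker.

```python
import os
import tempfile


def is_writable(directory):
    """Return True if a file can actually be created (and removed) in `directory`."""
    try:
        # Creating a real file exercises the mount itself, not just the mode bits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".write_test_"):
            pass
        return True
    except OSError:
        # Covers a missing directory, permission denied, read-only mounts, etc.
        return False


if __name__ == "__main__":
    # Assumed path (parent of the project directory in the error message).
    project_parent = "/home/cryosparc4/Mount2/Work/cryosparc4/tutorial"
    print(os.path.abspath(project_parent), "writable:", is_writable(project_parent))
```

If this reports False on one machine but True on the other, the problem is more likely in how the directory is mounted on that machine than in the permissions on the server.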

Hi Wolfram - I managed to sort this problem out by changing the sshfs options (i.e. -o allow_other,follow_symlinks, but NOT default_permissions) and can now create project directories.
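
For anyone wanting to reproduce the mount, here is a rough sketch of how the sshfs command line could be assembled with those options. The remote spec and mount point are placeholders, not my real paths, and the helper name sshfs_command is made up. The script only prints the command so it can be inspected before running it by hand. On many systems a non-root user also needs user_allow_other enabled in /etc/fuse.conf before allow_other is accepted.

```python
import shlex


def sshfs_command(remote, mount_point, options=("allow_other", "follow_symlinks")):
    """Build an sshfs command line; default_permissions is deliberately left out."""
    return ["sshfs", remote, mount_point, "-o", ",".join(options)]


if __name__ == "__main__":
    # Placeholder remote export and local mount point; replace with your own.
    cmd = sshfs_command("user@fileserver:/export/Work", "/home/cryosparc4/Mount2")
    print(shlex.join(cmd))
```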

In addition, I can now submit to the worker node properly (and jobs actually run!). The issue was with the connection setup: if you are accessing the worker as username@server, it is necessary to supply --sshstr in the correct direction. I removed the original connection via cryosparcm cli "remove_scheduler_lane('server')" and redid it with the added option as above.
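
In case it helps someone else, below is a sketch of the reconnect step as a small script that prints the cryosparcw connect command rather than running it. The hostnames, username and port are placeholders (39000 is assumed here as the base port; use whatever your master was installed with), and the helper name connect_command is made up. As I understand it, --sshstr is the string the master uses to ssh into the worker. Please check the CryoSPARC guide for the flags supported by your version, e.g. the SSD cache options, which can be passed via extra.

```python
import shlex


def connect_command(worker, master, sshstr, port=39000, extra=()):
    """Build a `cryosparcw connect` command line, to be run on the worker node."""
    return [
        "bin/cryosparcw", "connect",
        "--worker", worker,    # hostname of the worker node
        "--master", master,    # hostname of the master node
        "--port", str(port),   # base port of the master (assumed value)
        "--sshstr", sshstr,    # how the master reaches the worker over ssh
        *extra,
    ]


if __name__ == "__main__":
    # Placeholder values; replace with your own hostnames and account.
    cmd = connect_command("worker-hostname", "master-hostname", "username@worker-hostname")
    print(shlex.join(cmd))
```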
