Jobs stuck in ‘launched’ mode after installation of v4.4.1

Hi,

I am attempting a fresh install of v4.4.1 on a new server connected to a separate worker node (both running the latest Ubuntu). I followed the installation instructions to install the master node first and then the worker node, and made sure the prerequisites were met. I then followed the connection instructions to connect the worker node. The worker node’s four GPUs are recognised when submitting a job, but the job remains stuck in ‘launched’ mode and nvidia-smi shows nothing running on any of the GPUs. The job log says something along the lines of ‘cannot find config.sh - are you sure cryosparc is installed?’ (I can get the exact wording of the job log if needed).

My thinking is that either something has gone wrong with the connection (i.e. it thinks it is submitting the job locally) or it is a permissions issue (SSH seems fine, as I can log into the worker node from the master node without a password after setting up an SSH key). I’d be grateful for tips from anyone who has had a similar experience - thanks very much!
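
For reference, the passwordless SSH from the master looks something like the check below; the account name, worker hostname, and install path are placeholders rather than our real values.

# run on the master node; should print the worker's hostname without a password prompt
ssh cryosparcuser@worker-node hostname
# given the 'cannot find config.sh' error, it may also be worth confirming that config.sh
# exists under the cryosparc_worker install path that was registered with the master
ssh cryosparcuser@worker-node "ls /path/to/cryosparc_worker/config.sh"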

Is the project directory shared with the worker node under the same path as on the master node?

Hi Wolfram - thanks for your reply, and good point. I tried setting this up with our shared file system via an sshfs mount instead; however, CryoSPARC makes an empty project directory with the correct name but does not recognise or write to the directory, and this error message pops up:

Unable to create project: ServerError: Error: new project directory not writable /home/cryosparc4/Mount2/Work/cryosparc4/tutorial/CS-tutorial

I made the containing directory wide open, but it still looks like there is a permissions issue somewhere. Do you know what may be causing this?

Unfortunately, I am unfamiliar with sharing CryoSPARC project directories via sshfs. In generic terms, I can see quite a few things going wrong even with “wide open” permissions (which I would discourage):

  • the directory might not be “served” in write-enabled mode
  • the directory might not be mounted in write-enabled mode on the storage client
  • write access on the storage client might additionally be restricted to specific Linux accounts
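
A quick, generic way to check the last two points is to log in on the storage client as the Linux account that CryoSPARC runs under and test the mount directly; the account and path below are placeholders.

# on the storage client (e.g. the worker node), as the CryoSPARC Linux account:
# show the filesystem that contains the path and whether it is mounted rw or ro
findmnt -T /path/to/mounted/project_container
# try to create and remove a file; failure here points at mount options or account permissions
touch /path/to/mounted/project_container/.write_test && echo "writable" || echo "not writable"
rm -f /path/to/mounted/project_container/.write_test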

Hi Wolfram - I managed to sort this problem out by changing the sshfs options, and I can now create project directories (i.e. -o allow_other,follow_symlinks, but NOT with default_permissions).
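
For reference, the working mount command looks roughly like this; the remote server and export path are placeholders, while the local mount point is the one from the error message above.

# allow_other lets accounts other than the one that ran sshfs access the mount;
# when mounting as a non-root user this typically requires user_allow_other in /etc/fuse.conf
sshfs -o allow_other,follow_symlinks user@fileserver:/export/Work /home/cryosparc4/Mount2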

In addition, I can now submit to the worker node properly (and jobs actually run!). The remaining issue was in the setup configuration: if you are connecting as username@server, it is necessary to supply --sshstr in the correct direction. I removed the original connection via cryosparcm cli “remove_scheduler_lane(‘server’)” and redid the worker connection with that option added, roughly as sketched below.
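
The sequence, in case it helps anyone else; the lane name is the one from my original connection, while the hostnames, port, and account are placeholders for our real values.

# on the master node: remove the original scheduler lane
cryosparcm cli "remove_scheduler_lane('server')"
# on the worker node, from the cryosparc_worker install directory: re-connect with an explicit
# ssh string, i.e. the account@host that the master should use to ssh into the worker
bin/cryosparcw connect --worker worker.example.org --master master.example.org --port 39000 \
    --sshstr cryosparcuser@worker.example.org
# add --ssdpath /path/to/ssd or --nossd according to your cache setup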

Hi Wolfram,

Just to preface: I’m relatively new to CryoSPARC, so I’m still learning the ropes of configuring the software. I’ve encountered this exact same issue (passwordless SSH set up and confirmed working, workers connected, but jobs stuck in the ‘launched’ state), and since the aforementioned sshfs method sounds atypical based on your response, we’re keeping it as a last resort.

I’ve taken a look at the cryosparcm log command_core output and noticed that a lack of shared filesystem structure is more than likely what’s causing our issue:

cryosparc_user@cryosparc_master:~/software/cryosparc/cryosparc_master$ cryosparcm log command_core
2024-02-27 16:50:13,565 run_job              INFO     |         Running job using: /home/cryosparc_user/software/cryosparc/cryosparc_worker/bin/cryosparcw
2024-02-27 16:50:13,566 run_job              INFO     |         Running job on remote worker node hostname cryosparc_worker
2024-02-27 16:50:13,566 run_job              INFO     |         cmd: bash -c "nohup /home/cryosparc_user/software/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J1 --master_hostname cryosparc_master --master_command_core_port 39002 > /home/cryosparc_user/jobs/CS-test/J1/job.log 2>&1 & "
2024-02-27 16:50:14,028 run_job              INFO     | bash: line 1: /home/cryosparc_user/jobs/CS-test/J1/job.log: No such file or directory

2024-02-27 16:50:14,029 scheduler_run_core   INFO     | Finished

Based on your previous response, I’d assume the referenced directory/file that wasn’t found should already exist on the worker node? If so, are we required to create these directories manually, or should CryoSPARC create them automatically as needed when running a job? And if it should be automatic, is there a way to remedy this issue, since there doesn’t even seem to be an attempt to create the proper directories on the worker node?

Thank you in advance for your help.

The J1/ directory should have been created on the CryoSPARC master host as soon as you created the job. It is assumed that /home/cryosparc_user/jobs/CS-test is hosted on storage that is shared between the CryoSPARC master and all worker hosts, so worker hosts should “see” the J1/ directory immediately after it has been created on the master host. It seems to me that, on the worker node, cryosparcw run was unable to create or write to the job.log file, perhaps because the storage was not shared as assumed.
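
One way to confirm this, using the account, hostname, and path from your log, is to check from the master whether the worker sees the job directory at the same absolute path:

# run on the master host
ssh cryosparc_user@cryosparc_worker "ls -ld /home/cryosparc_user/jobs/CS-test/J1"
# 'No such file or directory' would mean the worker does not share the master's view of that path
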
If /home/cryosparc_user/jobs/CS-test/ is a local directory on the master host, you may want to

  • configure the master host as an nfs server
  • export /home/cryosparc_user/jobs/CS-test/ or one of its parent directories to the worker hosts as an nfs share
  • on the worker nodes, mount the nfs share at a suitable mount point such that the absolute path
    /home/cryosparc_user/jobs/CS-test/ is preserved (a minimal sketch follows the list)
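
A minimal sketch of those steps, assuming Ubuntu-style packaging and reusing the hostnames and path from the log above; the export options are illustrative rather than a recommendation.

# on the master host (acting as the nfs server): install the server and export the directory
sudo apt install nfs-kernel-server
echo "/home/cryosparc_user/jobs  cryosparc_worker(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra
# on each worker host (nfs client): mount the share at the same absolute path
sudo apt install nfs-common
sudo mkdir -p /home/cryosparc_user/jobs
sudo mount -t nfs cryosparc_master:/home/cryosparc_user/jobs /home/cryosparc_user/jobs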

Perhaps more commonly for larger storage configurations, the project directory would be hosted and served by a dedicated nfs server and mounted by master and worker hosts alike.
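
In that layout, the master and every worker would carry an identical mount entry, e.g. a line like the following in /etc/fstab on each host; the server name and export path are placeholders.

# dedicated nfs server exporting the project storage, mounted at the same path on every host
nfs-server.example.org:/export/cryosparc_projects  /home/cryosparc_user/jobs  nfs  defaults,_netdev  0  0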

Thank you Wolfram, your input helped me get everything up and running properly! Seems like I entirely missed checking the NFS server, but fixing the configuration there and making sure the path was preserved instantly fixed the problem!
