Jobs not starting on worker node(s)


I have set up the cryoSPARC master on a single node and installed the cryoSPARC worker on two other nodes that have GPUs and an SSD for caching. However, no jobs actually run on the worker nodes: I can see a 2D classification job scheduled on one of the worker nodes, but it never starts. The worker nodes appear correctly in the resource manager on the master node, but as far as I can tell the workers have no logging to speak of, so I would like to know how I should begin troubleshooting the problem.

Also, the cryosparcw script needs some serious TLC: it has no help output, so I have to open the script and read through it to see if there is anything useful.

Turns out this was an issue with where the log file was being created. The user running the webapp has its home directory in /var/lib, which is not on a shared file system (since it is a service account). When the user created the project or job, the destination for at least the log file was set to /var/lib/username/P1/J9/job.log, and that directory structure did not exist on the worker node. The job threw an error, but no messages were logged by the master. I think this sort of error should be logged to aid in troubleshooting.
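In case anyone else hits this, here is a quick check for the same symptom (a sketch; the path is the example from this post, so substitute your own job directory):

```shell
# Run on the worker node: check whether the job directory the master
# scheduled actually exists there at the same absolute path.
# /var/lib/username/P1/J9 is the example path from this post.
JOBDIR=/var/lib/username/P1/J9
if [ -d "$JOBDIR" ]; then
    echo "job directory present: $JOBDIR"
else
    echo "job directory missing on this node: $JOBDIR"
fi
```

If the directory is missing, the job will fail on the worker before it can even open its log file.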

As a side note: is there a way to set the default directory the web app starts the file browser in to something other than $HOME?

Hi @clil16,

Is there a way to set the default directory the web app starts the file browser in to something other than $HOME?

No, it always starts at $HOME, but it saves the last viewed folder for each user.

What is the solution to this problem? I’m running into the same issue. When I clone a job that someone else created (and ran fine on the master node) and submit it to a worker node, I get a “<user home/projectname>/job.log: No such file or directory” line in the command_core log. The job just hangs and does nothing.

Looking in that folder, the job clearly creates other files: there are events.bson, gridfs_data, and job.json entries. But there is no job.log file.

I’ve also moved the data to a shared folder on the master node. I still get the same error when submitting a job to a worker node.

Please log on to the worker node as the Linux user that “owns” and runs the cryoSPARC instance and:

  1. cd into the job’s directory
  2. post the output of
    hostname -f && pwd && ls -l
  3. run
    touch empty_testfile

Then, log on to the master node and:

  1. cd into the job’s directory
  2. post the output of
    hostname -f && pwd && ls -l
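These checks can be collected into one small snippet, run inside the job directory on each node in turn, so the two outputs are easy to compare side by side (a sketch, nothing cryoSPARC-specific):

```shell
# Run inside the job's directory on the master, then on the worker,
# and compare the outputs line by line.
hostname -f
pwd
ls -l
# A successful touch confirms the directory is writable from this node.
touch empty_testfile && echo "writable from $(hostname -f)"
```

If pwd or ls -l differ between the nodes, or the touch fails on one of them, the shared-path assumption is broken.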

In my case, I solved the problem by having the users create projects in the shared NFS file system that all the nodes mount.
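One quick way to confirm that the project area really is on the shared NFS mount on every node is to check the filesystem type backing it (a sketch; PROJDIR is a placeholder for your real project directory):

```shell
# Print the filesystem type and mount point backing a directory.
# On every node, the shared project area should report "nfs" or "nfs4".
PROJDIR="${PROJDIR:-/tmp}"   # placeholder; substitute your project directory
df -PT "$PROJDIR" | awk 'NR==2 {print $2, $7}'
```

A node that reports a local filesystem type (ext4, xfs, …) for that path is not actually using the shared mount.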


@clil16 Thank you! That is exactly what I’m planning to do for our setup soon: a central folder on a NAS that can be accessed by all the nodes.


On the worker node, it appears there is no job directory. Perhaps our setup is not done properly? Here’s our current process:

  1. We’ve moved the data to a shared folder in the root directory with 777 permissions.
  2. We create the job on the cryosparc webpage running on the master node and it creates a proper work directory for the job. We can run an import job fine on the master node and it runs.
  3. If we create a new job (patch motion correction) using the output of the previous import and run it on a specific GPU on the master node, it works fine.
  4. If we clone the same job from #3 and assign it to a specific GPU on a worker node, the job just hangs. Would a folder for this job be created on the worker node? Where does it create the job folder? Is the problem that it’s looking for a job folder in a shared directory?

Thank you all for the help on this. Much appreciated!

The job directory is created before the job is assigned to a specific worker node. That directory is assumed to already exist on the worker node (at a path shared between master and worker nodes) when job execution begins. This assumption is trivially true when master and worker are the same host, but requires appropriately configured file sharing when the master and worker hosts for a given job differ.
The symptoms you describe suggest that some cryoSPARC prerequisite, such as

[…] all nodes (including the master) be able to access the same shared file system(s) at the same absolute path.

is not met.
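A minimal way to test that prerequisite is to write a marker file from the master, then read the same absolute path on each worker (a sketch; /tmp/cryosparc_fs_check stands in for your real shared project path):

```shell
# On the master: drop a marker file into the shared project area.
# /tmp/cryosparc_fs_check is a placeholder for your shared path.
CHECKDIR=/tmp/cryosparc_fs_check
mkdir -p "$CHECKDIR"
echo "written on $(hostname -f)" > "$CHECKDIR/marker"

# On each worker: the same absolute path must show the same file.
cat "$CHECKDIR/marker"
```

If any worker cannot read the marker at the identical absolute path, jobs scheduled to that worker will hang or fail exactly as described above.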

A side note:

Please ensure your setup is secure. CryoSPARC project directories need not be world-writeable (suggestions).
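For example, rather than chmod 777, a project directory can be restricted to the service account and a shared group (a sketch; the group name is a placeholder and would need to exist on all nodes):

```shell
# Owner and group get full access, the setgid bit keeps new files in the
# group, and "other" users get nothing. /tmp/example_project is a placeholder.
PROJDIR=/tmp/example_project
mkdir -p "$PROJDIR"
# chgrp cryosparc_users "$PROJDIR"   # placeholder group; uncomment once it exists
chmod 2770 "$PROJDIR"
ls -ld "$PROJDIR"
```

This keeps the directory writable by everyone who needs it on every node without exposing it to all local users.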