Jobs not starting on worker node(s)

clil16 · September 4, 2019, 10:08pm

Hello,

I have set up the cryoSPARC master on a single node and installed cryoSPARC workers on two other nodes that contain GPUs and SSD for caching. However, no jobs actually run on the worker nodes. A 2D classification job is scheduled to run on one of the worker nodes, but it never starts. The worker nodes appear correctly in the resource manager on the master node, but as far as I can tell the workers have no logging to speak of, so I would like to know how I should begin troubleshooting the problem.

Also, the cryosparcw script needs some serious TLC: it has no help output, so I have to open the script and read through it to see if there is something useful.

clil16 · September 5, 2019, 9:06pm

It turns out the problem was the location where the log file was being created. The user running the web app has its home directory in /var/lib, which is not a shared file system (since it is a service account). When the user created the project or job, the destination for at least the log file was set to /var/lib/username/P1/J9/job.log, and that directory structure did not exist on the worker node. The job threw an error, but no messages were logged by the master. I think this sort of error should be logged to aid in troubleshooting.
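
A minimal sketch of a check for this failure mode, meant to be run on a worker node as the account that runs cryoSPARC. The function name `can_create_job_log` is made up for illustration and is not part of cryoSPARC.

```shell
#!/bin/sh
# Hypothetical helper, not part of cryoSPARC: report whether job.log
# could be created inside the given job directory on this node.
can_create_job_log() {
    job_dir=$1
    if [ ! -d "$job_dir" ]; then
        echo "missing: $job_dir does not exist on $(uname -n)"
        return 1
    fi
    if [ ! -w "$job_dir" ]; then
        echo "read-only: $job_dir is not writable by $(id -un)"
        return 1
    fi
    echo "ok: $job_dir/job.log can be created"
}
```

For the case described here, the call would be `can_create_job_log /var/lib/username/P1/J9`, which reports the directory as missing on any node where that path does not exist.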

As a side note, is there a way to set the default directory the web app starts the file browser in to something other than $HOME?

stephan · September 23, 2019, 3:30pm

Hi @clil16,

Is there a way to set the default directory the web app starts the file browser in to something other than $HOME?

No, it always starts at $HOME, but it saves the last viewed folder for each user.

McSparkFace · September 1, 2022, 6:27pm

What is the solution to this problem? I’m running into the same issue. When I clone a job that someone else created (and that ran fine on the master node) and submit it to a worker node, I get a “<user home/projectname>/job.log: No such file or directory” line in the command_core log. The job just hangs and does nothing.

Looking in that folder, it clearly creates other files. There is an events.bson, gridfs_data, and job.json file. But there is no job.log file.

McSparkFace · September 2, 2022, 6:44pm

I’ve also moved the data to a shared folder on the master node. I still get the same error when submitting a job to a worker node.

wtempel · September 6, 2022, 8:59pm

Please log on to the worker node as the Linux user that “owns” and runs the cryoSPARC instance and:

1. cd into the job’s directory
2. post the output of
   hostname -f && pwd && ls -l
3. run
   touch empty_testfile

Then, log on to the master node and:

1. cd into the job’s directory
2. post the output of
   hostname -f && pwd && ls -l
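
The touch test above can be wrapped in two small functions so that it is easy to repeat. This is only a sketch: `leave_marker`, `find_marker`, and the marker file name `shared_testfile` are hypothetical names chosen for illustration. Unlike the empty file created by `touch`, the marker records which host wrote it.

```shell
#!/bin/sh
# Hypothetical helpers for the visibility check described above.
MARKER=shared_testfile

# Run on the worker node: record which host wrote the marker.
leave_marker() {
    uname -n > "$1/$MARKER"
}

# Run on the master node: report whether the worker's marker is visible.
find_marker() {
    if [ -f "$1/$MARKER" ]; then
        echo "marker written by $(cat "$1/$MARKER")"
    else
        echo "no marker in $1"
        return 1
    fi
}
```

If `find_marker` on the master reports no marker after `leave_marker` succeeded on the worker for the same path, the two nodes are not looking at the same directory.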

clil16 · September 8, 2022, 4:16pm

In my case, I solved the problem by having the users create projects in the shared NFS file system that all the nodes mount.
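
One way to confirm that a project directory lives on the shared mount rather than on a node-local disk is to ask `df` which filesystem backs it. A sketch, with `backing_filesystem` as a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the device or export that backs a path.
# For an NFS mount this is typically of the form server:/export.
backing_filesystem() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}
```

Calling it with the project directory on each node shows where that path is served from. A local device on a node that is not itself the file server would suggest the path is not shared there.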

McSparkFace · September 8, 2022, 9:02pm

@clil16 Thank you! That is exactly what I’m planning to do for our setup soon: have a central folder on a NAS that all the nodes can access.

@wtempel

On the worker node, it appears there is no job directory. Perhaps our setup is not done properly? Here’s our current process:

1. We’ve moved the data to a shared folder in the root directory with 777 permissions.
2. We create the job on the cryoSPARC web page running on the master node, and it creates a proper work directory for the job. We can run an import job on the master node without problems.
3. If we create a new job (patch motion correction) using the output of the previous import and run it on a specific GPU on the master node, it works fine.
4. If we clone the same job from #3 and assign it to a specific GPU on a worker node, the job just hangs. Would a folder for this job be created on the worker node? Where does it create the job folder? Is the problem that it’s looking for a job folder in a shared directory?

Thank you all for the help on this. Much appreciated!

wtempel · September 8, 2022, 9:59pm

The job directory is created before the job is assigned to a specific worker node. That directory is assumed to already exist on the worker node (at a path shared between master and worker nodes) when job execution begins. This assumption is trivially true when master and worker are the same host, but requires appropriately configured file sharing when master and worker are different hosts for a given job.
The symptoms you describe suggest that some cryoSPARC prerequisite, such as

[…] all nodes (including the master) be able to access the same shared file system(s) at the same absolute path.

is not met.
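
A rough way to test that prerequisite from one machine is to check the same absolute path on every node over SSH. The sketch below assumes key-based SSH access to each node and a path without single quotes; `run_on_host` and `check_shared_path` are hypothetical names, not cryoSPARC commands.

```shell
#!/bin/sh
# Run a command on a remote host. Assumes key-based SSH access;
# BatchMode makes ssh fail instead of prompting for a password.
run_on_host() {
    ssh -o BatchMode=yes "$1" "$2"
}

# Hypothetical helper: check that a directory exists and is writable
# at the same absolute path on every listed host.
# Assumes the path contains no single quotes.
check_shared_path() {
    dir=$1
    shift
    rc=0
    for host in "$@"; do
        if run_on_host "$host" "test -d '$dir' && test -w '$dir'"; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            rc=1
        fi
    done
    return "$rc"
}
```

It would be called with the job directory followed by the master and worker host names. Any host reported as FAILED either cannot see the directory at that path, cannot write to it, or could not be reached over SSH.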

A side note:

Please ensure your setup is secure. CryoSPARC project directories need not be world-writeable (suggestions).
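
As an illustration only, the following sketch detects a world-writable directory and removes write permission for "other". The helper names are hypothetical, and whether owner and group access is sufficient depends on which accounts need to write to the project directories at your site.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the directory itself is world-writable.
is_world_writable() {
    [ -n "$(find "$1" -prune -perm -0002)" ]
}

# Remove write permission for "other" on a directory tree.
drop_world_write() {
    chmod -R o-w "$1"
}
```

A directory created with 777 permissions would pass `is_world_writable` before `drop_world_write` is applied and fail it afterwards.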