Jobs not starting on worker node(s)


(Christopher Lilienthal) #1


I have set up the cryoSPARC master on a single node and installed cryoSPARC workers on two other nodes that contain GPUs and SSDs for caching. However, no jobs are actually running on the worker nodes. I see that a 2D classification job is scheduled to run on one of the worker nodes, but it never starts. The worker nodes show up correctly in the resource manager on the master node, but as far as I can tell the workers have no logging to speak of, so I would like to know how I should begin troubleshooting the problem.

Also, the cryosparcw script needs some serious TLC: it has no help output, so I have to open the script and read through it to see whether there is anything useful.

(Christopher Lilienthal) #2

Turns out this was an issue with where the log file was being created. The user running the webapp has its home directory in /var/lib, which is not a shared file system (since it is a service account). When the user created the project or job, the destination for at least the log file was set to /var/lib/username/P1/J9/job.log, and that directory structure did not exist on the worker node. The job threw an error, but no messages were logged by the master. I think this sort of error should probably be logged to aid in troubleshooting.
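For anyone hitting the same thing, here is a minimal sketch of the check I would run on each worker node before queuing jobs. It is not part of cryoSPARC; the function name `check_job_dir` is made up for illustration, and it only tests whether a given directory exists and is writable by the current user on that node.

```python
import os
import sys
import tempfile


def check_job_dir(path):
    """Return (ok, reason) for a directory that a job needs to write into.

    Hypothetical helper, not a cryoSPARC API: it only checks that `path`
    exists on this node and that the current user can create a file in it.
    """
    if not os.path.isdir(path):
        return False, f"missing directory: {path}"
    try:
        # Create and immediately remove a scratch file to prove write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"cannot write to {path}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    # Pass one or more directories on the command line.
    for candidate in sys.argv[1:]:
        ok, reason = check_job_dir(candidate)
        print(f"{candidate}: {'OK' if ok else 'FAIL'} ({reason})")
```

Saved under any file name and run on a worker as the account that executes the jobs, with the job directory as its argument (in my case /var/lib/username/P1/J9), this would have reported the missing directory right away.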

As a side note, is there a way to set the default directory that the web app's file browser starts in to something other than $HOME?