Jobs stuck in launch state

abasle · May 15, 2024, 8:26pm

Hello,

Like many others, I have jobs stuck in launch state. I read a few posts but did not find anything that helped.
We have one master with GPUs and 5 workers with GPUs.
Some projects work fine on any worker.

Some projects/jobs on one worker are stuck in launch.

Copying the command from cryosparcm log command_core and running it on the master works fine, and we can see the job progressing as normal.

It seems that whatever triggers the transition from launch state to started/running does not happen.

What can I check to see what is not working?

NFS shares seem to be correct, and a touch test works as the cryosparc user.
The cryosparc user on the master can ssh to all workers without a password.

Cheers,
Arnaud

wtempel · May 15, 2024, 9:44pm

Did you perform this test on all workers, using the exact project directory path that can be seen on the web app?

Did you confirm for each worker that, on a connection attempt from the master, the “cryosparc user” would not be prompted to confirm the identity of the worker host (due to a potential mismatch or missing record in ~/.ssh/known_hosts)?
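
A minimal sketch of such a check, assuming bash and OpenSSH on the master. The function name check_ssh, the SSH_CMD override, and the hostnames in the usage note are hypothetical. BatchMode=yes makes ssh fail instead of prompting for a password, and StrictHostKeyChecking=yes makes it fail instead of asking to confirm an unknown or changed host key, so any prompt shows up as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if ssh to the host works without any prompt.
# SSH_CMD is an assumption added so the client can be swapped out; it defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

check_ssh() {
    local host="$1"
    # BatchMode=yes: never prompt for passwords or passphrases.
    # StrictHostKeyChecking=yes: refuse hosts that are missing from, or do not
    # match, the records in known_hosts instead of asking for confirmation.
    if "$SSH_CMD" -o BatchMode=yes -o StrictHostKeyChecking=yes \
            -o ConnectTimeout=5 "$host" true; then
        echo "OK   $host"
    else
        echo "FAIL $host"
        return 1
    fi
}
```

Run as the cryosparc user on the master, this could be looped over the workers, for example: for h in worker1 worker2; do check_ssh "$h"; done (with the placeholder hostnames replaced by the real ones).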

abasle · May 16, 2024, 6:06am

@wtempel Thanks. To be sure, I deleted known_hosts, then ran ssh from the master as the cryosparc user to each worker and answered yes to each connection prompt to recreate known_hosts. This made no difference, so it does not seem to be the cause.

abasle · May 16, 2024, 6:17am

@wtempel I connected to each worker as the cryosparc user and ran “touch test” in the project folder on each.
I can confirm the cryosparc project folder is correctly mounted/shared on each worker and the cryosparc user has permissions rwx (I can touch, rm and ls).
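
For anyone repeating this test, here is a minimal sketch of the touch/ls/rm check as a shell function, assuming bash. The function name check_rwx and the probe file name are hypothetical. It only verifies that the directory exists and that a file can be created, listed, and removed by whoever runs it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create, list, and remove a probe file in the given directory.
check_rwx() {
    local dir="$1"
    # Probe name includes the shell PID to avoid clobbering an existing file.
    local probe="$dir/.cs_write_probe_$$"
    if [ ! -d "$dir" ]; then
        echo "MISSING $dir"
        return 1
    fi
    if touch "$probe" 2>/dev/null \
            && ls "$probe" >/dev/null 2>&1 \
            && rm "$probe" 2>/dev/null; then
        echo "OK $dir"
    else
        echo "NOT WRITABLE $dir"
        return 1
    fi
}
```

It would need to be run on each worker as the cryosparc user, with the exact project directory path shown in the web app as its argument.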

abasle · May 16, 2024, 6:25am

Is there any way to get a detailed log covering job launch and the steps that follow?

I’m puzzled that a job submitted from the web app gets stuck in launch, yet running the command copied from the log in a terminal on the master makes the job progress as expected.

wtempel · May 16, 2024, 2:01pm

@abasle Please can you post the output of the command

cryosparcm cli "get_scheduler_targets()"

(on the CryoSPARC master).
What are the outputs of the command

cat /etc/*release

for the master and worker nodes?

abasle · May 20, 2024, 8:03am

@wtempel Many thanks. I managed to clean up my mess.

For those with similar issues:

For housekeeping reasons, I had created a symlink for one of the projects, and that did not work properly.
Additionally, some NFS shares were not working. We have 6 nodes all sharing one or two locations with each other, so it is easy to miss something.
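
A minimal sketch for spotting the symlink case, assuming bash and a readlink that supports -f (as GNU coreutils does). The function name check_symlink is hypothetical. It only reports whether a path resolves to itself on the node where it runs; comparing the result across the master and all workers is left to the reader.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a path (given without a trailing slash)
# is reached through a symlink, and where it resolves on this node.
check_symlink() {
    local path="$1"
    local resolved
    # readlink -f follows every symlink in the path and prints the canonical name.
    resolved="$(readlink -f "$path")" || { echo "UNRESOLVABLE $path"; return 1; }
    if [ "$resolved" != "$path" ]; then
        echo "SYMLINK $path -> $resolved"
        return 1
    fi
    echo "REAL $path"
}
```

Running it with the project directory path on every node shows quickly whether any of them sees a different target, or a symlink where the others see a real directory.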

Cheers,
Arnaud