Job launch error

martin · May 16, 2023, 12:44pm

Hi,

We’re having an odd problem launching jobs on one of our workers after a recent OS update. We have a single-master, multiple-worker setup, and all machines are running CryoSPARC 4.2.1 on Ubuntu 22.04. When we try to launch a job on one specific machine, it hangs at the following stage:

License is valid.
Launching job on lane XXXXX target XXXXX.XXX.XXX
Running job on remote worker node hostname XXXXX.XXX.XXX

There is no further output. Looking at the metadata log, I see the following top-level error:

cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://XXXXXX:39002/api) URL Error [Errno -3] Temporary failure in name resolution

Running “cryosparcm log command_core” reveals nothing unusual. SSH connections from the master to the worker work fine with both short and fully-specified addresses.

All other workers are configured in exactly the same way, yet jobs launch fine on them. Any help appreciated!

wtempel · May 16, 2023, 1:56pm

Please can you run these commands on the “failing” worker and on a “working” … worker

host XXXXXX
curl XXXXXX:39002

and compare their outputs between the workers?
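If it helps to script the comparison, here is a rough Python sketch of the same two checks (name lookup, then a TCP connection to the port). The function names resolve, port_open and report are made up for illustration and are not part of CryoSPARC; 39002 is just the port from the error message above.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses for hostname, or None if the lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Same class of failure as "Temporary failure in name resolution"
        return None
    return sorted({info[4][0] for info in infos})


def port_open(hostname, port, timeout=3.0):
    """Return True if a TCP connection to hostname:port succeeds within timeout."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers lookup failures, refused connections and timeouts
        return False


def report(hostname, port=39002):
    """Print both results so the output can be compared between workers."""
    print("addresses:", resolve(hostname))
    print("port %d reachable:" % port, port_open(hostname, port))
```

Calling report("XXXXXX") on each worker, with the master hostname exactly as it appears in the error, should show whether the two machines resolve the name differently.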

martin · May 16, 2023, 3:21pm

Hi,
Yes - these give different results. On the failing worker I get a “host not found” error and the curl command does not return a result. With a little further digging, it appears that the failing machine has picked up a different (incorrect) search domain. These are supposed to be set automatically by DNS, but for some reason the setting is not consistent between workers. If I manually add the full search domain, the problem is fixed.
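For anyone who wants to compare the search domains between workers, a minimal Python sketch follows. It assumes the resolver settings are visible in /etc/resolv.conf, which may not hold on every setup, and search_domains is a name invented for this example.

```python
from pathlib import Path


def search_domains(resolv_conf_text):
    """Return the search list from resolv.conf-style text (the last 'search' line wins)."""
    domains = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines, which start with '#' or ';'
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if fields[0] == "search":
            domains = fields[1:]
    return domains


if __name__ == "__main__":
    path = Path("/etc/resolv.conf")  # assumed location of the resolver settings
    if path.exists():
        print(search_domains(path.read_text()))
```

Running this on a working and a failing worker and comparing the printed lists should show the mismatch directly.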
Thanks for the help!