Hi,
I keep getting the following error when submitting 2D classification jobs:
Command '['ssh', u'dgl@dgl', 'nohup', u'/home/dgl/cryosparc2_worker/bin/cryosparcw run --project P1 --job J7 --master_hostname dgl-Precision-7920-Tower --master_command_core_port 39002 > /home/dgl/cryosparc2_projects/PhoPQ K19R SMA/P1/J7/job.log 2>&1 & ']' returned non-zero exit status 255
and the job overview states:
Launching job on lane default target dgl …
License is valid.
Running job on remote worker node hostname dgl
Failed to launch! 255
ssh: Could not resolve hostname dgl: Name or service not known
Attempting to register worker dgl@localhost to command dgl@localhost:39002
Connecting as unix user dgl
Will register using ssh string: dgl@dgl-Precision-7920-Tower
If this is incorrect, you should re-run this command with the flag --sshstr
Connected to master.
Current connected workers:
dgl
dgl@dgl-Precision-7920-Tower
Autodetecting available GPUs…
Detected 1 CUDA devices.
id pci-bus name
0 0000:73:00.0 Quadro P4000
All devices will be enabled now.
This can be changed later using --update
Worker will be registered without SSD.
Autodetecting the amount of RAM available…
This machine has 64.02GB RAM .
Registering worker…
Done.
You can now launch jobs on the master node and they will be scheduled
on to this worker node if resource requirements are met.
It looks like you have multiple workers registered in the default lane:
Current connected workers:
dgl
dgl@dgl-Precision-7920-Tower
Although your dgl-Precision-7920-Tower worker is connected correctly, the other registered worker is misconfigured. When you launch a job, the scheduler tries to run it on that misconfigured worker and fails.
Try creating a new lane, assigning only the correct worker to it, and then queuing a job to that lane:
cryosparcw connect --master <master_hostname> --worker <worker_hostname> --update --newlane --lane "dgl_lane"
After this, you’ll see a second lane in the UI, in addition to default, on which you can queue jobs.
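To see which of the registered ssh strings can actually be reached by name, a quick check like the one below may help. It is only a sketch: the worker list is copied from the output above, the helper names host_of and resolves are made up for illustration, and it only tests name resolution, not whether ssh login succeeds.

```python
import socket


def host_of(ssh_string):
    """Return the host part of a 'user@host' string (or the string itself)."""
    return ssh_string.rsplit("@", 1)[-1]


def resolves(hostname):
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


if __name__ == "__main__":
    # Worker entries as listed under "Current connected workers" above.
    workers = ["dgl", "dgl@dgl-Precision-7920-Tower"]
    for worker in workers:
        status = "resolves" if resolves(host_of(worker)) else "does NOT resolve"
        print(worker, "->", host_of(worker), status)
```

An entry that does not resolve here would be expected to fail in the same way as the "Could not resolve hostname dgl" message in the job overview.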
Attempting to register worker localhost to command localhost:39002
Connecting as unix user dgl
Will register using ssh string: dgl@localhost
If this is incorrect, you should re-run this command with the flag --sshstr
Connected to master.
Current connected workers:
dgl
dgl@dgl-Precision-7920-Tower
dgl@localhost dgl@169.230.158.117
Autodetecting available GPUs…
Detected 1 CUDA devices.
id pci-bus name
0 0000:73:00.0 Quadro P4000
All devices will be enabled now.
This can be changed later using --update
Traceback (most recent call last):
  File "bin/connect.py", line 197, in <module>
    cache_path = check_ssd_path()
  File "bin/connect.py", line 88, in check_ssd_path
    assert os.path.isdir(cache_path_expand), "Path %s does not exist." % args.ssdpath
AssertionError: Path /scratch/cryosparc_cache does not exist.
I see that there are multiple connected workers, and I would like to disconnect the extra ones. I also get a cache error, even though /home/dgl/scratch/cryosparc_cache exists. Can you let me know how to disconnect the workers and fix the cache issue?
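Note that the path in the error (/scratch/cryosparc_cache) is not the same as the directory that exists (/home/dgl/scratch/cryosparc_cache). The traceback shows the check is an os.path.isdir call on an expanded path, so something like the sketch below should reproduce it for both candidates. How connect.py expands the path is not shown in the traceback, so the expanduser/expandvars step and the function name cache_path_ok are assumptions.

```python
import os


def cache_path_ok(path):
    """Mimic the directory check from the traceback.

    Assumption: the path is expanded for '~' and environment variables
    before os.path.isdir is applied; the real script may differ.
    """
    expanded = os.path.expanduser(os.path.expandvars(path))
    return os.path.isdir(expanded)


if __name__ == "__main__":
    # The first path is the one in the AssertionError, the second is the
    # directory reported to exist.
    for candidate in ["/scratch/cryosparc_cache",
                      "/home/dgl/scratch/cryosparc_cache"]:
        print(candidate, "->", "ok" if cache_path_ok(candidate) else "missing")
```

If only the second path reports ok, the SSD path passed to the connect command is probably missing the /home/dgl prefix.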
Running cryosparcw connect --master localhost --worker localhost --update --newlane --lane "dgl_lane"
returns the error "cryosparcw: command not found".
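"command not found" usually means the shell cannot find the executable on PATH. The launch command in the first post refers to /home/dgl/cryosparc2_worker/bin/cryosparcw, so a check along these lines may show whether the command needs to be called by its full path. This is a sketch: find_command is a made-up helper, and the directory is taken from that launch command rather than verified.

```python
import os
import shutil


def find_command(name, extra_dir):
    """Return a usable path for name, or None.

    Looks on PATH first, then in extra_dir (for example the worker's
    bin directory).
    """
    on_path = shutil.which(name)
    if on_path:
        return on_path
    candidate = os.path.join(extra_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    # Directory taken from the launch command quoted in the first post.
    found = find_command("cryosparcw", "/home/dgl/cryosparc2_worker/bin")
    print(found if found else "cryosparcw not found on PATH or in the worker bin directory")
```

If the result is the full path rather than a PATH entry, running the connect command with that full path (or from inside that bin directory as ./cryosparcw) would be the thing to try.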