Exit status 255 error

Bruk · September 11, 2018, 1:50am

Hi,
I keep getting the following error when submitting 2D classification jobs:

Command '['ssh', u'dgl@dgl', 'nohup', u'/home/dgl/cryosparc2_worker/bin/cryosparcw run --project P1 --job J7 --master_hostname dgl-Precision-7920-Tower --master_command_core_port 39002 > /home/dgl/cryosparc2_projects/PhoPQ K19R SMA/P1/J7/job.log 2>&1 & ']' returned non-zero exit status 255

and the job overview states:

Launching job on lane default target dgl …
License is valid.
Running job on remote worker node hostname dgl
Failed to launch! 255
ssh: Could not resolve hostname dgl: Name or service not known

Here are the configurations:

CRYOSPARC CONNECT --------------------------------------------

Attempting to register worker dgl@localhost to command dgl@localhost:39002
Connecting as unix user dgl
Will register using ssh string: dgl@dgl-Precision-7920-Tower
If this is incorrect, you should re-run this command with the flag --sshstr

Connected to master.

Current connected workers:
dgl
dgl@dgl-Precision-7920-Tower

Autodetecting available GPUs…
Detected 1 CUDA devices.

id pci-bus name

   0      0000:73:00.0  Quadro P4000

All devices will be enabled now.
This can be changed later using --update

Worker will be registered without SSD.

Autodetecting the amount of RAM available…
This machine has 64.02GB RAM .

Registering worker…
Done.

You can now launch jobs on the master node and they will be scheduled
on to this worker node if resource requirements are met.

Final configuration for dgl@localhost
lane : default
name : dgl@localhost
title : Worker node dgl@localhost
resource_slots : {u'GPU': [0], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
hostname : dgl@localhost
worker_bin_path : /home/dgl/cryosparc2_worker/bin/cryosparcw
cache_path : None
cache_quota_mb : None
resource_fixed : {u'SSD': False}
cache_reserve_mb : 10000
type : node
ssh_str : dgl@dgl-Precision-7920-Tower
desc : None

apunjani · September 11, 2018, 1:13pm

Hi @Bruk,

It looks like you have multiple workers registered in the default lane:

Current connected workers:
dgl
dgl@dgl-Precision-7920-Tower

Although your dgl-Precision-7920-Tower worker is registered correctly, the other registered worker (dgl) is misconfigured. When you launch a job, the scheduler tries to run it on the misconfigured worker and fails.
Try creating a new lane, assigning only the correct worker to it, and then queueing a job to that lane:
cryosparcw connect --master <master_hostname> --worker <worker_hostname> --update --newlane --lane "dgl_lane"
After this, the UI will show a second lane, in addition to default, on which you can queue jobs.
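
Before re-registering, it can help to confirm which of the registered ssh strings actually resolve from the master, since the job log reports "ssh: Could not resolve hostname dgl". The following is a minimal sketch, assuming getent is available on the master; host_of, can_resolve, and check_workers are hypothetical helper names and not part of cryoSPARC:

```shell
#!/bin/sh
# Hypothetical helpers (not part of cryoSPARC).

# Strip an optional "user@" prefix from an ssh string,
# e.g. dgl@dgl-Precision-7920-Tower -> dgl-Precision-7920-Tower.
host_of() {
    printf '%s\n' "${1##*@}"
}

# Succeed if the host name resolves through the system resolver
# (/etc/hosts, DNS, and so on). Assumes getent is installed.
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1
}

# For each ssh string given, report whether its host part resolves.
check_workers() {
    for target in "$@"; do
        host=$(host_of "$target")
        if can_resolve "$host"; then
            echo "$target: $host resolves"
        else
            echo "$target: $host does NOT resolve"
        fi
    done
}

# Example, using the two workers listed above:
#   check_workers dgl dgl@dgl-Precision-7920-Tower
```

Note that getent does not see Host aliases defined in ~/.ssh/config, so a name that fails here could still work for ssh if such an alias exists.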

Bruk · September 11, 2018, 6:53pm

I don't have a cryosparcw command available, just cryosparcm. What commands can I use to remove workers and create a new lane with cryosparcm?
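
For the worker-removal part of this question, one possibility is the cryosparcm cli interface. The sketch below only prints the command so it can be reviewed before running. It assumes that the installed cryoSPARC version exposes a remove_scheduler_target_node function through cryosparcm cli, which should be verified against the documentation for that version; remove_worker_cmd is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the cryosparcm command that would
# remove one registered worker from the scheduler.
# ASSUMPTION: this cryoSPARC version provides remove_scheduler_target_node
# through `cryosparcm cli`; check the documentation for your version first.
remove_worker_cmd() {
    printf "cryosparcm cli \"remove_scheduler_target_node('%s')\"\n" "$1"
}

# Example: show the command for the stale worker named "dgl".
#   remove_worker_cmd dgl
```

The argument is the worker's registered hostname as shown under "Current connected workers". Printing the command instead of executing it is a deliberate choice here, so that nothing is removed by accident.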

stephan · September 12, 2018, 2:29pm

Hi @Bruk,

The cryosparcw command is available on the node that hosts the cryosparc2_worker files.

cryosparc2_worker/bin/cryosparcw
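
If the shell cannot find cryosparcw, it can be called by its full path, or its bin directory can be added to PATH for the current session. A minimal sketch, assuming the install directory /home/dgl/cryosparc2_worker seen in the logs above; find_cryosparcw is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the full path to cryosparcw inside a worker
# install directory, or fail if no executable is found there.
find_cryosparcw() {
    candidate="$1/bin/cryosparcw"
    if [ -x "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        echo "no executable cryosparcw under $1/bin" >&2
        return 1
    fi
}

# Example (install directory taken from the logs in this thread):
#   CRYOSPARCW=$(find_cryosparcw /home/dgl/cryosparc2_worker)
#   "$CRYOSPARCW" connect --master <master_hostname> --worker <worker_hostname> --update --newlane --lane "dgl_lane"
#
# Alternatively, make the command available in the current shell session:
#   export PATH="/home/dgl/cryosparc2_worker/bin:$PATH"
```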

Bruk · September 12, 2018, 5:06pm

Okay, the following command

cd /home/dgl/cryosparc2_worker
bin/cryosparcw connect --worker localhost --master localhost --port 39000 --ssdpath /scratch/cryosparc_cache

gives the following output:

CRYOSPARC CONNECT --------------------------------------------

Attempting to register worker localhost to command localhost:39002
Connecting as unix user dgl
Will register using ssh string: dgl@localhost
If this is incorrect, you should re-run this command with the flag --sshstr

Connected to master.

Current connected workers:
dgl
dgl@dgl-Precision-7920-Tower
dgl@localhost
dgl@169.230.158.117

Autodetecting available GPUs…
Detected 1 CUDA devices.

id pci-bus name

   0      0000:73:00.0  Quadro P4000

All devices will be enabled now.
This can be changed later using --update

Traceback (most recent call last):
  File "bin/connect.py", line 197, in <module>
    cache_path = check_ssd_path()
  File "bin/connect.py", line 88, in check_ssd_path
    assert os.path.isdir(cache_path_expand), "Path %s does not exist." % args.ssdpath
AssertionError: Path /scratch/cryosparc_cache does not exist.

I see that there are multiple connected workers that I would like to disconnect. I also get a cache error, even though /home/dgl/scratch/cryosparc_cache exists. Can you let me know how to disconnect the workers and fix the cache issue?

Also, trying
cryosparcw connect --master localhost --worker localhost --update --newlane --lane "dgl_lane"
returns a "cryosparcw: command not found" error.
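
The AssertionError in the traceback above comes from a plain directory-existence test on the value passed to --ssdpath. Note that /scratch/cryosparc_cache and /home/dgl/scratch/cryosparc_cache are different absolute paths, so the existence of the second does not satisfy a check on the first. A minimal sketch of the same check in shell; check_cache_dir is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper mirroring the os.path.isdir() assertion in the
# traceback: succeed only if the given path is an existing directory.
check_cache_dir() {
    if [ -d "$1" ]; then
        echo "OK: $1 is a directory"
    else
        echo "Path $1 does not exist." >&2
        return 1
    fi
}

# Example, with the two paths mentioned in this thread:
#   check_cache_dir /scratch/cryosparc_cache
#   check_cache_dir /home/dgl/scratch/cryosparc_cache
```

If only the second call succeeds, passing that path to --ssdpath, or creating /scratch/cryosparc_cache, should get past the assertion.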

Bruk · September 17, 2018, 2:39am

Never mind, I uninstalled and reinstalled everything.
