Failed to launch! 255 - after changing cryosparc_master_hostname and then updating

That depends on your circumstances, such as who has access to the network. In general, you want to allow desired connections and block all other connection attempts (guide).

You could append the $CRYOSPARC_MASTER_HOSTNAME to the end of that line

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 name.mc.institution.edu
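
After editing /etc/hosts, a quick optional sanity check (with your actual hostname substituted for the placeholder) would be:

getent hosts name.mc.institution.edu
# should show the hostname resolving via the 127.0.0.1 line above

If the lookup returns nothing or a different address, the hosts entry is not being used for name resolution.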

This error might indicate that the command_core service was not running. For the cryosparcw connect command to work, you additionally have to ensure that the CryoSPARC master services have been started and have not failed since startup. To check, you can run the command

cryosparcm status | grep -v LICENSE
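
If all services show as started, you can also probe the command ports directly (a sketch; the ports below assume the default base port, consistent with the 39002 command port that appears later in this thread):

curl name.mc.institution.edu:39002
# command_core should reply with a short hello message
curl name.mc.institution.edu:39003
# command_vis should reply similarly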

Thanks @wtempel - this did get me to a new stage. I added the $CRYOSPARC_MASTER_HOSTNAME as you suggested, and now when I run those curl commands with CryoSPARC running I get "hello world" responses from both. I hadn't tried with CryoSPARC running before.

I was able to get CryoSPARC to see the GPUs, but now I get an error when I try to kill jobs: they get stuck at launch, and kill gives
Unable to kill P# J#: ServerError: 'run_on_master_direct'

To get to this point, I reran the command to establish the worker, which I suspect was wrong.

./bin/cryosparcw connect --master name.mc.institution.edu --worker cname.mc.institution.edu --ssdpath /mnt/SCRATCH/cryosparc_cache/

Attempting to register worker name.mc.institution.edu to command name.mc.institution.edu:39002
Connecting as unix user user
Will register using ssh string: user@name.mc.institution.edu
If this is incorrect, you should re-run this command with the flag --sshstr
Autodetecting available GPUs…
Detected 2 CUDA devices.

id pci-bus name
0 0000:01:00.0 NVIDIA GeForce RTX 3080 Ti
1 0000:21:00.0 NVIDIA GeForce RTX 3080 Ti

All devices will be enabled now.
This can be changed later using --update
Worker will be registered with SSD cache location /mnt/scratch
Autodetecting the amount of RAM available…
This machine has xGB RAM.
Registering worker…
Done.

You can now launch jobs on the master node and they will be scheduled
on to this worker node if resource requirements are met.
Final configuration for name.mc.institution.edu
cache_path : /mnt/scratch
cache_quota_mb : None
cache_reserve_mb : 10000
desc : None
gpus : [{'id': 0, 'mem': 12631212032, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 1, 'mem': 12639338496, 'name': 'NVIDIA GeForce RTX 3080 Ti'}]
hostname : name.mc.institution.edu
lane : default
monitor_port : None
name : name.mc.institution.edu
resource_fixed : {'SSD': True}
resource_slots : {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}
ssh_str : user@name.mc.institution.edu
title : Worker node name.mc.institution.edu
type : node
worker_bin_path : /home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw

Please can you confirm that the "name": value in the output of the command

cryosparcm cli "get_scheduler_targets()"

matches $CRYOSPARC_MASTER_HOSTNAME exactly (it did not in the cryosparcw connect command you posted, but that may have been a typo).
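
If it does not match, a possible cleanup (a sketch only; 'old-name.mc.institution.edu' is a placeholder for whatever 'name' value the mismatched target actually has, and this assumes the remove_scheduler_target_node cli function is available in your CryoSPARC version) would be to remove the stale target and re-register the worker under the exact master hostname:

cryosparcm cli "remove_scheduler_target_node('old-name.mc.institution.edu')"
# re-register, using $CRYOSPARC_MASTER_HOSTNAME for both --master and --worker
/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --master name.mc.institution.edu \
    --worker name.mc.institution.edu \
    --ssdpath /mnt/SCRATCH/cryosparc_cache/

Please double-check the target's actual name and the guide before removing anything.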

Thanks, @wtempel
Yes, sorry, that was a typo; the master, worker, and $CRYOSPARC_MASTER_HOSTNAME all match exactly in that command, except that the worker title has the prefix 'Worker node':

'name': 'name.institution.edu'
'title': 'Worker node name.institution.edu'
cryosparc_master/config.sh : export CRYOSPARC_MASTER_HOSTNAME="name.institution.edu"

@cryofun What is the output now of the command

cryosparcm cli "show_scheduler_targets()"

?


Should this be "get_scheduler_targets"?

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/mnt/SCRATCH/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 12631212032, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 1, 'mem': 12639338496, 'name': 'NVIDIA GeForce RTX 3080 Ti'}], 'hostname': 'name.institution.edu', 'lane': 'default', 'monitor_port': None, 'name': 'name.institution.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'user@name.institution.edu', 'title': 'Worker node name.institution.edu', 'type': 'node', 'worker_bin_path': '/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw'}]

cryosparcm cli "show_scheduler_targets()"
Traceback (most recent call last):
  File "/home/user/software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/software/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 57, in <module>
    print(eval("cli."+command))
  File "<string>", line 1, in <module>
AttributeError: 'CommandClient' object has no attribute 'show_scheduler_targets'

Hi @wtempel, sorry to bug you, but any thoughts on the output of that command? Should I look elsewhere for a worker/master hostname mismatch? Thanks

The issue is hard to troubleshoot with certain relevant information obfuscated for privacy. Please can you send us an email with the following (non-redacted) information:

  1. output of the command
    cryosparcm cli "get_scheduler_targets()"
    
  2. the tgz file created by the command
    cryosparcm snaplogs
  3. the job log file corresponding to the affected job, if it exists. The path to the file can be obtained from the command (with actual project and job IDs instead of P99 and J199, respectively)
    cryosparcm cli "get_job_log_path_abs('P99', 'J199')"