Failed to launch! 255 - after changing cryosparc_master_hostname and then updating

I have not tested such a downgrade, but I would expect that, if the CryoSPARC instance was never at a version below 4.4, you will need

  1. an independent installation of the CUDA toolkit version 11.x
  2. a definition inside cryosparc_worker/config.sh
    export CRYOSPARC_CUDA_PATH=/your/path/to/cuda
    
    such that
    /your/path/to/cuda/bin/nvcc
    exists (see the quick check sketched after this list).
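
For example, a minimal sanity check, with /your/path/to/cuda as a placeholder for the actual toolkit location on your system:

# the nvcc binary must exist under the directory CRYOSPARC_CUDA_PATH points to
ls -l /your/path/to/cuda/bin/nvcc
# and it should report a CUDA 11.x release
/your/path/to/cuda/bin/nvcc --version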

It was previously at v4.2.1 with NVIDIA driver 470.74 and running smoothly.

I expect I could downgrade and then update the hostnames as outlined above and hope it gets back to running.

Hi @wtempel
I have performed the downgrade to v4.2.1 and kept NVIDIA driver version 470.74, but now I get an error about the worker connection. CryoSPARC does not see my GPUs.

My master hostname has changed and I fixed the master config.sh file.
How do I know the new worker hostname?
I ran cryosparcw connect --worker new_hostname --master new_hostname (using the same name for both, matching the name in the config.sh file), but I am not sure this is right.

Unexpectedly, despite cryosparcm status showing v4.2.1, I get the error “AssertionError: Nvidia driver version 470.74 is out of date and will not work with this version of CryoSPARC. Please install version 520.61.05 or newer.”

I was previously running fine on v4.2.1 with this nvidia driver. Sorry to resurrect an old issue, but any advice will be appreciated.

@cryofun
What is the output of the command

cat /home/user/software/cryosparc/cryosparc_worker/version

?

Matching hostnames would be correct for a single-computer “standalone” CryoSPARC instance.
Also, what is the path to your independently installed CUDA toolkit?
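
If it helps, one quick way to gather that information, assuming the same install location as in the command above:

# hostname as defined in the master configuration
grep "^export CRYOSPARC_MASTER_HOSTNAME" /home/user/software/cryosparc/cryosparc_master/config.sh
# fully qualified hostname of the computer itself
hostname -f
# CUDA toolkit path currently configured for the worker, if any
grep "^export CRYOSPARC_CUDA_PATH" /home/user/software/cryosparc/cryosparc_worker/config.sh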

That command gives 4.4.1

/usr/local/cuda-11.2

This is a single-computer standalone instance.

Thanks for the help.

You may want to

  1. ensure that the command
    grep "^export CRYOSPARC_CUDA_PATH" /home/user/software/cryosparc/cryosparc_worker/config.sh | tail -n 1
    
    outputs
    export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2
    
  2. follow the manual worker update instructions.
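
Regarding step 1: if the line is missing or points at the wrong path, a minimal way to add it (a sketch only; back up config.sh first and remove any stale CRYOSPARC_CUDA_PATH line by hand):

# keep a copy of the current worker configuration
cp /home/user/software/cryosparc/cryosparc_worker/config.sh /home/user/software/cryosparc/cryosparc_worker/config.sh.bak
# append the toolkit definition
echo 'export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2' >> /home/user/software/cryosparc/cryosparc_worker/config.sh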

Does this help?

Thanks,
After the manual worker update, the worker is now at v4.2.1 according to the grep command in your last post.
cryosparcm status and the web interface both show v4.2.1,
but I can't see GPUs under "Instance" in the web interface and cannot queue jobs (the queue option is greyed out, not clickable).

nvidia-smi shows my 2 GPUs and CUDA Version: 11.4, which I thought should be 11.2 based on the contents of /usr/local (which doesn't even contain an 11.4 directory). But this definitely hasn't changed and is probably unrelated to the issue; it's how the system was set up by our computer provider (Exxact system).

Could it be that the worker and master are not communicating and worker needs to be added?
I thought I would try to re-establish the GPUs with
./bin/cryosparcw connect --master name --worker name --gpus 0,1

and it gave

CRYOSPARC CONNECT
Attempting to register worker name to command name:39002
Connecting as unix user cryosparc_user
Will register using ssh string: cryosparc_user@name
If this is incorrect, you should re-run this command with the flag --sshstr
*** CommandClient: (http://name:39002/api) URL Error [Errno 111] Connection refused
Traceback (most recent call last):

Since this is a standalone computer, it shouldn't need to connect to itself through a port like this, right? How can I force the master and worker to connect internally without a port, assuming this is the issue?

nvidia-smi will display the CUDA version corresponding to the version of the driver, which may differ from, but must be compatible with, the version of the CUDA toolkit. Your installed toolkit v11.2 may be compatible with the nvidia driver, even if the CUDA version displayed for the driver is different.
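
As a quick illustration (assuming the toolkit path you reported, /usr/local/cuda-11.2):

# the header of nvidia-smi shows the driver and the driver's supported CUDA version
nvidia-smi | head -n 4
# the toolkit release is reported by nvcc and may legitimately differ
/usr/local/cuda-11.2/bin/nvcc --version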

Access to the port is needed, but a simplified configuration can be used, at the cost that no additional worker nodes can be added.
What are the outputs of these commands (in a fresh shell)?

eval $(cryosparcm env)
host $CRYOSPARC_MASTER_HOSTNAME
curl ${CRYOSPARC_MASTER_HOSTNAME}:39002
curl 127.0.0.1:39002
exit
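
If those curl commands are refused, it may also be worth checking whether anything is listening on the CryoSPARC ports at all (a quick check; ss ships with CentOS 7):

# list listening TCP sockets in the 39000-39009 range
ss -tln | grep ':3900'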

Thanks for the explanation regarding CUDA versions.

The commands give 'Connection refused' for those ports:
host $CRYOSPARC_MASTER_HOSTNAME
name has address 156.111.#.#

curl ${CRYOSPARC_MASTER_HOSTNAME}:39002
curl: (7) Failed connect to name:39002; Connection refused

curl 127.0.0.1:39002
curl: (7) Failed connect to 127.0.0.1:39002; Connection refused

So can I just open it with the following commands (CentOS 7)? Do I need to worry about leaving these ports open?

firewall-cmd --zone=public --permanent --add-port=39000-39009/tcp
firewall-cmd --reload

I found Dan's old post about the /etc/hosts file and noticed that mine does not contain the new hostname after 127.0.0.1. Should I add it?

It shows:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

Hi @wtempel I tried opening the ports and still get the same output (connection refused) from both curl commands. I didn't use --add-port=39000-39009/tcp, but rather one command per port, since the output of firewall-cmd --list-all uses nomenclature that depends on the original command; I'm not sure if this is right or matters in the end.
I followed this post: Installing troubles - please advise - #2 by UNCuser

Based on other posts, it seems like the CentOS 7 hostname configuration is often an issue.
(example post - Cryosparc2_worker installation problem on Centos7 - #5 by vamsee)

hostname -f gives
name.mc.institution.edu

hostname gives
name

The cryosparc_master/config.sh has CRYOSPARC_MASTER_HOSTNAME="name.mc.institution.edu"

That depends on your circumstances, such as who has access to the network. In general, you want to allow desired connections and block all other connection attempts (guide).
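
If you do want to keep the ports reachable while narrowing the exposure, firewalld rich rules are one option (a sketch only; the source range below is a placeholder for whichever machines actually need access):

# allow the CryoSPARC ports only from a trusted subnet, instead of from everywhere
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="156.111.0.0/16" port port="39000-39009" protocol="tcp" accept'
firewall-cmd --reload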

You could append the $CRYOSPARC_MASTER_HOSTNAME value to the end of that line:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 name.mc.institution.edu

This error might indicate that the command_core service was not running. For the cryosparcw connect command to work, you additionally have to ensure that the CryoSPARC master services have been started and have not failed since startup. To check, you can run the command

cryosparcm status | grep -v LICENSE
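
If any service there is shown as not running, restarting the master and re-checking is a reasonable next step (a generic suggestion, not specific to this error):

# restart all CryoSPARC master services, then confirm they stay up
cryosparcm restart
cryosparcm status | grep -v LICENSE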

Thanks @wtempel - this did get me to a new stage. I added the $CRYOSPARC_MASTER_HOSTNAME as you suggested, and now when I run those curl commands with CryoSPARC running I get 'Hello World' responses from both. I hadn't tried with CryoSPARC running before.

I was able to get CryoSPARC to see the GPUs, but now I get an error when I try to kill jobs: they get stuck in launch, and killing them gives
Unable to kill P# J#: ServerError: 'run_on_master_direct'

To get to this point, I reran the command to register the worker, which I suspect was wrong.

./bin/cryosparcw connect --master name.mc.institution.edu --worker cname.mc.institution.edu --ssdpath /mnt/SCRATCH/cryosparc_cache/

Attempting to register worker name.mc.institution.edu to command name.mc.institution.edu:39002
Connecting as unix user user
Will register using ssh string: user@name.mc.institution.edu
If this is incorrect, you should re-run this command with the flag --sshstr
Autodetecting available GPUs…
Detected 2 CUDA devices.

id pci-bus name
0 0000:01:00.0 NVIDIA GeForce RTX 3080 Ti
1 0000:21:00.0 NVIDIA GeForce RTX 3080 Ti

All devices will be enabled now.
This can be changed later using --update
Worker will be registered with SSD cache location /mnt/scratch
Autodetecting the amount of RAM available…
This machine has xGB RAM .
Registering worker…
Done.

You can now launch jobs on the master node and they will be scheduled
on to this worker node if resource requirements are met.
Final configuration for name.mc.institution.edu
cache_path : /mnt/scratch
cache_quota_mb : None
cache_reserve_mb : 10000
desc : None
gpus : [{‘id’: 0, ‘mem’: 12631212032, ‘name’: ‘NVIDIA GeForce RTX 3080 Ti’}, {‘id’: 1, ‘mem’: 12639338496, ‘name’: ‘NVIDIA GeForce RTX 3080 Ti’}]
hostname : name.mc.institution.edu
lane : default
monitor_port : None
name : name.mc.institution.edu
resource_fixed : {‘SSD’: True}
resource_slots : {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], ‘GPU’: [0, 1], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}
ssh_str : user@name.mc.institution.edu
title : Worker node name.mc.institution.edu
type : node
worker_bin_path : /home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw

Please can you confirm that the "name": value in the output of the command

cryosparcm cli "get_scheduler_targets()"

matches $CRYOSPARC_MASTER_HOSTNAME exactly (it did not in the cryosparcw connect command you posted, but that may have been a typo).

Thanks, @wtempel
Yes, sorry, that was a typo. The master, worker, and $CRYOSPARC_MASTER_HOSTNAME values all match exactly in that command, except that the title has the prefix 'Worker node':

'name': 'name.institution.edu'
'title': 'Worker node name.institution.edu'
cryosparc_master/config.sh : export CRYOSPARC_MASTER_HOSTNAME="name.institution.edu"

@cryofun What is the output now of the command

cryosparcm cli "show_scheduler_targets()"

?


Should this be “get_scheduler_targets”?

cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: ‘/mnt/SCRATCH/’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 12631212032, ‘name’: ‘NVIDIA GeForce RTX 3080 Ti’}, {‘id’: 1, ‘mem’: 12639338496, ‘name’: ‘NVIDIA GeForce RTX 3080 Ti’}], ‘hostname’: ‘name.institution.edu’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘name.institution.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], ‘GPU’: [0, 1], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘user@name.institution.edu’, ‘title’: ‘Worker node name.institution.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw’}]

cryosparcm cli "show_scheduler_targets()"
Traceback (most recent call last):
File "/home/user/software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/software/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 57, in <module>
print(eval("cli."+command))
File "<string>", line 1, in <module>
AttributeError: 'CommandClient' object has no attribute 'show_scheduler_targets'

Hi @wtempel, sorry to bug you, but any thoughts on the output of that command? Should I look elsewhere for a worker/master hostname mismatch? Thanks

The issue is hard to troubleshoot with certain relevant information obfuscated for privacy. Please can you send us an email with the following (non-redacted) information:

  1. output of the command
    cryosparcm cli "get_scheduler_targets()"
    
  2. the tgz file created by the command
    cryosparcm snaplogs
  3. the job log file corresponding to one of the affected jobs, if it exists. The path to the file can be obtained from the command (with actual project and job IDs in place of P99 and J199, respectively)
    cryosparcm cli "get_job_log_path_abs('P99', 'J199')"
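
As an example of using that last command (a sketch; substitute real project and job IDs), the returned path can be passed directly to tail to inspect the end of the log:

# show the last lines of the job log for one of the affected jobs
tail -n 50 "$(cryosparcm cli "get_job_log_path_abs('P99', 'J199')")"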