Failed to launch! 255 after changing cryosparc_master_hostname and then updating

I wanted to update CryoSPARC, but cryosparcm backup gave:
ERROR: Re-run this command on the master node: oldname.
Alternatively, set CRYOSPARC_FORCE_HOSTNAME=true in cryosparc_master/config.sh to suppress this error.
If this error message is incorrect, set CRYOSPARC_HOSTNAME_CHECK to the correct hostname in cryosparc_master/config.sh.

We had moved the computer recently, so maybe it had a new name. I ran hostname -f and changed cryosparc_master_hostname in ~/software/cryosparc/cryosparc_master/config.sh to match the new name.
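For reference, the line I edited looks like this (the value shown here is a stand-in for our actual new name):

export CRYOSPARC_MASTER_HOSTNAME="newname"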

Then cryosparcm backup gave a new error:
database: ERROR (spawn error)

I had seen this before and followed the same fix: kill all CryoSPARC processes found with

ps -ax | grep cryosparc

After that, cryosparcm backup worked.
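For anyone else hitting the same spawn error, the sequence I used was roughly the following (the <PID> placeholder is whatever the grep shows):

cryosparcm stop           # stop the instance cleanly first
ps -ax | grep cryosparc   # note any leftover CryoSPARC PIDs
kill <PID>                # repeat for each leftover process
cryosparcm start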

Then I updated, and the UI seems OK, but the first job gave:
License is valid.

Launching job on lane default target oldname …

Running job on remote worker node hostname oldname

Failed to launch! 255
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

Can I just change the cryosparc_master_hostname back to the original hostname?

The cryosparc job gives:
Running job on remote worker node hostname oldname

This is a good first step, as long as you ensure that the new hostname is "stable" in that

  • the hostname does not change from reboot to reboot
  • other computers, such as additional CryoSPARC workers on this CryoSPARC instance, correctly resolve the new hostname

You may need help from your network admins to ensure the aforementioned conditions.
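For example, one could verify both conditions with checks along these lines (the new hostname here is a placeholder):

hostname -f        # run before and after a reboot; the output should not change
host new.hostname  # run from each other computer in the instance; should resolve to this machine's address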

I suspect the scheduler target records in your CryoSPARC database still include a record referencing the old hostname. To help us propose a resolution, please let us know:

  1. Is the new hostname "stable" (as defined above)?
  2. What is the output of the command
    cryosparcm cli "get_scheduler_targets()"
    
  3. the old hostname
  4. the new hostname as shown by the command
    hostname -f

Hi @wtempel and thanks for the quick response! Here are the answers:

  1. Is the new hostname "stable" (as defined above)?

I will have to check with the sysadmin, but it has not changed since the initial check and update from earlier today. The name was previously just asdf and now has .mc.institution.edu appended (asdf.mc.institution.edu).

  2. What is the output of the command cryosparcm cli "get_scheduler_targets()"

With cryosparc running
cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/mnt/ssd-scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 12631212032, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 1, 'mem': 12639338496, 'name': 'NVIDIA GeForce RTX 3080 Ti'}], 'hostname': 'asdf', 'lane': 'default', 'monitor_port': None, 'name': 'asdf', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'user@asdf', 'title': 'Worker node asdf', 'type': 'node', 'worker_bin_path': '/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw'}]

  3. the old hostname

asdf

  4. the new hostname as shown by the command: hostname -f

asdf.mc.institution.edu

Thanks for posting this info.
If it turns out that asdf.mc.institution.edu is stable, you can avoid the need for an ssh connection, and also avoid CRYOSPARC_FORCE_HOSTNAME=true (which one would want to avoid in the absence of "special" circumstances), by having a three-way match between

  • hostname -f output
  • $CRYOSPARC_MASTER_HOSTNAME
  • the target "hostname" value

You can achieve this by:

  1. deleting the outdated target:
    cryosparcm cli "remove_scheduler_target_node('asdf')"
    
  2. adding a target with the new hostname, ensuring correct master and worker hostnames and the correct port number ($CRYOSPARC_BASE_PORT inside cryosparc_master/config.sh) (guide)
    /home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw connect --master asdf.mc.institution.edu --worker asdf.mc.institution.edu --port 99999 --ssdpath /mnt/ssd-scratch/cryosparc_cache
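Afterwards, one way to confirm the three-way match would be checks along these lines (paths as in your installation):

hostname -f
grep CRYOSPARC_MASTER_HOSTNAME /home/user/software/cryosparc/cryosparc_master/config.sh
cryosparcm cli "get_scheduler_targets()"   # check the 'hostname' value of the new target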
    

Thank you! I have an issue now: cryosparcm cli "get_scheduler_targets()" returns an "AssertionError: Nvidia driver version 470.74 is out of date".

Is this really just a CUDA update issue, or have I messed up something else? I cannot even run the non-GPU job type "remove duplicates".

Also, why is it connecting on port 39002 instead of 39000, and is this a big deal?

I ran
cryosparcm cli "remove_scheduler_target_node('asdf')"

Then checked the effect with
cryosparcm cli "get_scheduler_targets()"

I assume this means it worked, as the previously listed target is now gone.

Then
/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw connect --master asdf.institution.edu --worker asdf.institution.edu --port 39000 --ssdpath /mnt/ssd/cryosparc_cache

The output is below:


CRYOSPARC CONNECT --------------------------------------------

Attempting to register worker asdf.institution.edu to command asdf.institution.edu:39002
Connecting as unix user user
Will register using ssh string: user@asdf.institution.edu
If this is incorrect, you should re-run this command with the flag --sshstr

Connected to master.

Current connected workers:

Worker will be registered with 64 CPUs.
Autodetecting available GPUs…
Traceback (most recent call last):
File ā€œbin/connect.pyā€, line 233, in
gpu_devidxs = check_gpus()
File ā€œbin/connect.pyā€, line 97, in check_gpus
assert correct_driver_version is None, (
AssertionError: Nvidia driver version 470.74 is out of date and will not work with this version of CryoSPARC. Please install version 520.61.05 or newer.

This is expected. The software adds 2 to CRYOSPARC_BASE_PORT to identify the command_core port of your CryoSPARC installation.
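For example, with the default definition in cryosparc_master/config.sh,

export CRYOSPARC_BASE_PORT=39000

command_core listens on 39000 + 2 = 39002, which is why the connect output shows port 39002 even though --port 39000 was specified.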

Please update the Nvidia driver of the asdf computer to version 520.61.05 or newer and reboot the computer after the update.
After the reboot, please record and post the outputs of these commands:

cryosparcm cli "get_scheduler_targets()"
/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
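To confirm the new driver is active after the reboot, a query along these lines should report 520.61.05 or newer:

nvidia-smi --query-gpu=driver_version --format=csv,noheader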

I will downgrade to perform some work and a backup before updating the Nvidia drivers. Are there any considerations I should take into account before performing
cryosparcm update --version=v4.2.1

I have not tested such a downgrade, but I would expect that, if the CryoSPARC instance was never at a version below 4.4, you will need

  1. an independent installation of the CUDA toolkit, version 11.x
  2. a definition inside cryosparc_worker/config.sh:
    export CRYOSPARC_CUDA_PATH=/your/path/to/cuda
    
    such that
    /your/path/to/cuda/bin/nvcc
    exists.
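For example, if the toolkit were installed under /usr/local/cuda-11.2 (a hypothetical path), the check would look like:

export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2
$CRYOSPARC_CUDA_PATH/bin/nvcc --version   # should report a CUDA 11.x release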

It was previously at v4.2.1 with Nvidia driver 470.74 and running smoothly.

I expect I could downgrade and then update the hostnames as outlined above and hope it gets back to running.

Hi @wtempel
I have performed the downgrade to v4.2.1 and kept Nvidia driver version 470.74, but now I get an error about the worker connection. CryoSPARC does not see my GPUs.

My master hostname has changed and I fixed the master config.sh file.
How do I know the new worker hostname?
I ran cryosparcw connect --worker new_hostname --master new_hostname (with the same name for each, the same name as in the config.sh file), but I am not sure this is right.

Unexpectedly, despite cryosparcm status showing v4.2.1, I get the error "AssertionError: Nvidia driver version 470.74 is out of date and will not work with this version of CryoSPARC. Please install version 520.61.05 or newer."

I was previously running fine on v4.2.1 with this nvidia driver. Sorry to resurrect an old issue, but any advice will be appreciated.

@cryofun
What is the output of the command

cat /home/user/software/cryosparc/cryosparc_worker/version

?

Matching hostnames would be correct for a single-computer "standalone" CryoSPARC instance.
Also, what is the path to your independently installed CUDA toolkit?

That command gives 4.4.1

/usr/local/cuda-11.2

This is a single-computer standalone instance.

Thanks for the help.

You may want to

  1. ensure that the command
    grep "^export CRYOSPARC_CUDA_PATH" /home/user/software/cryosparc/cryosparc_worker/config.sh | tail -n 1
    
    outputs
    export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2
    
  2. follow manual worker update instructions.
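If that grep prints nothing or a different path, a minimal fix (assuming config.sh lives at the path above) would be to append the definition:

echo 'export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2' >> /home/user/software/cryosparc/cryosparc_worker/config.sh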

Does this help?

Thanks,
After the manual worker update, the worker is now at 4.2.1 according to the version check from your earlier post.
cryosparcm status and the web interface all show v4.2.1,
but I can't see GPUs in the web interface "Instance" section and cannot queue jobs (the queue button is greyed out, not clickable).

nvidia-smi shows my 2 GPUs and CUDA Version: 11.4, which I thought should be 11.2 according to the contents of /usr/local (which doesn't even contain 11.4). But this definitely hasn't changed and is probably unrelated to the issue. It's how it was set up by our computer provider (an Exxact system).

Could it be that the worker and master are not communicating and the worker needs to be added?
I thought I would try to re-establish the GPUs with
./bin/cryosparcw connect --master name --worker name --gpus 0,1

and it gave

CRYOSPARC CONNECT
Attempting to register worker name to command name:39002
Connecting as unix user cryosparc_user
Will register using ssh string: cryosparc_user@name
If this is incorrect, you should re-run this command with the flag --sshstr
*** CommandClient: (http://name:39002/api) URL Error [Errno 111] Connection refused
Traceback (most recent call last):

Since this is a standalone computer, it shouldn't need to connect to itself over a port like this, right? How can I force the master and worker to connect internally without a port, assuming this is the issue?

nvidia-smi will display the CUDA version corresponding to the version of the driver, which may differ from, but must be compatible with, the version of the CUDA toolkit. Your installed toolkit v11.2 may be compatible with the nvidia driver, even if the CUDA version displayed for the driver is different.
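If you want to see both versions side by side, checks along these lines should work (toolkit path as you reported earlier):

nvidia-smi | head -n 4                    # header shows the driver and its supported CUDA version (11.4 here)
/usr/local/cuda-11.2/bin/nvcc --version   # shows the installed toolkit version (11.2)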

Access to the port is needed, but a simplified configuration can be used, with the limitation that no additional worker nodes can be added.
What are the outputs of these commands (in a fresh shell)?

eval $(cryosparcm env)
host $CRYOSPARC_MASTER_HOSTNAME
curl ${CRYOSPARC_MASTER_HOSTNAME}:39002
curl 127.0.0.1:39002
exit

Thanks for the explanation regarding CUDA versions.

The commands give connection refused on those ports:
host $CRYOSPARC_MASTER_HOSTNAME
name has address 156.111.#.#

curl ${CRYOSPARC_MASTER_HOSTNAME}:39002
curl: (7) Failed connect to name:39002; Connection refused

curl 127.0.0.1:39002
curl: (7) Failed connect to 127.0.0.1:39002; Connection refused

So can I just open it with the following commands (CentOS 7)? Do I need to worry about leaving these ports open?

firewall-cmd --zone=public --permanent --add-port=39000-39009/tcp
firewall-cmd --reload
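Before changing anything, I could presumably first distinguish a firewall block from nothing listening on the port; my guesses at suitable checks:

sudo firewall-cmd --zone=public --list-ports   # is anything in the 3900x range already allowed?
ss -tlnp | grep 3900                           # is any process actually listening on the 3900x ports?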

I found Dan's old post about the /etc/hosts file and note mine does not contain the new hostname after 127.0.0.1. Should I add it?

It shows:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
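If adding it is the right move, I assume the line would end up something like this (with our real hostname substituted):

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 name.mc.institution.edu name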


Hi @wtempel, I tried opening the ports and still get the same output (connection refused) from both curl commands. I didn't use --add-port=39000-39009/tcp, but rather one command for each port, since the output of firewall-cmd --list-all was in nomenclature dependent on the original command; not sure if this is right or matters in the end.
Followed based on this - Installing troubles - please advise - #2 by UNCuser

Based on other posts, it seems like the CentOS 7 hostname setup is often an issue.
(example post - Cryosparc2_worker installation problem on Centos7 - #5 by vamsee)

hostname -f gives
name.mc.institution.edu

hostname gives
name

The cryosparc_master/config.sh has CRYOSPARC_MASTER_HOSTNAME="name.mc.institution.edu"
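In case it is useful, these are the checks I can run to see how each form of the name resolves locally (names are placeholders as above):

host name
host name.mc.institution.edu
getent hosts name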