Failed to launch! 255 after changing cryosparc_master_hostname and then updating

I wanted to update CryoSPARC, but cryosparcm backup gave:
ERROR: Re-run this command on the master node: oldname.
Alternatively, set CRYOSPARC_FORCE_HOSTNAME=true in cryosparc_master/config.sh to suppress this error.
If this error message is incorrect, set CRYOSPARC_HOSTNAME_CHECK to the correct hostname in cryosparc_master/config.sh.

We had moved the computer recently, so maybe it had a new name. I ran hostname -f and changed cryosparc_master_hostname in ~/software/cryosparc/cryosparc_master/config.sh to match the new name.
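For reference, the line I edited looks like this (the value shown here is a stand-in for our actual new name):

export CRYOSPARC_MASTER_HOSTNAME="newname"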

Then cryosparcm backup gave a new error:
database: ERROR (spawn error)

I had seen this before and followed the same fix: kill all CryoSPARC processes found with

ps -ax | grep cryosparc

After that, cryosparcm backup worked.
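For anyone else hitting the same spawn error, the sequence I used was roughly the following (the <PID> placeholder is whatever the grep shows):

cryosparcm stop           # stop the instance cleanly first
ps -ax | grep cryosparc   # note any leftover CryoSPARC PIDs
kill <PID>                # repeat for each leftover process
cryosparcm start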

Then I updated, and the UI seems OK, but the first job gave:
License is valid.

Launching job on lane default target oldname …

Running job on remote worker node hostname oldname

Failed to launch! 255
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

Can I just change the cryosparc_master_hostname back to the original hostname?

The cryosparc job gives:
Running job on remote worker node hostname oldname

This is a good first step, as long as you ensure that the new hostname is "stable" in that

  • the hostname does not change from reboot to reboot
  • other computers, such as additional CryoSPARC workers on this CryoSPARC instance, correctly resolve the new hostname

You may need help from your network admins to ensure the aforementioned conditions.
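For example, one could verify both conditions with checks along these lines (the new hostname here is a placeholder):

hostname -f        # run before and after a reboot; the output should not change
host new.hostname  # run from each other computer in the instance; should resolve to this machine's address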

I suspect the scheduler target records in your CryoSPARC database still include a record referencing the old hostname. To help us propose a resolution, please let us know:

  1. Is the new hostname "stable" (as defined above)?
  2. What is the output of the command
    cryosparcm cli "get_scheduler_targets()"
    
  3. the old hostname
  4. the new hostname as shown by the command
    hostname -f

Hi @wtempel and thanks for the quick response! Here are the answers:

  1. Is the new hostname "stable" (as defined above)?

I will have to check with the sysadmin, but it has not changed since the initial check and update from earlier today. The name was previously just asdf and now has .mc.institution.edu appended (asdf.mc.institution.edu).

  2. What is the output of the command cryosparcm cli "get_scheduler_targets()"

With cryosparc running
cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/mnt/ssd-scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 12631212032, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 1, 'mem': 12639338496, 'name': 'NVIDIA GeForce RTX 3080 Ti'}], 'hostname': 'asdf', 'lane': 'default', 'monitor_port': None, 'name': 'asdf', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'user@asdf', 'title': 'Worker node asdf', 'type': 'node', 'worker_bin_path': '/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw'}]

  3. the old hostname

asdf

  4. the new hostname as shown by the command: hostname -f

asdf.mc.institution.edu

Thanks for posting this info.
If it turns out that asdf.mc.institution.edu is stable, you can avoid the need for an ssh connection, and also avoid CRYOSPARC_FORCE_HOSTNAME=true (which one would want to avoid in the absence of "special" circumstances), by having a three-way match between

  • hostname -f output
  • $CRYOSPARC_MASTER_HOSTNAME
  • the target "hostname" value

You can achieve this by:

  1. deleting the outdated target:
    cryosparcm cli "remove_scheduler_target_node('asdf')"
    
  2. adding a target with the new hostname, ensuring correct master and worker hostnames and the correct port number ($CRYOSPARC_BASE_PORT inside cryosparc_master/config.sh) (guide)
    /home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw connect --master asdf.mc.institution.edu --worker asdf.mc.institution.edu --port 99999 --ssdpath /mnt/ssd-scratch/cryosparc_cache
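Afterwards, one way to confirm the three-way match would be checks along these lines (paths as in your installation):

hostname -f
grep CRYOSPARC_MASTER_HOSTNAME /home/user/software/cryosparc/cryosparc_master/config.sh
cryosparcm cli "get_scheduler_targets()"   # check the 'hostname' value of the new target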
    

Thank you! I have an issue now: cryosparcm cli "get_scheduler_targets()" returns an "AssertionError: Nvidia driver version 470.74 is out of date".

Is this really just a CUDA update issue, or have I messed up something else? I cannot even run the non-GPU job type "remove duplicates".

Also, why is it connecting on port 39002 instead of 39000, and is this a big deal?

I ran
cryosparcm cli "remove_scheduler_target_node('asdf')"

Then checked the effect with
cryosparcm cli "get_scheduler_targets()"

I assume this means it worked, as the previously listed target is now gone.

Then
/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw connect --master asdf.institution.edu --worker asdf.institution.edu --port 39000 --ssdpath /mnt/ssd/cryosparc_cache

The output is below:


CRYOSPARC CONNECT --------------------------------------------

Attempting to register worker asdf.institution.edu to command asdf.institution.edu:39002
Connecting as unix user user
Will register using ssh string: user@asdf.institution.edu
If this is incorrect, you should re-run this command with the flag --sshstr

Connected to master.

Current connected workers:

Worker will be registered with 64 CPUs.
Autodetecting available GPUs…
Traceback (most recent call last):
File ā€œbin/connect.pyā€, line 233, in
gpu_devidxs = check_gpus()
File ā€œbin/connect.pyā€, line 97, in check_gpus
assert correct_driver_version is None, (
AssertionError: Nvidia driver version 470.74 is out of date and will not work with this version of CryoSPARC. Please install version 520.61.05 or newer.

This is expected. The software adds 2 to CRYOSPARC_BASE_PORT to identify the command_core port of your CryoSPARC installation.
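For example, with the default definition in cryosparc_master/config.sh,

export CRYOSPARC_BASE_PORT=39000

command_core listens on 39000 + 2 = 39002, which is why the connect output shows port 39002 even though --port 39000 was specified.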

Please update the Nvidia driver of the asdf computer to version 520.61.05 or newer and reboot the computer after the update.
After the reboot, please record and post the outputs of these commands:

cryosparcm cli "get_scheduler_targets()"
/home/user/software/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
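To confirm the new driver is active after the reboot, a query along these lines should report 520.61.05 or newer:

nvidia-smi --query-gpu=driver_version --format=csv,noheader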

I will downgrade to perform some work and a backup before updating the Nvidia drivers. Are there any considerations I should take into account before performing
cryosparcm update --version=v4.2.1

I have not tested such a downgrade, but I would expect that, if the CryoSPARC instance was never at a version below 4.4, you will need

  1. an independent installation of the CUDA toolkit, version 11.x
  2. a definition inside cryosparc_worker/config.sh:
    export CRYOSPARC_CUDA_PATH=/your/path/to/cuda
    
    such that
    /your/path/to/cuda/bin/nvcc
    exists.
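For example, if the toolkit were installed under /usr/local/cuda-11.2 (a hypothetical path), the check would look like:

export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2
$CRYOSPARC_CUDA_PATH/bin/nvcc --version   # should report a CUDA 11.x release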

It was previously at v4.2.1 with Nvidia driver 470.74 and running smoothly.

I expect I could downgrade and then update the hostnames as outlined above and hope it gets back to running.

Hi @wtempel
I have performed the downgrade to v4.2.1 and kept Nvidia driver version 470.74, but now I get an error about the worker connection. CryoSPARC does not see my GPUs.

My master hostname has changed and I fixed the master config.sh file.
How do I know the new worker hostname?
I ran cryosparcw connect --worker new_hostname --master new_hostname (with the same name for each, the same name as in the config.sh file), but I am not sure this is right.

Unexpectedly, despite cryosparcm status showing v4.2.1, I get the error "AssertionError: Nvidia driver version 470.74 is out of date and will not work with this version of CryoSPARC. Please install version 520.61.05 or newer."

I was previously running fine on v4.2.1 with this nvidia driver. Sorry to resurrect an old issue, but any advice will be appreciated.

@cryofun
What is the output of the command

cat /home/user/software/cryosparc/cryosparc_worker/version

?

Matching hostnames would be correct for a single-computer "standalone" CryoSPARC instance.
Also, what is the path to your independently installed CUDA toolkit?

That command gives 4.4.1

/usr/local/cuda-11.2

This is a single-computer standalone instance.

Thanks for the help.

You may want to

  1. ensure that the command
    grep "^export CRYOSPARC_CUDA_PATH" /home/user/software/cryosparc/cryosparc_worker/config.sh | tail -n 1
    
    outputs
    export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2
    
  2. follow manual worker update instructions.
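If that grep prints nothing or a different path, a minimal fix (assuming config.sh lives at the path above) would be to append the definition:

echo 'export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2' >> /home/user/software/cryosparc/cryosparc_worker/config.sh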

Does this help?

Thanks,
After the manual worker update, the worker is now at 4.2.1 according to the version check from your earlier post.
cryosparcm status and the web interface all show v4.2.1,
but I can't see GPUs in the web interface "Instance" section and cannot queue jobs (the queue button is greyed out, not clickable).

nvidia-smi shows my 2 GPUs and CUDA Version: 11.4, which I thought should be 11.2 according to the contents of /usr/local (which doesn't even contain 11.4). But this definitely hasn't changed and is probably unrelated to the issue. It's how it was set up by our computer provider (an Exxact system).

Could it be that the worker and master are not communicating and the worker needs to be added?
I thought I would try to re-establish the GPUs with
./bin/cryosparcw connect --master name --worker name --gpus 0,1

and it gave

CRYOSPARC CONNECT
Attempting to register worker name to command name:39002
Connecting as unix user cryosparc_user
Will register using ssh string: cryosparc_user@name
If this is incorrect, you should re-run this command with the flag --sshstr
*** CommandClient: (http://name:39002/api) URL Error [Errno 111] Connection refused
Traceback (most recent call last):

Since this is a standalone computer, it shouldn't need to connect to itself over a port like this, right? How can I force the master and worker to connect internally without a port, assuming this is the issue?

nvidia-smi will display the CUDA version corresponding to the version of the driver, which may differ from, but must be compatible with, the version of the CUDA toolkit. Your installed toolkit v11.2 may be compatible with the nvidia driver, even if the CUDA version displayed for the driver is different.
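If you want to see both versions side by side, checks along these lines should work (toolkit path as you reported earlier):

nvidia-smi | head -n 4                    # header shows the driver and its supported CUDA version (11.4 here)
/usr/local/cuda-11.2/bin/nvcc --version   # shows the installed toolkit version (11.2)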

Access to the port is needed, but a simplified configuration can be used, with the limitation that no additional worker nodes can be added.
What are the outputs of these commands (in a fresh shell)?

eval $(cryosparcm env)
host $CRYOSPARC_MASTER_HOSTNAME
curl ${CRYOSPARC_MASTER_HOSTNAME}:39002
curl 127.0.0.1:39002
exit

Thanks for the explanation regarding CUDA versions.

The commands give connection refused on those ports:
host $CRYOSPARC_MASTER_HOSTNAME
name has address 156.111.#.#

curl ${CRYOSPARC_MASTER_HOSTNAME}:39002
curl: (7) Failed connect to name:39002; Connection refused

curl 127.0.0.1:39002
curl: (7) Failed connect to 127.0.0.1:39002; Connection refused

So can I just open it with the following commands (CentOS 7)? Do I need to worry about leaving these ports open?

firewall-cmd --zone=public --permanent --add-port=39000-39009/tcp
firewall-cmd --reload
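Before changing anything, I could presumably first distinguish a firewall block from nothing listening on the port; my guesses at suitable checks:

sudo firewall-cmd --zone=public --list-ports   # is anything in the 3900x range already allowed?
ss -tlnp | grep 3900                           # is any process actually listening on the 3900x ports?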

I found Dan's old post about the /etc/hosts file and note mine does not contain the new hostname after 127.0.0.1. Should I add it?

It shows:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
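If adding it is the right move, I assume the line would end up something like this (with our real hostname substituted):

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 name.mc.institution.edu name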


Hi @wtempel, I tried opening the ports and still get the same output (connection refused) from both curl commands. I didn't use --add-port=39000-39009/tcp, but rather one command for each port, since the output of firewall-cmd --list-all was in nomenclature dependent on the original command; not sure if this is right or matters in the end.
Followed based on this - Installing troubles - please advise - #2 by UNCuser

Based on other posts, it seems like the CentOS 7 hostname setup is often an issue.
(example post - Cryosparc2_worker installation problem on Centos7 - #5 by vamsee)

hostname -f gives
name.mc.institution.edu

hostname gives
name

The cryosparc_master/config.sh has CRYOSPARC_MASTER_HOSTNAME="name.mc.institution.edu"
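In case it is useful, these are the checks I can run to see how each form of the name resolves locally (names are placeholders as above):

host name
host name.mc.institution.edu
getent hosts name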