Worker on another node gets Failed to launch! 255 Permission denied, please try again

License is valid.
 
Launching job on lane default target cryo10.ourdomain.edu ...
 
Running job on remote worker node hostname cryo10.ourdomain.edu
 
Failed to launch! 255
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

I already looked at this thread with the same permission error but that appears to be for a worker and master on the same server. There are a few other threads with this error but I haven’t found one that resembles this.

The worker starts without errors:

bin/cryosparcw connect --worker cryo10.ourdomain.edu  --master cryo11.ourdomain.edu --port 39000 --nossd --update
 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker cryo10.ourdomain.edu to command cryo11.ourdomain.edu:39002
  Connecting as unix user root
  Will register using ssh string: root@cryo10.ourdomain.edu
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string> 
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    sn46xxx
    cryo10.ourdomain.edu

Here are the results of the suggested commands in the FAQ:

eval bin/cryosparcw env
export "CRYOSPARC_USE_GPU=true"
export "CRYOSPARC_CONDA_ENV=cryosparc_worker_env"
export "CRYOSPARC_DEVELOP=false"
export "CRYOSPARC_LICENSE_ID=bae9edd6-54dd-11ef-93a3-7b0d1eadc7e2"
export "CRYOSPARC_ROOT_DIR=/opt/cryosparc_worker"
export "CRYOSPARC_PATH=/opt/cryosparc_worker/bin"
export "PATH=/opt/cryosparc_worker/bin:/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/opt/cryosparc_worker/deps/anaconda/condabin:/usr/lib/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin"
export "LD_LIBRARY_PATH="
export "LD_PRELOAD=/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libpython3.10.so"
export "PYTHONPATH=/opt/cryosparc_worker"
export "PYTHONNOUSERSITE=true"
export "CONDA_SHLVL=1"
export "CONDA_PROMPT_MODIFIER=(cryosparc_worker_env)"
export "CONDA_EXE=/opt/cryosparc_worker/deps/anaconda/bin/conda"
export "CONDA_PREFIX=/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env"
export "CONDA_PYTHON_EXE=/opt/cryosparc_worker/deps/anaconda/bin/python"
export "CONDA_DEFAULT_ENV=cryosparc_worker_env"
export "NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0"
export "NUMBA_CUDA_INCLUDE_PATH=/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include"
export "NUMBA_CUDA_USE_NVIDIA_BINDING=1"


 env | grep PATH
PATH=/usr/lib/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/cryosparc_worker/bin:/root/bin


 /sbin/ldconfig -p | grep -I cuda
	libicudata.so.50 (libc6,x86-64) => /lib64/libicudata.so.50
	libcudadebugger.so.1 (libc6,x86-64) => /lib64/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /lib64/libcuda.so.1
	libcuda.so.1 (libc6) => /lib/libcuda.so.1
	libcuda.so (libc6,x86-64) => /lib64/libcuda.so
	libcuda.so (libc6) => /lib/libcuda.so

uname -a
Linux sn4622115934 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
 cryosparc_worker]# nvidia-smi

Thu Aug 22 13:00:14 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:4F:00.0 Off |                  Off |
| 30%   32C    P8     8W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:52:00.0 Off |                  Off |
| 30%   36C    P8    14W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    Off  | 00000000:56:00.0 Off |                  Off |
| 30%   33C    P8    16W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    Off  | 00000000:57:00.0 Off |                  Off |
| 30%   35C    P8    14W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000    Off  | 00000000:D1:00.0 Off |                  Off |
| 30%   33C    P8    30W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000    Off  | 00000000:D2:00.0 Off |                  Off |
| 30%   34C    P8    24W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A6000    Off  | 00000000:D5:00.0 Off |                  Off |
| 30%   37C    P8    28W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A6000    Off  | 00000000:D6:00.0 Off |                  Off |
| 30%   38C    P0    78W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc is available via 
/opt/sbgrid/x86_64-linux/diffdock/73ef67f/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver

Also, on the installation instructions page, under "Connect a Cluster to CryoSPARC" (which describes the cryosparcm cluster connect command), there's a mention of registering with the master process:

Connect a Cluster to CryoSPARC

Once the cryosparc_worker package is installed, the cluster must be registered with the master process. This requires a template for job submission commands and scripts that the master process will use to submit jobs to the cluster scheduler.

To register the cluster, provide CryoSPARC with the following two files and call the cryosparcm cluster connect command.

Is a job scheduler required to get a worker going?

Pardon the light obfuscation of the hostnames…

Welcome to the forum @RobK.

To use a given GPU server as a CryoSPARC worker, one could either

  • if that GPU server is part of a gridengine, slurm or similar cluster, register the cluster using the
    cryosparcm cluster connect command (run on CryoSPARC master host, details)
  • or, alternatively, register the server using a form of the
    cryosparcw connect command (run on the GPU server in question, details)

but not both.

A common Linux account (same user name, numeric user id) on all CryoSPARC master and worker servers is an installation prerequisite. CryoSPARC should be installed and all cryosparcm and cryosparcw commands should be run under that Linux account. Do not use the root account for this purpose.

The "Permission denied" errors you posted suggest that password-less ssh access from the CryoSPARC master to the CryoSPARC worker has not been configured, or that the GPU server's host key has not been confirmed. Once the cryosparcw connect method has been chosen and used to register a GPU server with the CryoSPARC master, jobs are "sent" via ssh from the master to the GPU server. This process requires password-less (for example, key-based) ssh access from the dedicated Linux account on the master to the Linux account with the same username and user id on the GPU server.
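
For example, key-based access could be set up along these lines, run on the master host as the dedicated non-root account (<user> is a placeholder for that account's username):

ssh-keygen -t ed25519                          # create a key pair if one does not already exist; accept the default location
ssh-copy-id <user>@cryo10.ourdomain.edu        # install the public key in the worker account's authorized_keys
ssh <user>@cryo10.ourdomain.edu hostname       # should print the worker hostname without a password prompt; also confirms the host key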

OK thanks for the tips. I reinstalled as a non root user and made sure password-less ssh works between the 2 nodes.

For any project or data that was running I see a suggestion to:
NOTE you should probably back up the run/db folder from your cryosparc installation to save the database result metadata from your previous installation in case you need it again.

There is a run directory that has:

run]# ls -lt
total 9372
-rw-r--r--. 1 root root   18277 Aug 23 11:07 supervisord.log
-rw-rw-r--. 1 root root    9180 Aug 23 11:07 command_rtp.log
-rw-rw-r--. 1 root root 7488311 Aug 23 11:07 command_core.log
-rw-rw-r--. 1 root root   12342 Aug 23 11:07 command_vis.log
-rw-r--r--. 1 root root  673767 Aug 23 11:07 database.log
-rw-rw-r--. 1 root root 1358859 Aug 23 10:08 app.log
-rw-rw-r--. 1 root root     205 Aug 22 15:32 app_api.log
drwxrwxr-x. 2 root root    4096 Aug  8 11:15 vis

However, the cryosparc_database directory is in the directory where cryosparc_master lives. What should actually be copied over?

This tutorial mentions running rsync on the full directory:

rsync -r --links /data/cryosparc/cryosparc_database/* /cryoem/cryosparc/cryosparc_database
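
A minimal sketch of that copy, assuming the old instance is stopped before the database files are touched (paths as in the tutorial):

cryosparcm stop                                                                   # make sure nothing is writing to the database during the copy
rsync -r --links /data/cryosparc/cryosparc_database/* /cryoem/cryosparc/cryosparc_database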

Since it’s a new installation there isn’t too much data but I wanted to see what I could copy over.

There are quite a few collection* and index* files, as well as directories like diagnostic.data and journal, and what looks like just one project, with .wt and .turtle files.

So the permission error on the worker node went away; however, the job stalls before starting. The log only says this:

License is valid. Launching job on lane default target ourserver.edu

What else can we do to troubleshoot or check settings?

Did you confirm that

  1. the cryosparc_worker/bin/cryosparcw file is available
  2. the project directory is available for writing

on all CryoSPARC worker computers at their expected respective absolute paths?
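
For example, one quick way to check both from the master (placeholders for the actual user, worker hostname and project path):

ssh <user>@<worker> ls -l /path/to/cryosparc_worker/bin/cryosparcw    # is the worker executable present at the expected path?
ssh <user>@<worker> touch /path/to/project_dir/.write_test           # can the project directory be written to?
ssh <user>@<worker> rm /path/to/project_dir/.write_test              # clean up the test file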

What are the outputs of the commands

cryosparcm cli "get_scheduler_targets()"
cryosparcm cli "get_job('P99', 'J199', 'instance_information')"
cryosparcm cli "get_project_dir_abs('P99')"

where P99, J199 are the stuck job’s project and job IDs, respectively?

Yes

OK, perhaps this is a leftover from previously installing as root, but from the requested commands below I see there is a P1 J33 showing status as Launched. The J1 project directory, however, is owned by root:
drwxrwx---. 3 root root 186 Aug 12 14:30 J1

I believe this should be owned by the non-root user that I did the installation under?

There is only one worker but I am still trying to understand what “expected” means here.

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': server, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': server, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': server, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': server, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': server, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': server, 'name': 'NVIDIA RTX A6000'}], 'hostname': 'server', 'lane': 'default', 'monitor_port': None, 'name': 'server', 'resource_fixed': {'SSD': False}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]}, 'ssh_str': 'root@server', 'title': 'Worker node sn4622115935', 'type': 'node', 'worker_bin_path': '/opt/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/opt/apps/cryosparc_ssd', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}], 'hostname': 'cryoem10.ourdomain.edu', 'lane': 'default', 'monitor_port': None, 'name': 'cryoem10.ourdomain.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]}, 'ssh_str': 'user@cryoem10.ourdomain.edu', 'title': 'Worker node cryoem10.ourdomain.edu', 'type': 'node', 'worker_bin_path': '/opt/apps/cryosparc_worker/bin/cryosparcw'}]

I’ll start with changing the ownership of the J1 project directory.

That is a good start. You may also need to change the ownership of the project directory that contains the J1 directory.
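
For example (user, group and project path are placeholders; run as root or via sudo):

chown -R <user>:<group> /path/to/CS-myproject    # recursively hand the project directory, including J1, to the CryoSPARC account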

The get_scheduler_targets() output indicates two workers, the first of which should probably be removed.

Expected:

  1. /opt/apps/cryosparc_worker/bin/cryosparcw exists and can be executed on the worker
  2. the same, shared project directory is mounted and has the same absolute path on the master and worker(s)

Well, there are 2 workers in the sense that the master also has the cryosparc_worker package installed (per "Install the cryosparc_worker Package" in the guide). Which entry shows more than 2 workers?

I see. The first entry's ssh_str may need to be updated to correspond to the non-root user, unless the entry's 'hostname' value exactly matches the value of the CRYOSPARC_MASTER_HOSTNAME variable (as defined inside cryosparc_master/config.sh).
[edited to correct an error]

In this case, yes, the result of the hostname command matches the value of CRYOSPARC_MASTER_HOSTNAME in cryosparc_master/config.sh. Since we're no longer using 'root', what would be updated here, just ssh_str?

To modify the worker target entry for the master host (acting as both master and worker), you could use

cryosparcw connect [..] --ssh "<user>@<hostname>" --update

where <user> should be replaced with the username of the dedicated non-root user and <hostname> should be replaced with the value of the CRYOSPARC_MASTER_HOSTNAME variable.
[edited]

Ok this seemed to work (obfuscating the real user and domain):

cryosparcw connect --worker cryoem10.ourdomain.edu  --master cryoem11.ourdomain.edu  --ssh "user@hostname" --update
 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker cryoem10.ourdomain.edu to command cryoem11.ourdomain.edu:39002
  Connecting as unix user user
  Will register using ssh string: user@hostname
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string> 
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    hostname
    cryoem10.ourdomain.edu
 ---------------------------------------------------------------
  Worker will be registered with 72 CPUs.
 ---------------------------------------------------------------
  Updating target cryoem10.ourdomain.edu
  Current configuration:
               cache_path :  /opt/apps/cryosparc_ssd
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}]
                 hostname :  cryoem10.ourdomain.edu
                     lane :  default
             monitor_port :  None
                     name :  cryoem10.ourdomain.edu
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]}
                  ssh_str :  exx@cryoem10.ourdomain.edu
                    title :  Worker node cryoem10.ourdomain.edu
                     type :  node
          worker_bin_path :  /opt/apps/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------
  SSH connection string will be updated to user@hostname
 ---------------------------------------------------------------
  Updating.. 
  Done. 
 ---------------------------------------------------------------
  Final configuration for cryoem10.ourdomain.edu
               cache_path :  /opt/apps/cryosparc_ssd
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}]
                 hostname :  cryoem10.ourdomain.edu
                     lane :  default
             monitor_port :  None
                     name :  cryoem10.ourdomain.edu
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]}
                  ssh_str :  user@hostname
                    title :  Worker node cryoem10.ourdomain.edu
                     type :  node
          worker_bin_path :  /opt/apps/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------

It appears that the incorrect worker has been updated, and that the --sshstr value was copied literally rather than adjusted to the correct username and hostname.
You may want to remove and redo the worker connections.
cryosparcw connect should be run on each worker node to be connected (including the master node if the master will also be used as a worker) under the exx Linux account (which I assume is the non-root account that “owns” the CryoSPARC installation and processes). In this case --sshstr will be inferred automatically and does not need to be specified as part of the command.
Please ensure that the exx Linux account has the same numeric user id on the master and worker hosts.
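For example:

id -u exx    # run on both the master and the worker hosts; the printed numeric uid should be identical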
Side notes:

The cache_path value /opt/apps/cryosparc_ssd suggests that the cache directory is stored on the same device as the software installation and, potentially, other important files. Because the device holding cache_path will be subject to significant wear and likely a shortened life cycle, you may want to specify for cache_path a directory on a device that does not store important files.
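
If you later decide to move the cache, the worker entry can be updated with something along these lines, run on the worker (the cache path is a placeholder; please check the flag usage against the current guide):

bin/cryosparcw connect --worker cryoem10.ourdomain.edu --master cryoem11.ourdomain.edu --port 39000 --ssdpath /scratch/cryosparc_cache --update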

There seems to have been some confusion. user@hostname was not meant to be included in the command literally, but user and hostname were meant to be replaced by values specific to your CryoSPARC installation. I will update my own earlier post to make that clearer. But, as I mentioned earlier, you may not need to specify --sshstr when you recreate worker connections from scratch (under the correct Linux account).

I've tried a few variations and I keep getting None:

 cryosparcm cli 'remove_scheduler_target_node("'sn4622115935'")'
None

Confirmed.

Noted, but the disk that /opt is on happens to be an SSD.

Pardon the obfuscation; I was just trying to hide the real username and hostname, which I'll share now: the user is exx, and sn4622115935 is the master hostname.

Here is an updated get_scheduler_targets output:

cryosparcm cli "get_scheduler_targets()"

[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}], 'hostname': 'sn4622115935', 'lane': 'default', 'monitor_port': None, 'name': 'sn4622115935', 'resource_fixed': {'SSD': False}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]}, 'ssh_str': 'root@sn4622115935', 'title': 'Worker node sn4622115935', 'type': 'node', 'worker_bin_path': '/opt/cryosparc_worker/bin/cryosparcw'}]

None is not an indication of command failure. To confirm the success of the command, please inspect the output of a renewed

cryosparcm cli "get_scheduler_targets()"

command.

This target specification still does not look right (and currently includes only a single host). You should be able to remove it with the command

cryosparcm cli "remove_scheduler_target_node('sn4622115935')"

I also suspect that sn4622115935 is not a suitable hostname for a CryoSPARC master host that has or will have additional CryoSPARC workers connected. You may want to ask your IT support team to assign a permanent FQDN to this computer that can be resolved on your (private) network by the CryoSPARC worker computer.
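
To check, you could compare (hostnames are placeholders):

hostname -f                            # on the master: does it report a fully qualified name?
getent hosts cryoem11.ourdomain.edu    # on the worker: can the master's FQDN be resolved?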

Indeed, it already has an FQDN, which I switched to, and now it appears the worker node is starting jobs:

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/opt/apps/cryosparc_ssd', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}], 'hostname': 'cryoem10.fitzpatrick.zi.columbia.edu', 'lane': 'default', 'monitor_port': None, 'name': 'cryoem10.fitzpatrick.zi.columbia.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]}, 'ssh_str': 'exx@cryoem10.fitzpatrick.zi.columbia.edu', 'title': 'Worker node cryoem10.fitzpatrick.zi.columbia.edu', 'type': 'node', 'worker_bin_path': '/opt/apps/cryosparc_worker/bin/cryosparcw'}]

I wish I had known earlier about the job.log file and how to find the directory the user defined as the place where jobs are started!

OK, so now the master is not taking any jobs. All jobs we are queuing are going directly to cryoem10. It does still say that cryoem10 is the worker node. The status looks good:

 cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/opt/apps/cryosparc_master
Current cryoSPARC version: v4.5.3
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 66549, uptime 12:42:36
app_api                          RUNNING   pid 66568, uptime 12:42:34
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 66491, uptime 12:42:48
command_rtp                      RUNNING   pid 66522, uptime 12:42:40
command_vis                      RUNNING   pid 66518, uptime 12:42:42
database                         RUNNING   pid 66379, uptime 12:42:52

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_LICENSE_ID="xxx"
export CRYOSPARC_MASTER_HOSTNAME="cryoem11.fitzpatrick.zi.columbia.edu"
export CRYOSPARC_DB_PATH="/opt/apps/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=true 
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true

So what else got changed that could lead to this? The job.log file just ends with:

========= sending heartbeat at 2024-08-27 05:38:27.728572
<string>:1: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
========= sending heartbeat at 2024-08-27 05:38:37.755518
========= sending heartbeat at 2024-08-27 05:38:47.778933
<string>:1: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
========= sending heartbeat at 2024-08-27 05:38:57.802843
/opt/apps/cryosparc_worker/cryosparc_compute/sigproc.py:656: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  x = n.linalg.lstsq(w.reshape((-1,1))*A, w*b)[0]
/opt/apps/cryosparc_worker/cryosparc_compute/plotutil.py:571: RuntimeWarning: divide by zero encountered in log
  logabs = n.log(n.abs(fM))
========= sending heartbeat at 2024-08-27 05:39:07.828806
/opt/apps/cryosparc_worker/cryosparc_compute/plotutil.py:44: RuntimeWarning: invalid value encountered in sqrt
  cradwn = n.sqrt(cradwn)
========= sending heartbeat at 2024-08-27 05:39:17.854324
/opt/apps/cryosparc_worker/bin/cryosparcw: line 150: 14110 Terminated              python -c "import cryosparc_compute.run as run; run.run()" "$@"

Edit: I realized that when I ran:
cryosparcm cli "remove_scheduler_target_node('sn4622115935')"

That deleted the master's worker entry. To re-add it, do I really need to specify these 6 arguments?

 cryosparcm cli "add_scheduler_target_node('cryoem11.fitzpatrick.zi.columbia.edu')"
*** (http://cryoem11.fitzpatrick.zi.columbia.edu:39002, code 400) Encountered ServerError from JSONRPC function "add_scheduler_target_node" with params ('cryoem11.fitzpatrick.zi.columbia.edu',):
ServerError: add_scheduler_target_node() missing 6 required positional arguments: 'ssh_str', 'worker_bin_path', 'num_cpus', 'cuda_devs', 'ram_mb', and 'has_ssd'
Traceback (most recent call last):
  File "/opt/apps/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
    res = func(*args, **kwargs)
TypeError: add_scheduler_target_node() missing 6 required positional arguments: 'ssh_str', 'worker_bin_path', 'num_cpus', 'cuda_devs', 'ram_mb', and 'has_ssd'

You may want to instead run the cryosparcw connect command on cryoem11 (guide).
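
For example, something along these lines, run on cryoem11 as the exx user (the worker install path and the --nossd flag are assumptions based on the earlier target listing):

/path/to/cryosparc_worker/bin/cryosparcw connect --worker cryoem11.fitzpatrick.zi.columbia.edu --master cryoem11.fitzpatrick.zi.columbia.edu --port 39000 --nossd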

Thanks, that worked. Hope this thread helps someone down the line.

One suggestion: if the backup and restore are done as a different user, come up with a way to handle the permissions.

Are you referring to the cryosparcm subcommands backup and restore?
Did you run

  1. cryosparcm backup as the Linux user that also runs the CryoSPARC instance whose database is being backed up?
  2. cryosparcm restore as the Linux user that also runs the CryoSPARC instance that should use the restored database?
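
For reference, a typical sequence, run as that same Linux user (directory and file names are placeholders; please check the guide for the exact flags):

cryosparcm backup --dir=/backups/cryosparc --file=cryosparc_db_backup.archive
# ...and later, on the instance that should use the restored database:
cryosparcm restore --file=/backups/cryosparc/cryosparc_db_backup.archive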