Job Launched on Worker Node but Doesn't Run

Hello,

I am new to CryoSPARC and keep encountering a problem. I have installed the worker version on one workstation and the standalone master version on another station. I successfully connected the worker to the master, but when I run a job it says "launched" and then doesn't do anything. I have followed some of the solutions shared on CryoSPARC Discuss but am still unsuccessful.

I have done the following:

cryosparcm joblog P3 J104

/home/haydin/Research/CryoEM/J104/job.log: No such file or directory

Additionally, I have established an NFS share between the worker and the master, so I don't understand why the job is stuck at "launched".

Any assistance with this matter would be greatly appreciated.

Thank you!

Welcome to the forum @sith2546 .
Please can you post the outputs of the following commands on the CryoSPARC master node:

ls -al /home/haydin/Research/CryoEM/J104/
cryosparcm status | grep -v LICENSE
cryosparcm cli "get_scheduler_targets()"
cryosparcm filterlog -l error command_core | tail -n 40

[edited: corrected command]

Hello,

The following are the outputs from the CryoSPARC master node.

  1. ls -al J104
    total 80
    drwxrwxr-x 1 haydin haydin 60 Jun 11 09:56 .
    drwxrwxr-x 1 haydin haydin 726 Jun 13 13:12 ..
    -rw-rw-r-- 1 haydin haydin 553 Jun 13 13:11 events.bson
    drwxrwxr-x 1 haydin haydin 0 Jun 11 09:56 gridfs_data
    -rw-rw-r-- 1 haydin haydin 75651 Jun 13 13:11 job.json

  2. cryosparcm status | grep -v LICENSE


CryoSPARC System master node installed at
/opt/cryosparc/cryosparc_master
Current cryoSPARC version: v4.5.3

CryoSPARC process status:

app RUNNING pid 300936, uptime 3:06:51
app_api RUNNING pid 300955, uptime 3:06:49
app_api_dev STOPPED Not started
command_core RUNNING pid 300858, uptime 3:07:01
command_rtp RUNNING pid 300922, uptime 3:06:53
command_vis RUNNING pid 300886, uptime 3:06:55
database RUNNING pid 300756, uptime 3:07:05


License is valid

global config variables:
export CRYOSPARC_MASTER_HOSTNAME="localhost"
export CRYOSPARC_HOSTNAME_CHECK="localhost"
export CRYOSPARC_DB_PATH="/opt/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=38000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_JOB_LAUNCH_TIMEOUT_SECONDS=120

  3. cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25332940800, 'name': 'NVIDIA GeForce RTX 3090'}], 'hostname': 'localhost', 'lane': 'default', 'monitor_port': None, 'name': 'localhost', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'haydin@localhost', 'title': 'Worker node localhost', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}], 'hostname': '172.21.17.91', 'lane': 'default', 'monitor_port': None, 'name': '172.21.17.91', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'haydin@172.21.17.91', 'title': 'Worker node 172.21.17.91', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}]

  4. c13m filterlog -l error command_core | tail -n 40

For this I got an error : c13m: command not found

Sorry, my mistake. Please can you try instead:

cryosparcm filterlog -l error command_core | tail -n 40

Do you mean cryosparcm filterlog -l error command_core | tail -n 40?

Then I get the following:

2024-06-11 09:55:38,462 get_gpu_info_run ERROR | Failed to get GPU info on 172.21.17.111
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | Traceback (most recent call last):
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1516, in get_gpu_info_run
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | value = subprocess.check_output(full_command, stderr=subprocess.STDOUT, shell=shell, timeout=JOB_LAUNCH_TIMEOUT_SECONDS).decode()
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | raise CalledProcessError(retcode, process.args,
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | subprocess.CalledProcessError: Command '['ssh', 'haydin@172.21.17.111', 'bash -c "eval $(/opt/cryosparc/cryosparc_worker/bin/cryosparcw env); python /opt/cryosparc/cryosparc_worker/cryosparc_compute/get_gpu_info.py"']' returned non-zero exit status 255.
2024-06-13 12:26:38,993 wrapper ERROR | JSONRPC ERROR at get_job_log_path_abs
2024-06-13 12:26:38,993 wrapper ERROR | Traceback (most recent call last):
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
2024-06-13 12:26:38,993 wrapper ERROR | res = func(*args, **kwargs)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8173, in get_job_log_path_abs
2024-06-13 12:26:38,993 wrapper ERROR | job_dir_abs = get_job_dir_abs(project_uid, job_uid)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
2024-06-13 12:26:38,993 wrapper ERROR | return func(*args, **kwargs)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8157, in get_job_dir_abs
2024-06-13 12:26:38,993 wrapper ERROR | job_doc = get_job(project_uid, job_uid, 'job_dir')
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
2024-06-13 12:26:38,993 wrapper ERROR | return func(*args, **kwargs)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6124, in get_job
2024-06-13 12:26:38,993 wrapper ERROR | raise ValueError(f"{project_uid} {job_uid} does not exist.")
2024-06-13 12:26:38,993 wrapper ERROR | ValueError: PX JY does not exist.

So 172.21.17.111 is the wrong worker node IP address. I did change it to 172.21.17.91, which is the correct worker node, so I don't understand why the log still shows 172.21.17.111.

A CRYOSPARC_MASTER_HOSTNAME setting of "localhost" is not supported on a CryoSPARC instance with external workers.
You may want to ask your IT support to

  • assign a permanent, resolvable hostname to the CryoSPARC master host (on the DNS server and, probably, DHCP server)
  • ensure that the command
    hostname -f, when executed on the CryoSPARC master host, prints that permanent, resolvable hostname. Let’s assume that hostname is server11.your.domain.
  • similarly, let you know the permanent, resolvable hostname of the GPU node, say server12.your.domain, and ensure that this hostname is printed by the
    hostname -f command on that node.

Then

  • use that permanent, resolvable hostname of the master host in the definition of CRYOSPARC_MASTER_HOSTNAME, remove the CRYOSPARC_HOSTNAME_CHECK line inside cryosparc_master/config.sh, and restart CryoSPARC for the changes to take effect.
  • remove the existing target node records
  • re-connect the target nodes, specifying the permanent, resolvable hostname of the CryoSPARC master host with the --master keyword and
    --port 38000 each time (see the sketch after this list). Important: cryosparcw connect must be run on the worker that is being connected.
  • ensure that /opt/cryosparc/cryosparc_worker/bin/cryosparcw exists on host 172.21.17.91 (whose real hostname you should know by now)
  • ensure that /home/haydin/Research/CryoEM/ exists on 172.21.17.91 at that same path and is shared with the CryoSPARC master node

Initially, the hostname in CryoSPARC was set to "rosalind". However, I encountered a timeout error. Based on recommendations from CryoSPARC discussions, I changed the hostname to "localhost". When I attempt to change it back to "rosalind", the hostname does not update on the web interface. What steps should I take to resolve this issue?

Additionally, you mentioned that the worker node should also have the /home/haydin/Research/CryoEM/ path. This path originates from the master node. Does this mean that during the NFS setup, I should have mounted this path on the worker node as well?

Please ensure that the GPU server (172.21.17.91) can resolve the rosalind hostname. What is the output of the following command (on 172.21.17.91):

host rosalind

What are the outputs of the following commands on rosalind:

hostname -f
host $(hostname -f)

Using "localhost" as the master hostname may work on Single Workstation CryoSPARC instances that do not have additional workers, and therefore does not apply to the current case.

Project directories need to be shared between master and worker nodes. This could be achieved by one of the nodes acting as the NFS server ("exporting" shared directories) and the others as NFS clients (mounting the shared directories). On larger setups, it may be more common to have a dedicated storage server that exports project directories to both master and worker nodes. It is important that the data are available under a common path on the master and worker nodes.
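As an illustration only (the export options below and the MASTER_HOST placeholder are assumptions for this thread; adapt them to your environment), the master could export the project directory and the worker could mount it at the identical path like this:

# on the master (NFS server), add to /etc/exports (172.21.17.91 is the worker from this thread)
/home/haydin/Research/CryoEM 172.21.17.91(rw,sync,no_subtree_check)

# apply the export
sudo exportfs -ra

# on the worker (NFS client), mount at the identical path
# (replace MASTER_HOST with the master's resolvable hostname or IP address)
sudo mkdir -p /home/haydin/Research/CryoEM
sudo mount -t nfs MASTER_HOST:/home/haydin/Research/CryoEM /home/haydin/Research/CryoEM

# optional /etc/fstab entry so the mount persists across reboots
MASTER_HOST:/home/haydin/Research/CryoEM /home/haydin/Research/CryoEM nfs defaults,_netdev 0 0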

Hello,

Here are the results following the successful mounting of /home/haydin/Research/CryoEM/ on the worker node.

  1. host rosalind
    rosalind has address 172.21.17.241
    Host rosalind not found: 3(NXDOMAIN)

  2. hostname -f
    rosalind

  3. host $(hostname -f)
    rosalind has address 127.0.1.1

In my /etc/hosts file on the master node I have the following:

127.0.0.1 localhost
127.0.1.1 rosalind
172.21.17.91 sn4622119034 (this is my worker node hostname)

But when I do ifconfig, the IP address for the host is 172.21.17.241. Do I need to change the IP address to 172.21.17.241 for rosalind?

If you are referring to the definition inside rosalind:/etc/hosts: no change is needed at this time.
I also think that inside /opt/cryosparc/cryosparc_master/config.sh,

  1. you may want to change (back) to
    export CRYOSPARC_MASTER_HOSTNAME="rosalind"
    
  2. remove or comment out the CRYOSPARC_HOSTNAME_CHECK line

and restart CryoSPARC.
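For example, after those edits the relevant lines in /opt/cryosparc/cryosparc_master/config.sh might look like this (commenting the line out is just one way to disable it):

export CRYOSPARC_MASTER_HOSTNAME="rosalind"
# export CRYOSPARC_HOSTNAME_CHECK="localhost"

# then apply the change:
cryosparcm restart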
On rosalind, as user haydin, please run the following commands and post their outputs:

hostname -f
id
nvidia-smi --query-gpu=index,name --format=csv
ssh 172.21.17.91 'hostname -f && id && nvidia-smi --query-gpu=index,name --format=csv'
ssh 172.21.17.91 'hostname && ls -al /home/haydin/Research/CryoEM/J104/'

Please ensure that password-less ssh access from rosalind to haydin@172.21.17.91 works (without a prompt for a password or host key confirmation).
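If password-less access is not yet configured, one common way to set it up, shown here only as an example (run as haydin on rosalind; the ed25519 key type is an assumption), is:

# generate a key pair if one does not already exist (accept the default path)
ssh-keygen -t ed25519

# install the public key on the worker (asks for the password one last time)
ssh-copy-id haydin@172.21.17.91

# verify: this should print the worker hostname without any password or host key prompt
ssh haydin@172.21.17.91 'hostname'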