Job Launched on Worker Node but Doesn't Run

Hello,

I am new to CryoSPARC and keep encountering a problem. I have installed the worker version on one workstation and the standalone master version on another station. I successfully connected the worker to the master, but when I run a job it says "launched" and then doesn't do anything. I have followed some of the solutions shared on CryoSPARC Discuss but am still unsuccessful.

I have done the following:

cryosparcm joblog P3 J104

/home/haydin/Research/CryoEM/J104/job.log: No such file or directory

Additionally, I have established an NFS share between the worker and the master, so I don't understand why the job is stuck at "launched".

Any assistance with this matter would be greatly appreciated.

Thank you!

Welcome to the forum @sith2546 .
Please can you post the outputs of the following commands on the CryoSPARC master node:

ls -al /home/haydin/Research/CryoEM/J104/
cryosparcm status | grep -v LICENSE
cryosparcm cli "get_scheduler_targets()"
cryosparcm filterlog -l error command_core | tail -n 40

[edited: corrected command]

Hello,

The following are the outputs from the CryoSPARC master node.

  1. ls -al J104
    total 80
    drwxrwxr-x 1 haydin haydin 60 Jun 11 09:56 .
    drwxrwxr-x 1 haydin haydin 726 Jun 13 13:12 ..
    -rw-rw-r-- 1 haydin haydin 553 Jun 13 13:11 events.bson
    drwxrwxr-x 1 haydin haydin 0 Jun 11 09:56 gridfs_data
    -rw-rw-r-- 1 haydin haydin 75651 Jun 13 13:11 job.json

  2. cryosparcm status | grep -v LICENSE


CryoSPARC System master node installed at
/opt/cryosparc/cryosparc_master
Current cryoSPARC version: v4.5.3

CryoSPARC process status:

app RUNNING pid 300936, uptime 3:06:51
app_api RUNNING pid 300955, uptime 3:06:49
app_api_dev STOPPED Not started
command_core RUNNING pid 300858, uptime 3:07:01
command_rtp RUNNING pid 300922, uptime 3:06:53
command_vis RUNNING pid 300886, uptime 3:06:55
database RUNNING pid 300756, uptime 3:07:05


License is valid

global config variables:
export CRYOSPARC_MASTER_HOSTNAME="localhost"
export CRYOSPARC_HOSTNAME_CHECK="localhost"
export CRYOSPARC_DB_PATH="/opt/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=38000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_JOB_LAUNCH_TIMEOUT_SECONDS=120

  3. cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25332940800, 'name': 'NVIDIA GeForce RTX 3090'}], 'hostname': 'localhost', 'lane': 'default', 'monitor_port': None, 'name': 'localhost', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'haydin@localhost', 'title': 'Worker node localhost', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25322520576, 'name': 'NVIDIA RTX A5000'}], 'hostname': '172.21.17.91', 'lane': 'default', 'monitor_port': None, 'name': '172.21.17.91', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'haydin@172.21.17.91', 'title': 'Worker node 172.21.17.91', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}]

  4. c13m filterlog -l error command_core | tail -n 40

For this I got an error : c13m: command not found

Sorry, my mistake. Please can you try instead:

cryosparcm filterlog -l error command_core | tail -n 40

Do you mean cryosparcm filterlog -l error command_core | tail -n 40?

Then I get the following:

2024-06-11 09:55:38,462 get_gpu_info_run ERROR | Failed to get GPU info on 172.21.17.111
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | Traceback (most recent call last):
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1516, in get_gpu_info_run
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | value = subprocess.check_output(full_command, stderr=subprocess.STDOUT, shell=shell, timeout=JOB_LAUNCH_TIMEOUT_SECONDS).decode()
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | raise CalledProcessError(retcode, process.args,
2024-06-11 09:55:38,462 get_gpu_info_run ERROR | subprocess.CalledProcessError: Command '['ssh', 'haydin@172.21.17.111', 'bash -c "eval $(/opt/cryosparc/cryosparc_worker/bin/cryosparcw env); python /opt/cryosparc/cryosparc_worker/cryosparc_compute/get_gpu_info.py"']' returned non-zero exit status 255.
2024-06-13 12:26:38,993 wrapper ERROR | JSONRPC ERROR at get_job_log_path_abs
2024-06-13 12:26:38,993 wrapper ERROR | Traceback (most recent call last):
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
2024-06-13 12:26:38,993 wrapper ERROR | res = func(*args, **kwargs)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8173, in get_job_log_path_abs
2024-06-13 12:26:38,993 wrapper ERROR | job_dir_abs = get_job_dir_abs(project_uid, job_uid)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
2024-06-13 12:26:38,993 wrapper ERROR | return func(*args, **kwargs)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8157, in get_job_dir_abs
2024-06-13 12:26:38,993 wrapper ERROR | job_doc = get_job(project_uid, job_uid, 'job_dir')
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 187, in wrapper
2024-06-13 12:26:38,993 wrapper ERROR | return func(*args, **kwargs)
2024-06-13 12:26:38,993 wrapper ERROR | File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6124, in get_job
2024-06-13 12:26:38,993 wrapper ERROR | raise ValueError(f"{project_uid} {job_uid} does not exist.")
2024-06-13 12:26:38,993 wrapper ERROR | ValueError: PX JY does not exist.

So 172.21.17.111 is the wrong worker node IP address. I did change it to 172.21.17.91, which is the correct worker node, so I don't understand why the log still shows 172.21.17.111.

A CRYOSPARC_MASTER_HOSTNAME setting of "localhost" is not supported on a CryoSPARC instance with external workers.
You may want to ask your IT support to

  • assign a permanent, resolvable hostname to the CryoSPARC master host (on the DNS server and, probably, DHCP server)
  • ensure that the command
    hostname -f, when executed on the CryoSPARC master host, prints that permanent, resolvable hostname. Let’s assume that hostname is server11.your.domain.
  • similarly, let you know the permanent, resolvable hostname of the GPU node, say server12.your.domain, and ensure that this hostname is printed by the
    hostname -f command on that node.

Then

  • use that permanent, resolvable hostname of the master host in the definition of CRYOSPARC_MASTER_HOSTNAME, remove the CRYOSPARC_HOSTNAME_CHECK line inside cryosparc_master/config.sh, and restart CryoSPARC for the changes to take effect.
  • remove the existing target node records
  • re-connect the target nodes, specifying the permanent, resolvable hostname of the CryoSPARC master host with the --master keyword and
    --port 38000 each time (see the sketch after this list). Important: cryosparcw connect must be run on the worker that is being connected.
  • ensure that /opt/cryosparc/cryosparc_worker/bin/cryosparcw exists on host 172.21.17.91 (whose real hostname you should know by now)
  • ensure that /home/haydin/Research/CryoEM/ exists on 172.21.17.91 at that same path and is shared with the CryoSPARC master node

Initially, the hostname in CryoSPARC was set to "rosalind". However, I encountered a timeout error. Based on recommendations from CryoSPARC discussions, I changed the hostname to "localhost". When I attempt to change it back to "rosalind", the hostname does not update on the web interface. What steps should I take to resolve this issue?

Additionally, you mentioned that the worker node should also have the /home/haydin/Research/CryoEM/ path. This path originates from the master node. Does this mean that during the NFS setup, I should have mounted this path on the worker node as well?

Please ensure that the GPU server (172.21.17.91) can resolve the rosalind hostname. What is the output of the following command (on 172.21.17.91):

host rosalind

What are the outputs of the following commands on rosalind:

hostname -f
host $(hostname -f)

Using "localhost" as the master hostname may work on Single Workstation CryoSPARC instances that do not have additional workers, and therefore does not apply to the current case.

Project directories need to be shared between master and worker nodes. This could be achieved by one of the nodes acting as the NFS server ("exporting" shared directories) and the others as NFS clients (mounting the shared directories). On larger setups, it may be more common to have a dedicated storage server that exports project directories to both master and worker nodes. It is important that the data are available under a common path on the master and worker nodes.
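As an illustration only (the export options below and the MASTER_HOST placeholder are assumptions for this thread; adapt them to your environment), the master could export the project directory and the worker could mount it at the identical path like this:

# on the master (NFS server), add to /etc/exports (172.21.17.91 is the worker from this thread)
/home/haydin/Research/CryoEM 172.21.17.91(rw,sync,no_subtree_check)

# apply the export
sudo exportfs -ra

# on the worker (NFS client), mount at the identical path
# (replace MASTER_HOST with the master's resolvable hostname or IP address)
sudo mkdir -p /home/haydin/Research/CryoEM
sudo mount -t nfs MASTER_HOST:/home/haydin/Research/CryoEM /home/haydin/Research/CryoEM

# optional /etc/fstab entry so the mount persists across reboots
MASTER_HOST:/home/haydin/Research/CryoEM /home/haydin/Research/CryoEM nfs defaults,_netdev 0 0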

Hello,

Here are the results following the successful mounting of /home/haydin/Research/CryoEM/ on the worker node.

  1. host rosalind
    rosalind has address 172.21.17.241
    Host rosalind not found: 3(NXDOMAIN)

  2. hostname -f
    rosalind

  3. host $(hostname -f)
    rosalind has address 127.0.1.1

In my /etc/hosts file on the master node I have the following:

127.0.0.1 localhost
127.0.1.1 rosalind
172.21.17.91 sn4622119034 (this is my worker node hostname)

But when I do ifconfig, the IP address for the host is 172.21.17.241. Do I need to change the IP address to 172.21.17.241 for rosalind?

If you are referring to the definition inside rosalind:/etc/hosts: no change is needed at this time.
I also think that inside /opt/cryosparc/cryosparc_master/config.sh,

  1. you may want to change (back) to
    export CRYOSPARC_MASTER_HOSTNAME="rosalind"
    
  2. remove or comment out the CRYOSPARC_HOSTNAME_CHECK line

and restart CryoSPARC.
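For example, after those edits the relevant lines in /opt/cryosparc/cryosparc_master/config.sh might look like this (commenting the line out is just one way to disable it):

export CRYOSPARC_MASTER_HOSTNAME="rosalind"
# export CRYOSPARC_HOSTNAME_CHECK="localhost"

# then apply the change:
cryosparcm restart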
On rosalind, as user haydin, please run the following commands and post their outputs:

hostname -f
id
nvidia-smi --query-gpu=index,name --format=csv
ssh 172.21.17.91 'hostname -f && id && nvidia-smi --query-gpu=index,name --format=csv'
ssh 172.21.17.91 'hostname && ls -al /home/haydin/Research/CryoEM/J104/'

Please ensure that password-less ssh access from rosalind to haydin@172.21.17.91 works (without a prompt for a password or host key confirmation).
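If password-less access is not yet configured, one common way to set it up, shown here only as an example (run as haydin on rosalind; the ed25519 key type is an assumption), is:

# generate a key pair if one does not already exist (accept the default path)
ssh-keygen -t ed25519

# install the public key on the worker (asks for the password one last time)
ssh-copy-id haydin@172.21.17.91

# verify: this should print the worker hostname without any password or host key prompt
ssh haydin@172.21.17.91 'hostname'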