Jobs are not running/python processes not seen in worker node

Hi Cryosparc Team,

Operating system: Red Hat. CryoSPARC version: 4.6.0. I have successfully installed the master on the head node and the worker on the worker node, and they connected successfully. Everything looks fine, but jobs are not running on the worker node. After running top on the worker node, I cannot see any Python process, which should be there if jobs were running successfully.

  1. I can see only the following two processes running on the worker node:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    74156 janesh 20 0 85708 8208 5444 R 5.9 0.0 0:00.02 top
    73935 janesh 20 0 52996 7600 5288 S 0.0 0.0 0:00.04 bash


  2. Output of get_scheduler_targets():

./cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: None, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 1, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 2, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 3, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}], ‘hostname’: ‘r04gn04’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘r04gn04’, ‘resource_fixed’: {‘SSD’: False}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}, ‘ssh_str’: ‘janesh@r04gn04’, ‘title’: ‘Worker node r04gn04’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: None, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 1, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 2, ‘mem’: 
84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 3, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}], ‘hostname’: ‘r05gn06’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘r05gn06’, ‘resource_fixed’: {‘SSD’: False}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}, ‘ssh_str’: ‘janesh@r05gn06’, ‘title’: ‘Worker node r05gn06’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw’}]

  3. Two worker nodes are registered (r04gn04 and r05gn06):
    cryosparc_worker]$ ./bin/cryosparcw gpulist
    Detected 4 CUDA devices.

    id pci-bus name

    0                 1  NVIDIA A100-SXM4-80GB
    1                65  NVIDIA A100-SXM4-80GB
    2               129  NVIDIA A100-SXM4-80GB
    3               193  NVIDIA A100-SXM4-80GB
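As a side note, when checking whether a job is actually running on a worker, filtering the full process list is often more reliable than scanning top, which truncates and sorts its display. A minimal sketch (the grep pattern is only an assumption about how the worker processes are named; adjust it to match your install path):

```shell
# Filter the worker's process list for CryoSPARC-related processes.
# The bracketed first letter prevents grep from matching its own command line.
ps -eo pid,user,%cpu,%mem,cmd | grep -i "[c]ryosparc"
```

If a job is genuinely running on the node, this should show one or more python processes launched from the cryosparc_worker directory.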
    

Please help!

Regards,
Aparna

Thanks @aparna for posting these details. Please can you post the outputs of these commands (run on the CryoSPARC master):

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with id of a job that should be running
uname -a
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm eventlog $csprojectid $csjobid | tail -n 40
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
cryosparcm cli "get_project_dir_abs('$csprojectid')"
ssh janesh@r04gn04 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"

Thank you for your response, CryoSPARC team!
Here are the answers to your questions:

  1. ./bin/cryosparcm joblog $csprojectid $csjobid | tail -n 40
    No output

  2. ./bin/cryosparcm eventlog $csprojectid $csjobid | tail -n 40

    [Wed, 30 Oct 2024 08:19:25 GMT]  License is valid.
    [Wed, 30 Oct 2024 08:19:25 GMT]  Launching job on lane default target r04gn04 ...
    [Wed, 30 Oct 2024 08:19:25 GMT]  Running job on remote worker node hostname r04gn04
    
  3. ./bin/cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"

    {'_id': '6720d233643cebdbfeb108ef', 'errors_run': [], 'instance_information': {}, 'job_type': 'extensive_workflow_bench', 'params_spec': {'compute_use_ssd': {'value': False}, 'dataset_data_dir': {'value': '/home/cryosparc/cryosparc_master/bin/empiar_10025_subset'}, 'resource_selection': {'value': ':r04gn04:0'}, 'run_advanced_jobs': {'value': True}}, 'project_uid': 'P3', 'status': 'launched', 'uid': 'J1', 'version': 'v4.6.0'}
    
  4. cryosparcm cli "get_project_dir_abs('$csprojectid')"
    /scratch/janesh/CS-test

  5. cryosparc_master]$ ssh janesh@r04gn04 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
    -rwxr-xr-x 1 janesh ccmb 14496 Sep 10 20:04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw
    
    /scratch/janesh/CS-test:
    total 20
    -rw-rw-r-- 1 janesh ccmb   88 Oct 29 17:46 cs.lock
    drwxrwxr-x 3 janesh ccmb 4096 Oct 30 13:49 J1
    -rw-rw-r-- 1 janesh ccmb   36 Oct 30 13:49 job_manifest.json
    -rw-rw-r-- 1 janesh ccmb  743 Oct 29 17:46 project.json
    -rw-rw-r-- 1 janesh ccmb  447 Oct 29 17:46 workspaces.json
    Linux r04gn04 4.18.0-425.3.1.el8.x86_64 #1 SMP Fri Sep 30 11:45:06 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
    
  6. ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
    error writing "stdout": broken pipe
        while executing
    "puts stdout {test 0 = 1;}"
        (procedure "renderFalse" line 19)
        invoked from within
    "renderFalse"
        invoked from within
    "if {[catch {
       # parse all command-line arguments before doing any action, no output is
       # made during argument parse to wait for potential paging ..."
        (file "/cm/local/apps/environment-modules/4.5.3/libexec/modulecmd.tcl" line 11097)
    

Regards,
Aparna

Thanks @aparna for posting these outputs.

Please can you also post the outputs of these commands (run on the CryoSPARC master computer):

uname -a
cryosparcm status | grep -v LICENSE
ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

There seems to be a problem connecting from the CryoSPARC master computer to the worker r05gn06. Have you tried whether running the command (on the CryoSPARC master computer)

ssh janesh@r05gn06

connects you to r05gn06 without any prompt for password or for a host key confirmation?
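If that ssh test prompts for a password or a host key confirmation, a hypothetical one-time setup for key-based access from the master might look like the following (the commands are standard OpenSSH; the username and hostname match this thread, so adjust as needed):

```shell
# Generate a key on the master if one does not exist yet (no passphrase,
# so it can be used non-interactively).
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Install the public key on the worker (prompts for the password one last time).
ssh-copy-id janesh@r05gn06

# Verify: with BatchMode, ssh fails instead of prompting, so success here
# means passwordless access works.
ssh -o BatchMode=yes janesh@r05gn06 true && echo "passwordless ssh OK"
```

The first ssh to each worker also records its host key, which removes the host key confirmation prompt on later connections.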