Jobs are not running / Python processes not seen on the worker node

Hi Cryosparc Team,

Operating system: Red Hat. CryoSPARC version: 4.6.0. I have successfully installed the master on the head node and the worker on the worker node, and they connected successfully. Everything looks fine, but jobs are not running on the worker node. After running top on the worker node I cannot see any Python process, which should be there if jobs were actually running.

  1. I can see only the two processes below running on the worker node (see the sketch after this list for a more targeted check):
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    74156 janesh 20 0 85708 8208 5444 R 5.9 0.0 0:00.02 top
    73935 janesh 20 0 52996 7600 5288 S 0.0 0.0 0:00.04 bash


  2. Output of "get_scheduler_targets()":

./cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}, {'id': 1, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}, {'id': 2, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}, {'id': 3, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}], 'hostname': 'r04gn04', 'lane': 'default', 'monitor_port': None, 'name': 'r04gn04', 'resource_fixed': {'SSD': False}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}, 'ssh_str': 'janesh@r04gn04', 'title': 'Worker node r04gn04', 'type': 'node', 'worker_bin_path': '/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}, {'id': 1, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}, {'id': 2, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}, {'id': 3, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}], 'hostname': 'r05gn06', 'lane': 'default', 'monitor_port': None, 'name': 'r05gn06', 'resource_fixed': {'SSD': False}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}, 'ssh_str': 'janesh@r05gn06', 'title': 'Worker node r05gn06', 'type': 'node', 'worker_bin_path': '/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw'}]

  3. Two worker nodes are registered (r04gn04 and r05gn06):
    cryosparc_worker]$ ./bin/cryosparcw gpulist
    Detected 4 CUDA devices.

    id pci-bus name

    0                 1  NVIDIA A100-SXM4-80GB
    1                65  NVIDIA A100-SXM4-80GB
    2               129  NVIDIA A100-SXM4-80GB
    3               193  NVIDIA A100-SXM4-80GB
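For a more targeted check than plain top, something like the following can be run on the worker node while a job is supposed to be active (a sketch; pgrep and ps are standard tools, and janesh is the worker-side user from this thread):

    pgrep -af cryosparc                    # lists any cryosparcw / worker python processes with their full command lines
    ps -u janesh -f | grep -i cryosparc    # same check, filtered by the CryoSPARC user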
    

Please help!

Regards,
Aparna

Thanks @aparna for posting these details. Please can you post the outputs of these commands (run on the CryoSPARC master):

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with id of a job that should be running
uname -a
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm eventlog $csprojectid $csjobid | tail -n 40
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
cryosparcm cli "get_project_dir_abs('$csprojectid')"
ssh janesh@r04gn04 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"

Thank you for your response, CryoSPARC team!
Here are the answers to your questions:

  1. ./bin/cryosparcm joblog $csprojectid $csjobid | tail -n 40
    No output

  2. ./bin/cryosparcm eventlog $csprojectid $csjobid | tail -n 40

    [Wed, 30 Oct 2024 08:19:25 GMT]  License is valid.
    [Wed, 30 Oct 2024 08:19:25 GMT]  Launching job on lane default target r04gn04 ...
    [Wed, 30 Oct 2024 08:19:25 GMT]  Running job on remote worker node hostname r04gn04
    
  3. ./bin/cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"

    {'_id': '6720d233643cebdbfeb108ef', 'errors_run': [], 'instance_information': {}, 'job_type': 'extensive_workflow_bench', 'params_spec': {'compute_use_ssd': {'value': False}, 'dataset_data_dir': {'value': '/home/cryosparc/cryosparc_master/bin/empiar_10025_subset'}, 'resource_selection': {'value': ':r04gn04:0'}, 'run_advanced_jobs': {'value': True}}, 'project_uid': 'P3', 'status': 'launched', 'uid': 'J1', 'version': 'v4.6.0'}
    
  4. cryosparcm cli "get_project_dir_abs('$csprojectid')"
    /scratch/janesh/CS-test

  5. cryosparc_master]$ ssh janesh@r04gn04 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
    -rwxr-xr-x 1 janesh ccmb 14496 Sep 10 20:04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw
    
    /scratch/janesh/CS-test:
    total 20
    -rw-rw-r-- 1 janesh ccmb   88 Oct 29 17:46 cs.lock
    drwxrwxr-x 3 janesh ccmb 4096 Oct 30 13:49 J1
    -rw-rw-r-- 1 janesh ccmb   36 Oct 30 13:49 job_manifest.json
    -rw-rw-r-- 1 janesh ccmb  743 Oct 29 17:46 project.json
    -rw-rw-r-- 1 janesh ccmb  447 Oct 29 17:46 workspaces.json
    Linux r04gn04 4.18.0-425.3.1.el8.x86_64 #1 SMP Fri Sep 30 11:45:06 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
    
  6. ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
    error writing "stdout": broken pipe
        while executing
    "puts stdout {test 0 = 1;}"
        (procedure "renderFalse" line 19)
        invoked from within
    "renderFalse"
        invoked from within
    "if {[catch {
       # parse all command-line arguments before doing any action, no output is
       # made during argument parse to wait for potential paging ..."
        (file "/cm/local/apps/environment-modules/4.5.3/libexec/modulecmd.tcl" line 11097)
    

Regards,
Aparna

Thanks @aparna for posting these outputs.

Please can you also post the outputs of these commands (run on the CryoSPARC master computer)

uname -a
cryosparcm status | grep -v LICENSE
ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

There seems to be a problem connecting from the CryoSPARC master computer to the worker r05gn06. Have you tried whether running the command (on the CryoSPARC master computer)

ssh janesh@r05gn06

connects you to r05gn06 without any prompt for password or for a host key confirmation?
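One way to test this non-interactively (a sketch; BatchMode=yes makes ssh fail with an error instead of prompting):

    # run on the CryoSPARC master computer
    ssh -o BatchMode=yes janesh@r05gn06 'echo connection OK && hostname'

If key-based login is set up correctly, this prints "connection OK" and the worker hostname; otherwise it fails immediately rather than hanging at a password or host-key prompt.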

Hi Cryosparc Team,

Thanks for your responses, and sorry for the delay on my side!
Here are the responses to your queries:

  1. ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"

-rwxr-xr-x 1 janesh ccmb 14496 Sep 10 20:04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw

/scratch/janesh/CS-test:
total 20
-rw-rw-r-- 1 janesh ccmb 88 Oct 29 17:46 cs.lock
drwxrwxr-x 3 janesh ccmb 4096 Oct 30 13:49 J1
-rw-rw-r-- 1 janesh ccmb 36 Oct 30 13:49 job_manifest.json
-rw-rw-r-- 1 janesh ccmb 743 Oct 29 17:46 project.json
-rw-rw-r-- 1 janesh ccmb 447 Oct 29 17:46 workspaces.json
Linux r05gn06 4.18.0-425.3.1.el8.x86_64 #1 SMP Fri Sep 30 11:45:06 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

I am not sure why I got that error the other day.

  2. cryosparcm status | grep -v LICENSE
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/janesh/cryosparc/cryosparc_master
Current cryoSPARC version: v4.6.0
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 1947047, uptime 5 days, 21:23:56
app_api                          RUNNING   pid 1947102, uptime 5 days, 21:23:55
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 1945950, uptime 5 days, 21:24:23
command_rtp                      RUNNING   pid 1946275, uptime 5 days, 21:24:12
command_vis                      RUNNING   pid 1946210, uptime 5 days, 21:24:14
database                         RUNNING   pid 1945762, uptime 5 days, 21:24:26

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_MASTER_HOSTNAME="clustername"
export CRYOSPARC_DB_PATH="/home/janesh/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=45000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
export NO_PROXY="${CRYOSPARC_MASTER_HOSTNAME},localhost,127.0.0.1"

  3. [janesh@champ2 ~] ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
     [janesh@champ2 ~] ssh janesh@r05gn06 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

     [janesh@champ2 ~] ssh janesh@r04gn04
     Register this system with Red Hat Insights: insights-client --register
     Create an account or view all your systems at https://red.ht/insights-dashboard
     Last login: Fri Nov 1 10:35:20 2024 from 10.20.5.253
     [janesh@r04gn04 ~] Connection to r04gn04 closed.

     [janesh@champ2 ~] ssh janesh@r05gn06
     Register this system with Red Hat Insights: insights-client --register
     Create an account or view all your systems at https://red.ht/insights-dashboard
     Last login: Wed Oct 30 10:40:34 2024 from 10.20.5.253
     [janesh@r05gn06 ~] Connection to r05gn06 closed.

Note: direct SSH to the nodes is not allowed unless a job is running there.

  4. ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
    This command returned no output, since SSH is not allowed without a job running, so I acquired the node and ran the command there. The output is:
    [janesh@r05gn06 bin]$ ./cryosparcw gpulist
    Detected 4 CUDA devices.

    id pci-bus name

    0                 1  NVIDIA A100-SXM4-80GB
    1                65  NVIDIA A100-SXM4-80GB
    2               129  NVIDIA A100-SXM4-80GB
    3               193  NVIDIA A100-SXM4-80GB
    

Regards,
Aparna

This restriction is incompatible with your current scheduler target configuration.

How did you "acquire" the node? The answer may suggest a suitable reconfiguration of your scheduler targets.
If you use a workload manager like Slurm or Grid Engine to "acquire" nodes, you may want to consider a cluster-type configuration of CryoSPARC workers.
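For reference, a cluster-type target is described by two files, cluster_info.json and cluster_script.sh, placed together in a directory of your choice on the master host and registered by running cryosparcm cluster connect from that directory. Below is a minimal sketch of cluster_info.json for a PBS setup, reusing the worker path from this thread; the target name and the qsub/qstat/qdel/qinfo flags are assumptions to adapt to your site, and cache_path is left empty because the current targets were registered without SSD caching. If the master host can run qsub itself, send_cmd_tpl can be just {{ command }}; an ssh prefix is only needed when submission has to go through a separate login node.

    {
        "name": "pbs-a100",
        "worker_bin_path": "/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw",
        "cache_path": "",
        "send_cmd_tpl": "{{ command }}",
        "qsub_cmd_tpl": "qsub {{ script_path_abs }}",
        "qstat_cmd_tpl": "qstat -as {{ cluster_job_id }}",
        "qdel_cmd_tpl": "qdel {{ cluster_job_id }}",
        "qinfo_cmd_tpl": "qstat -q",
        "cache_reserve_mb": 10000,
        "cache_quota_mb": null
    }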

  1. I acquired the node in interactive mode through the scheduler, using qsub.

  2. So I need help with the cluster-type configuration:
    - Where should the cluster_info.json file be placed, and how should the variables below be set in my case?

    "send_cmd_tpl" : "ssh loginnode {{ command }}",  // here, should this be ssh r04gn04 (or similar, since we have two worker GPU nodes), only {{ command }}, or ssh champ2 {{ command }}?
    "qsub_cmd_tpl" : "qsub {{ script_path_abs }}",  // here, should qsub and the path of cluster_script.sh be written out inside the {{ }}, or should it be kept as is?

  3. cluster_script.sh (see the sketch after this list):
    #!/bin/bash

    #PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ (ram_gb*1000)|int }}mb:gputype=P100  // here, should all the variables inside {{ }} be specified as numbers, or kept as they are (i.e. no changes required)?
    #PBS -o {{ job_dir_abs }}/cluster.out  // here, should the {{ }} be removed and an output filename with path be given instead?

  4. Do I need to connect the nodes again? The above steps will add a new lane, and previously the nodes/workers were registered in the default lane.

We are using the PBS scheduler.

  5. Also, we have more than two GPU nodes with 4 GPU cards each. Will jobs go only to the two nodes/workers registered with the master, or can they go to any available GPU node?
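Regarding point 3 above, a minimal cluster_script.sh sketch based on the stock PBS template is shown below. The {{ ... }} placeholders are left exactly as written; CryoSPARC substitutes the real numbers and paths for each job when it writes the submission script, so only site-specific parts (for example a gputype or queue selector; these nodes carry A100 rather than P100 cards) would be edited by hand. This is a sketch, not a drop-in file.

    #!/usr/bin/env bash
    ## PBS submission template sketch; the {{ ... }} variables are filled in
    ## by CryoSPARC per job, so they stay as placeholders in this file.
    #PBS -N cryosparc_{{ project_uid }}_{{ job_uid }}
    #PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ (ram_gb*1000)|int }}mb
    #PBS -o {{ job_dir_abs }}/cluster.out
    #PBS -e {{ job_dir_abs }}/cluster.err

    {{ run_cmd }}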

Regards,
Aparna

@aparna Before moving on to the worker configuration, please can you describe your CryoSPARC master setup and how the CryoSPARC master host is related to the cluster, including, but not limited to:

  1. Do the CryoSPARC master processes run on a "permanently" assigned host, not as a PBS job?
  2. Is your CryoSPARC master host "authorized" to qsub jobs to the cluster? (A quick test is sketched below.)
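One way to check point 2 without waiting for CryoSPARC (a sketch, assuming the PBS client tools are installed on the master host; /tmp/pbs_hello.sh is a throwaway path):

    # run on the CryoSPARC master host as the CryoSPARC user
    printf '#!/bin/bash\n#PBS -l select=1:ncpus=1\nhostname\n' > /tmp/pbs_hello.sh
    qsub /tmp/pbs_hello.sh    # prints a PBS job id if submission is permitted
    qstat -u "$USER"          # the test job should appear here briefly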

Hi CryoSPARC Team,

1. Here there is a common High Performance Computing cluster with one master node and many compute nodes (CPU and GPU). I have installed CryoSPARC in the home directory of the user who needs this software. Hence the CryoSPARC master processes are running on the cluster's master node.

Right now this software has not been integrated with PBS, and hence it cannot submit jobs, as direct SSH to the nodes is not possible.

  2. I registered/connected two GPU nodes as worker nodes to the CryoSPARC master. Both the master and the workers are installed in the user's home, which is common storage available to the master node and all the compute nodes.
    "Is your CryoSPARC master host "authorized" to qsub jobs to the cluster?" I am not sure if I understood this correctly, but yes: since the CryoSPARC master is running on the master node from which we submit jobs through qsub, I assume the CryoSPARC master host is "authorized" to qsub jobs to the cluster.


Regards,
Aparna