Jobs are not running / Python processes not seen on worker node

Hi Cryosparc Team,

Operating system: Red Hat. CryoSPARC version: 4.6.0. I have successfully installed the master on the head node and the worker on the worker node, they connected successfully, and everything looks fine, but jobs are not running on the worker node. After running top on the worker node, I can't see any Python process, which should be there if jobs were running successfully.

  1. I can see only the two processes below running on the worker node:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    74156 janesh 20 0 85708 8208 5444 R 5.9 0.0 0:00.02 top
    73935 janesh 20 0 52996 7600 5288 S 0.0 0.0 0:00.04 bash

  2. Output of get_scheduler_targets():

./cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: None, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 1, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 2, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 3, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}], ‘hostname’: ‘r04gn04’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘r04gn04’, ‘resource_fixed’: {‘SSD’: False}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}, ‘ssh_str’: ‘janesh@r04gn04’, ‘title’: ‘Worker node r04gn04’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: None, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 1, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 2, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}, {‘id’: 3, ‘mem’: 84987740160, ‘name’: ‘NVIDIA A100-SXM4-80GB’}], ‘hostname’: ‘r05gn06’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘r05gn06’, ‘resource_fixed’: {‘SSD’: False}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}, ‘ssh_str’: ‘janesh@r05gn06’, ‘title’: ‘Worker node r05gn06’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw’}]

  3. Two worker nodes are registered (r04gn04 and r05gn06):
    cryosparc_worker]$ ./bin/cryosparcw gpulist
    Detected 4 CUDA devices.

    id pci-bus name

    0                 1  NVIDIA A100-SXM4-80GB
    1                65  NVIDIA A100-SXM4-80GB
    2               129  NVIDIA A100-SXM4-80GB
    3               193  NVIDIA A100-SXM4-80GB
    

Please help!

Regards,
Aparna

Thanks @aparna for posting these details. Please can you post the outputs of these commands (run on the CryoSPARC master):

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with id of a job that should be running
uname -a
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm eventlog $csprojectid $csjobid | tail -n 40
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
cryosparcm cli "get_project_dir_abs('$csprojectid')"
ssh janesh@r04gn04 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"

Thank you for your response, CryoSPARC team!
Here are the answers to your questions:

  1. ./bin/cryosparcm joblog $csprojectid $csjobid | tail -n 40
    No output

  2. ./bin/cryosparcm eventlog $csprojectid $csjobid | tail -n 40

    [Wed, 30 Oct 2024 08:19:25 GMT]  License is valid.
    [Wed, 30 Oct 2024 08:19:25 GMT]  Launching job on lane default target r04gn04 ...
    [Wed, 30 Oct 2024 08:19:25 GMT]  Running job on remote worker node hostname r04gn04
    
  3. ./bin/cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"

    {'_id': '6720d233643cebdbfeb108ef', 'errors_run': [], 'instance_information': {}, 'job_type': 'extensive_workflow_bench', 'params_spec': {'compute_use_ssd': {'value': False}, 'dataset_data_dir': {'value': '/home/cryosparc/cryosparc_master/bin/empiar_10025_subset'}, 'resource_selection': {'value': ':r04gn04:0'}, 'run_advanced_jobs': {'value': True}}, 'project_uid': 'P3', 'status': 'launched', 'uid': 'J1', 'version': 'v4.6.0'}
    
  4. cryosparcm cli "get_project_dir_abs('$csprojectid')"
    /scratch/janesh/CS-test

  5. cryosparc_master]$ ssh janesh@r04gn04 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
    -rwxr-xr-x 1 janesh ccmb 14496 Sep 10 20:04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw
    
    /scratch/janesh/CS-test:
    total 20
    -rw-rw-r-- 1 janesh ccmb   88 Oct 29 17:46 cs.lock
    drwxrwxr-x 3 janesh ccmb 4096 Oct 30 13:49 J1
    -rw-rw-r-- 1 janesh ccmb   36 Oct 30 13:49 job_manifest.json
    -rw-rw-r-- 1 janesh ccmb  743 Oct 29 17:46 project.json
    -rw-rw-r-- 1 janesh ccmb  447 Oct 29 17:46 workspaces.json
    Linux r04gn04 4.18.0-425.3.1.el8.x86_64 #1 SMP Fri Sep 30 11:45:06 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
    
  6. ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"
    error writing "stdout": broken pipe
        while executing
    "puts stdout {test 0 = 1;}"
        (procedure "renderFalse" line 19)
        invoked from within
    "renderFalse"
        invoked from within
    "if {[catch {
       # parse all command-line arguments before doing any action, no output is
       # made during argument parse to wait for potential paging ..."
        (file "/cm/local/apps/environment-modules/4.5.3/libexec/modulecmd.tcl" line 11097)
    

Regards,
Aparna

Thanks @aparna for posting these outputs.

Please can you also post the outputs of these commands (run on the CryoSPARC master computer)

uname -a
cryosparcm status | grep -v LICENSE
ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

There seems to be a problem connecting from the CryoSPARC master computer to the worker r05gn06. Have you checked whether running the command (on the CryoSPARC master computer)

ssh janesh@r05gn06

connects you to r05gn06 without any prompt for password or for a host key confirmation?
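
If that does prompt for a password, one common way (hypothetical for your site; assumes key-based logins are permitted) to set up password-less SSH from the master to the workers is:

# on the CryoSPARC master, as user janesh
ssh-keygen -t ed25519              # accept the defaults; leave the passphrase empty or use an ssh-agent
ssh-copy-id janesh@r04gn04         # repeat for janesh@r05gn06
ssh janesh@r04gn04 hostname        # should print r04gn04 with no password or host-key prompt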

Hi Cryosparc Team,

Thanks for your responses! Sorry for the delay on my side.
Here are the responses to your queries:

  1. ssh janesh@r05gn06 "ls -l $(cryosparcm cli "get_project_dir_abs('$csprojectid')") /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw && uname -a"

-rwxr-xr-x 1 janesh ccmb 14496 Sep 10 20:04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw

/scratch/janesh/CS-test:
total 20
-rw-rw-r-- 1 janesh ccmb 88 Oct 29 17:46 cs.lock
drwxrwxr-x 3 janesh ccmb 4096 Oct 30 13:49 J1
-rw-rw-r-- 1 janesh ccmb 36 Oct 30 13:49 job_manifest.json
-rw-rw-r-- 1 janesh ccmb 743 Oct 29 17:46 project.json
-rw-rw-r-- 1 janesh ccmb 447 Oct 29 17:46 workspaces.json
Linux r05gn06 4.18.0-425.3.1.el8.x86_64 #1 SMP Fri Sep 30 11:45:06 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

I am not sure why I got an error here the other day.

  2. cryosparcm status | grep -v LICENSE
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/janesh/cryosparc/cryosparc_master
Current cryoSPARC version: v4.6.0
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 1947047, uptime 5 days, 21:23:56
app_api                          RUNNING   pid 1947102, uptime 5 days, 21:23:55
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 1945950, uptime 5 days, 21:24:23
command_rtp                      RUNNING   pid 1946275, uptime 5 days, 21:24:12
command_vis                      RUNNING   pid 1946210, uptime 5 days, 21:24:14
database                         RUNNING   pid 1945762, uptime 5 days, 21:24:26

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_MASTER_HOSTNAME="clustername"
export CRYOSPARC_DB_PATH="/home/janesh/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=45000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
export NO_PROXY="${CRYOSPARC_MASTER_HOSTNAME},localhost,127.0.0.1"

  3. [janesh@champ2 ~] ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
     [janesh@champ2 ~] ssh janesh@r05gn06 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

     [janesh@champ2 ~] ssh janesh@r04gn04
     Register this system with Red Hat Insights: insights-client --register
     Create an account or view all your systems at https://red.ht/insights-dashboard
     Last login: Fri Nov 1 10:35:20 2024 from 10.20.5.253
     [janesh@r04gn04 ~] Connection to r04gn04 closed.

     [janesh@champ2 ~] ssh janesh@r05gn06
     Register this system with Red Hat Insights: insights-client --register
     Create an account or view all your systems at https://red.ht/insights-dashboard
     Last login: Wed Oct 30 10:40:34 2024 from 10.20.5.253
     [janesh@r05gn06 ~] Connection to r05gn06 closed.

Note: direct SSH to the nodes is not allowed unless jobs are running there.

  4. ssh janesh@r04gn04 /home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
    This command returned no output, since SSH is not allowed without a job running, so I acquired the node and executed the command there; the output is:
    [janesh@r05gn06 bin]$ ./cryosparcw gpulist
    Detected 4 CUDA devices.

    id pci-bus name

    0                 1  NVIDIA A100-SXM4-80GB
    1                65  NVIDIA A100-SXM4-80GB
    2               129  NVIDIA A100-SXM4-80GB
    3               193  NVIDIA A100-SXM4-80GB
    

Regards,
Aparna

This restriction is incompatible with your current scheduler target configuration.

How did you "acquire" the node? The answer may suggest a suitable reconfiguration of your scheduler targets.
If you use a workload manager like Slurm or Grid Engine to "acquire" nodes, you may consider a cluster-type configuration of CryoSPARC workers.

  1. I acquired the node in interactive mode through the scheduler, using qsub.

  2. So I need help with the cluster-type configuration:
    - Where should the cluster_info.json file be placed, and how should the variables below be set in my case?

    "send_cmd_tpl" : "ssh loginnode {{ command }}",   // here, should it be ssh r04gn04 (or something else, as we have two worker GPU nodes), only {{ command }}, or ssh champ2 {{ command }}?
    "qsub_cmd_tpl" : "qsub {{ script_path_abs }}",    // here, should qsub and the path of cluster_script.sh be given inside the {{ }}, or should it be kept as is?

  3. cluster_script.sh:
    #!/bin/bash

    #PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ (ram_gb*1000)|int }}mb:gputype=P100   // here, should all the variables inside {{ }} be specified as numbers, or kept as is (no changes required)?
    #PBS -o {{ job_dir_abs }}/cluster.out   // here, by removing the {{ }}, should an output filename with path be given?

  4. Do I need to connect the nodes again? The above steps will add a new lane, while previously the nodes/workers were registered using the default lane.

We are using the PBS scheduler.

  5. Also, we have more than two GPU nodes with four GPU cards each. Will jobs go only to the two nodes/workers registered with the master, or can they go to any available GPU node?

Regards,
Aparna

@aparna Before moving on to the worker configuration, please can you describe your CryoSPARC master setup and how the CryoSPARC master host is related to the cluster, including, but not limited to:

  1. Do the CryoSPARC master processes run on a "permanently" assigned host, not as a PBS job?
  2. Is your CryoSPARC master host "authorized" to qsub jobs to the cluster?

Hi CryoSPARC Team,

1. Here there is a common high-performance cluster with one master node and many compute nodes (CPU and GPU). I have installed CryoSPARC in the home directory of the user who needs this software. Hence the CryoSPARC master processes are running on the cluster master node.

Right now this software has not been integrated with PBS, and hence it is unable to submit jobs, as direct SSH to the nodes is not possible.

  2. I registered/connected two GPU nodes as worker nodes to the CryoSPARC master. Both master and worker are installed in the user's home directory, which is common storage available to the master node and all the compute nodes.
    "Is your CryoSPARC master host 'authorized' to qsub jobs to the cluster?" I am not sure if I understood this correctly, but yes: since the CryoSPARC master is running on the master node from which we submit jobs through qsub, I assume the CryoSPARC master host is "authorized" to qsub jobs to the cluster.

Regards,
Aparna

Based on your description, I think this assumption is correct.
Moreover, a cluster-type (via cryosparcm cluster connect), as opposed to a node-type (via cryosparcw connect ...), CryoSPARC lane may be more suitable under the circumstances. Please ensure that storage for

  • cryosparc_worker/ software
  • data importable to CryoSPARC, such as cryoem movies
  • CryoSPARC project directories

is shared between the CryoSPARC master host and the cluster compute nodes under consistent absolute paths.
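
One quick way to sanity-check this (paths taken from the outputs above) is to run the same listing on the CryoSPARC master and inside a short interactive PBS session on a compute node, and confirm that both show identical contents:

# run once on the master and once on a compute node obtained via qsub -I;
# the same absolute paths must resolve to the same files on both
ls -ld /home/janesh/cryosparc/cryosparc_worker /scratch/janesh/CS-test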

Suitably edited cluster_info.json and cluster_script.sh files need to be present in the current working directory when and where the cryosparcm cluster connect command is run. You may choose the current working directory according to your needs. It may be a temporary directory, because after cryosparcm cluster connect has been run, the cluster_info.json and cluster_script.sh files are no longer needed.
The cryosparcm cluster connect command will create a cluster lane in CryoSPARC.
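
To make this concrete, here is a minimal sketch of the two files for a PBS site like yours. It is illustrative only: the lane name pbs-gpu, the empty cache_path (no SSD cache), and the exact qstat/qdel options are assumptions you should adapt. Because your CryoSPARC master already runs on the qsub submission host, send_cmd_tpl can probably pass the command straight through; the {{ ... }} placeholders are filled in by CryoSPARC at submission time.

cluster_info.json:

{
    "name" : "pbs-gpu",
    "worker_bin_path" : "/home/janesh/cryosparc/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "qsub {{ script_path_abs }}",
    "qstat_cmd_tpl" : "qstat {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "qdel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "qstat -q"
}

cluster_script.sh:

#!/bin/bash
# CryoSPARC fills in the {{ ... }} values per job before submitting with qsub
#PBS -N cryosparc_{{ project_uid }}_{{ job_uid }}
#PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ (ram_gb*1000)|int }}mb
#PBS -o {{ job_dir_abs }}/cluster.out
#PBS -e {{ job_dir_abs }}/cluster.err

{{ run_cmd }}

With both files in the current working directory, cryosparcm cluster connect registers (or updates) the lane.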

After creating the cluster lane, if the default lane only contains PBS cluster nodes, you may want to remove the default lane with the command

cryosparcm cli "remove_scheduler_lane('default')"

Dear Cryosparc Team,

I have successfully set up the cluster lane, and a job goes to the running state using the cluster lane but then goes back to the 'H' state. Also, it is not going to the GPU queue/node but to the CPU queue. It is showing select=1, ncpus=1, ngpus=0. Please check the line below from my cluster_script:

#PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ (ram_gb*1000)|int }}mb

We have NVIDIA A100 GPUs, but I skipped the gputype field in the line above because I was getting the error 'qsub: Unknown resource: gputype' during validation when I gave gputype=A100 or P100.

How can I make it go to a GPU node, and with more cores and multiple GPU cards on that node?

Regards,
Aparna

num_cpu and num_gpu are determined by the job type. For which job type did you observe select=1, ncpus=1, ngpus=0?

Please confirm with the admins of your cluster whether selection of specific GPUs is supported on your cluster and how to implement such a selection in your cluster scripts.
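
For illustration only (the queue name gpu is a guess; your site may use a different queue name or a different GPU resource syntax entirely), routing jobs to a GPU queue under PBS often amounts to adding a queue directive to cluster_script.sh rather than a gputype resource:

#PBS -q gpu
#PBS -l select=1:ncpus={{ num_cpu }}:ngpus={{ num_gpu }}:mem={{ (ram_gb*1000)|int }}mb

If you change the script, re-running cryosparcm cluster connect with the same lane name, from the directory containing the updated files, should update the lane.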

Dear Cryosparc Team,

Thanks for the reply!

So you mean that CPU-node or GPU-node assignment is decided automatically by the scheduler/script, based on the job type?

Now jobs are running but failing. It looks like a proxy issue. I have put the proxy details in my bash file and the following line in the CryoSPARC config file: export NO_PROXY="${CRYOSPARC_MASTER_HOSTNAME},localhost,127.0.0.1"
However,
I am able to access CryoSPARC at localhost:port but not at champ2.cm.cluster:port or champ2:port;
with champ2:port I am getting an "Unable to forward this request at this time" error.

Output of cryosparcm joblog $csprojectid $csjobid | tail -n 40:

cryosparc_tools.cryosparc.errors.CommandError: *** (http://champ2.cm.cluster:45002/api, code 500) HTTP Error 500 Internal Server Error; please check cryosparcm log command_core for additional information.
Response from server: b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html><head>\n<meta type="copyright" content="Copyright (C) 1996-2022 The Squid Software Foundation and contributors">\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type="text/css"><!-- \n /*\n * Copyright (C) 1996-2022 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ license and includes\n * contributions from numerous individuals and organizations.\n * Please see the COPYING and CONTRIBUTORS files for details.\n */\n\n/*\n Stylesheet for Squid Error pages\n Adapted from design by Free CSS Templates\n http://www.freecsstemplates.org\n Released for free under a Creative Commons Attribution 2.5 License\n*/\n\n/* Page basics */\n* {\n\tfont-family: verdana, sans-serif;\n}\n\nhtml body {\n\tmargin: 0;\n\tpadding: 0;\n\tbackground: #efefef;\n\tfont-size: 12px;\n\tcolor: #1e1e1e;\n}\n\n/* Page displayed title area */\n#titles {\n\tmargin-left: 15px;\n\tpadding: 10px;\n\tpadding-left: 100px;\n\tbackground: url(\'/squid-internal-static/icons/SN.png\') no-repeat left;\n}\n\n/* initial title */\n#titles h1 {\n\tcolor: #000000;\n}\n#titles h2 {\n\tcolor: #000000;\n}\n\n/* special event: FTP success page titles */\n#titles ftpsuccess {\n\tbackground-color:#00ff00;\n\twidth:100%;\n}\n\n/* Page displayed body content area */\n#content {\n\tpadding: 10px;\n\tbackground: #ffffff;\n}\n\n/* General text */\np {\n}\n\n/* error brief description */\n#error p {\n}\n\n/* some data which may have caused the problem */\n#data {\n}\n\n/* the error message received from the system or other software */\n#sysmsg {\n}\n\npre {\n}\n\n/* special event: FTP directory listing */\n#dirmsg {\n    font-family: courier, monospace;\n    color: black;\n    font-size: 10pt;\n}\n#dirlisting {\n    margin-left: 2%;\n    margin-right: 2%;\n}\n#dirlisting tr.entry td.icon,td.filename,td.size,td.date {\n    border-bottom: groove;\n}\n#dirlisting td.size {\n    width: 50px;\n    text-align: right;\n    padding-right: 5px;\n}\n\n/* horizontal lines */\nhr {\n\tmargin: 0;\n}\n\n/* page displayed footer area */\n#footer {\n\tfont-size: 9px;\n\tpadding-left: 10px;\n}\n\n\nbody\n:lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; }\n:lang(he) { direction: rtl; }\n --></style>\n</head><body id=ERR_CANNOT_FORWARD>\n<div id="titles">\n<h1>ERROR</h1>\n<h2>The requested URL could not be retrieved</h2>\n</div>\n<hr>\n\n<div id="content">\n<p>The following error was encountered while trying to retrieve the URL: <a href="http://champ2.cm.cluster:45002/api">http://champ2.cm.cluster:45002/api</a></p>\n\n<blockquote id="error">\n<p><b>Unable to forward this request at this time.</b></p>\n</blockquote>\n\n<p>This request could not be forwarded to the origin server or to any parent caches.</p>\n\n<p>Some possible problems are:</p>\n<ul>\n<li id="network-down">An Internet connection needed to access this domains origin servers may be down.</li>\n<li id="no-peer">All configured parent caches may be currently unreachable.</li>\n<li id="permission-denied">The administrator may not allow this cache to make direct connections to origin servers.</li>\n</ul>\n\n<p>Your cache administrator is <a 
href="mailto:root?subject=CacheErrorInfo%20-%20ERR_CANNOT_FORWARD&amp;body=CacheHost%3A%20nknproxy.cmmacs.ernet.in%0D%0AErrPage%3A%20ERR_CANNOT_FORWARD%0D%0AErr%3A%20%5Bnone%5D%0D%0ATimeStamp%3A%20Wed,%2013%20Nov%202024%2007%3A26%3A10%20GMT%0D%0A%0D%0AClientIP%3A%20192.168.103.41%0D%0A%0D%0AHTTP%20Request%3A%0D%0APOST%20%2Fapi%20HTTP%2F1.1%0AAccept-Encoding%3A%20identity%0D%0AContent-Length%3A%20107%0D%0AUser-Agent%3A%20Python-urllib%2F3.10%0D%0AOriginator%3A%20client%0D%0ALicense-Id%3A%20466c270c-7f2c-11ef-86e6-1bfceec8334d%0D%0AContent-Type%3A%20application%2Fjson%0D%0AConnection%3A%20close%0D%0AHost%3A%20champ2.cm.cluster%3A45002%0D%0A%0D%0A%0D%0A">root</a>.</p>\n\n<br>\n</div>\n\n<hr>\n<div id="footer">\n<p>Generated Wed, 13 Nov 2024 07:26:10 GMT by nknproxy.cmmacs.ernet.in (squid/5.5)</p>\n<!-- ERR_CANNOT_FORWARD -->\n</div>\n</body></html>\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://champ2.cm.cluster:45002/api, code 500) HTTP Error 500 Internal Server Error; please check cryosparcm log command_core for additional information.
Response from server: b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html><head>\n<meta type="copyright" content="Copyright (C) 1996-2022 The Squid Software Foundation and contributors">\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type="text/css"><!-- \n /*\n * Copyright (C) 1996-2022 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ license and includes\n * contributions from numerous individuals and organizations.\n * Please see the COPYING and CONTRIBUTORS files for details.\n */\n\n/*\n Stylesheet for Squid Error pages\n Adapted from design by Free CSS Templates\n http://www.freecsstemplates.org\n Released for free under a Creative Commons Attribution 2.5 License\n*/\n\n/* Page basics */\n* {\n\tfont-family: verdana, sans-serif;\n}\n\nhtml body {\n\tmargin: 0;\n\tpadding: 0;\n\tbackground: #efefef;\n\tfont-size: 12px;\n\tcolor: #1e1e1e;\n}\n\n/* Page displayed title area */\n#titles {\n\tmargin-left: 15px;\n\tpadding: 10px;\n\tpadding-left: 100px;\n\tbackground: url(\'/squid-internal-static/icons/SN.png\') no-repeat left;\n}\n\n/* initial title */\n#titles h1 {\n\tcolor: #000000;\n}\n#titles h2 {\n\tcolor: #000000;\n}\n\n/* special event: FTP success page titles */\n#titles ftpsuccess {\n\tbackground-color:#00ff00;\n\twidth:100%;\n}\n\n/* Page displayed body content area */\n#content {\n\tpadding: 10px;\n\tbackground: #ffffff;\n}\n\n/* General text */\np {\n}\n\n/* error brief description */\n#error p {\n}\n\n/* some data which may have caused the problem */\n#data {\n}\n\n/* the error message received from the system or other software */\n#sysmsg {\n}\n\npre {\n}\n\n/* special event: FTP directory listing */\n#dirmsg {\n    font-family: courier, monospace;\n    color: black;\n    font-size: 10pt;\n}\n#dirlisting {\n    margin-left: 2%;\n    margin-right: 2%;\n}\n#dirlisting tr.entry td.icon,td.filename,td.size,td.date {\n    border-bottom: groove;\n}\n#dirlisting td.size {\n    width: 50px;\n    text-align: right;\n    padding-right: 5px;\n}\n\n/* horizontal lines */\nhr {\n\tmargin: 0;\n}\n\n/* page displayed footer area */\n#footer {\n\tfont-size: 9px;\n\tpadding-left: 10px;\n}\n\n\nbody\n:lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; }\n:lang(he) { direction: rtl; }\n --></style>\n</head><body id=ERR_CANNOT_FORWARD>\n<div id="titles">\n<h1>ERROR</h1>\n<h2>The requested URL could not be retrieved</h2>\n</div>\n<hr>\n\n<div id="content">\n<p>The following error was encountered while trying to retrieve the URL: <a href="http://champ2.cm.cluster:45002/api">http://champ2.cm.cluster:45002/api</a></p>\n\n<blockquote id="error">\n<p><b>Unable to forward this request at this time.</b></p>\n</blockquote>\n\n<p>This request could not be forwarded to the origin server or to any parent caches.</p>\n\n<p>Some possible problems are:</p>\n<ul>\n<li id="network-down">An Internet connection needed to access this domains origin servers may be down.</li>\n<li id="no-peer">All configured parent caches may be currently unreachable.</li>\n<li id="permission-denied">The administrator may not allow this cache to make direct connections to origin servers.</li>\n</ul>\n\n<p>Your cache administrator is <a 
href="mailto:root?subject=CacheErrorInfo%20-%20ERR_CANNOT_FORWARD&amp;body=CacheHost%3A%20nknproxy.cmmacs.ernet.in%0D%0AErrPage%3A%20ERR_CANNOT_FORWARD%0D%0AErr%3A%20%5Bnone%5D%0D%0ATimeStamp%3A%20Wed,%2013%20Nov%202024%2007%3A26%3A10%20GMT%0D%0A%0D%0AClientIP%3A%20192.168.103.41%0D%0A%0D%0AHTTP%20Request%3A%0D%0APOST%20%2Fapi%20HTTP%2F1.1%0AAccept-Encoding%3A%20identity%0D%0AContent-Length%3A%20107%0D%0AUser-Agent%3A%20Python-urllib%2F3.10%0D%0AOriginator%3A%20client%0D%0ALicense-Id%3A%20466c270c-7f2c-11ef-86e6-1bfceec8334d%0D%0AContent-Type%3A%20application%2Fjson%0D%0AConnection%3A%20close%0D%0AHost%3A%20champ2.cm.cluster%3A45002%0D%0A%0D%0A%0D%0A">root</a>.</p>\n\n<br>\n</div>\n\n<hr>\n<div id="footer">\n<p>Generated Wed, 13 Nov 2024 07:26:10 GMT by nknproxy.cmmacs.ernet.in (squid/5.5)</p>\n<!-- ERR_CANNOT_FORWARD -->\n</div>\n</body></html>\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 188, in cryosparc_master.cryosparc_compute.run.run
  File "cryosparc_master/cryosparc_compute/run.py", line 242, in cryosparc_master.cryosparc_compute.run.run
  File "cryosparc_master/cryosparc_compute/run.py", line 38, in cryosparc_master.cryosparc_compute.run.main
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 131, in connect
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 131, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_compute/client.py", line 38, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 97, in __init__
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_compute/client.py", line 38, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
    self._reload()  # attempt connection immediately to gather methods
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 97, in __init__
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 135, in _reload
    self._reload()  # attempt connection immediately to gather methods
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 135, in _reload
    system = self._get_callable("system.describe")()
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 108, in func
    system = self._get_callable("system.describe")()
  File "/home/janesh/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 108, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://champ2.cm.cluster:45002, code 500) Encounted error from JSONRPC function "system.describe" with params ()
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://champ2.cm.cluster:45002, code 500) Encounted error from JSONRPC function "system.describe" with params ()

Regards,
Aparna

Not directly, at least. The values of num_cpu, num_gpu, and ram_gb are determined automatically, based on the job’s type and the job’s parameters. The PBS cluster manager may use those values in assigning a suitable compute node.

It could be that connection attempts from the CryoSPARC worker to the CryoSPARC master, champ2.cm.cluster, are redirected to a proxy. You may want to check with cluster IT support how to ensure that compute nodes can connect to port 45002 on host champ2.cm.cluster.
You may try whether adding the line
export NO_PROXY="champ2.cm.cluster" to the file /home/janesh/cryosparc/cryosparc_worker/config.sh helps.
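
As a sketch, the tail of /home/janesh/cryosparc/cryosparc_worker/config.sh could then look like this (the lowercase no_proxy duplicate and the extra localhost entries are assumptions, added because some tools only honour one form of the variable):

# keep the existing CRYOSPARC_* exports above this line
export NO_PROXY="champ2.cm.cluster,localhost,127.0.0.1"   # hosts the worker must reach directly, bypassing the site proxy
export no_proxy="${NO_PROXY}"                             # lowercase variant, honoured by some tools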

Dear Cryosparc Team,

Thanks for all the help!

Regards,
Aparna