New worker node gets Encounted error from JSONRPC function "system.describe" with params ()

The FQDN resolves, which is why I wonder what is preventing the FQDN from working. Both servers' /etc/hosts files contain the IP and FQDN.
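For reference, resolution can be double-checked on each host with standard tools (hostnames here are the obfuscated ones from this thread):

getent hosts cryoem9.ourdomain.edu   # resolves via NSS, so /etc/hosts entries are honoured
getent hosts cryoem8.ourdomain.edu
ping -c 1 cryoem9.ourdomain.edu      # confirms the resolved address is reachable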

Typo on my part; I fixed my post. Poor obfuscation.

I can confirm the user on the master (cryoem8) can ssh exx@cryoem9 without being prompted for a password.

All good:

'**worker_bin_path**': '/home/exx/cryosparc_worker/bin/cryosparcw'}]

ls /home/exx/cryosparc_worker/bin/cryosparcw
/home/exx/cryosparc_worker/bin/cryosparcw

This has to be the issue. The master, cryoem8, is running under a different account that does not exist on the worker, cryoem9. However, as indicated above, that user can still ssh without a password. I take it that does not matter?

I also do not know. Are you saying that sn4622115580 (which, by the way, looks more like a manufacturer-assigned than a “stable”, resolvable hostname) has an entry inside /etc/hosts of the form
W.X.Y.Z cryoem8.ourdomain.edu, where W.X.Y.Z is the same IP address that, when used for the
cryosparcw connect --master parameter, allowed a successful connection?
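A quick way to check that (run on sn4622115580; W.X.Y.Z being the placeholder used above):

grep -i cryoem8 /etc/hosts   # expect something like: W.X.Y.Z cryoem8.ourdomain.edu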

That other user may not have write access to, or may otherwise be unable to access, the shared project directory.
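One way to test that, with placeholder names (otheruser, cryoem9 and the project path are only illustrative here):

ssh otheruser@cryoem9 "touch /path/to/project_dir/testfile && rm /path/to/project_dir/testfile"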

Absolutely correct. That sn is a serial number; it's how Exxact does it.

RW access confirmed:
touch /path/to/J2059/testfile

Looking back at your error

this looks like interception of the request by an HTTP proxy on your network.

  1. What is the output of the command (as myuser on sn4622115580):

    env | grep -i -e proxy -e http -e request
    

    ?

  2. Is sn4622115580 the same host as cryoem9?

  3. Did you test this by running the following on the CryoSPARC master host (as myuser, or whoever owns the CryoSPARC processes on the master), replacing P99 with the actual id of the project to which J2059 belongs:

    ssh exx@cryoem9 "ls -ld $(cryosparcm cli "get_project_dir_abs('P99')") && uname -a"
    

Ah good catch:

env | grep -i -e proxy -e http -e request

SELINUX_ROLE_**REQUEST**ED=

**http**_**proxy**=**http**://gw-srv-01.ourdomain.edu:3128/

SELINUX_LEVEL_**REQUEST**ED=

Yes.

*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
  File "/home/exx/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
    res = func(*args, **kwargs)
  File "/home/exx/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
    assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found

drwx------. 32 exx exx 4096 Dec  2 14:53 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

But with the other user:

 ssh otheruser@cryoem9 "ls -ld $(/home/otheruser/cryosparc_master/bin/cryosparcm cli "get_project_dir_abs('J2056')") && uname -a"
*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
  File "/home/otheruser/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
    res = func(*args, **kwargs)
  File "/home/otheruser/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
    assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found

drwx------. 9 otheruser 1002 4096 Dec  3 10:15 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

A simple version works:

ssh otheruser@cryoem9 "ls -ld /path/to/J2056"
drwxrwx---. 3 otheruser otheruser 109 Dec  3 16:50 /path/to/J2056

Edit: I see this thread, which mentions that setting `export NO_PROXY="127.0.0.1, localhost,sn4622115580"` might work around this. Does this need to happen on cryoem8, the master server, or just the worker?

get_project_dir_abs() requires a project id (starting with P) instead of a job id. Please can you try again?
You may want to set up your multi-host CryoSPARC instance to run under a consistent numeric userid (a sketch follows below the list) to avoid:

  • unnecessarily generous permissions on project directories
  • additional problems that inconsistent file ownerships may cause down the road, such as during the management of backups, archives and data migrations.
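For example, a minimal sketch (assuming root access on the worker; the cryosparcuser name and the 1002 uid/gid are only illustrative values, match whatever the master actually uses):

# on the worker, as root; use the uid/gid reported by `id` for the CryoSPARC account on the master
groupadd -g 1002 cryosparcuser
useradd -u 1002 -g 1002 -m cryosparcuser
id cryosparcuser   # confirm uid/gid match the master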

This may or may not help in your case, depending on the configuration of the network and/or computer. It seems you observed a proxy-related error when running a command on the worker.

Yes the project ID is J2056

 ssh exx@cryoem9 "ls -ld $(./cryosparcm cli "get_project_dir_abs('J2056')") && uname -a"
*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
  File "/home/ouruser/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
    res = func(*args, **kwargs)
  File "/home/ouruser/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
    assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found

drwx------. 32 exx exx 4096 Dec  2 14:53 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[ouruser@cryoem8 bin]$ ssh ouruser@cryoem9 "ls -ld $(./cryosparcm cli "get_project_dir_abs('J2056')") && uname -a"
*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
  File "/home/ouruser/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
    res = func(*args, **kwargs)
  File "/home/ouruser/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
    assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found

drwx------. 9 ouruser 1002 4096 Dec  3 10:15 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

But just to show you that both users can read/write to J2056:

cryoem8 ~]$ ssh cryoem9 ls -l  /engram/workstation/Zuker/CS-zuker/J2056 
total 176
-rwxrwx---. 1 ouruser ouruser    18 Dec  3 19:20 events.bson
drwxrwx---. 2 ouruser ouruser     0 Dec  3 19:20 gridfs_data
-rwxrwx---. 1 ouruser ouruser 23357 Dec  3 19:20 job.json
-rwxrwx---. 1 ouruser ouruser 22008 Dec  3 19:20 job.log

I’m basing this on a colleague’s post, albeit theirs was a standalone setup.

The project ID for this job can be displayed with the command

grep \"project_uid\" /engram/workstation/Zuker/CS-zuker/J2056/job.json

Interesting. Do you have the same output as your colleague for the command (on cryoem8)

cryosparcm call env |grep -i proxy

?
What about the commands

/path/to/cryosparc_worker/bin/cryosparcw call env | grep -i proxy

on each of your workers?

grep \"project_uid\" /engram/workstation/Zuker/CS-zuker/J2056/job.json

**"project_uid"**: "P1",

OK now I get:

ssh ouruni@cryoem9 "ls -ld $(./cryosparcm cli "get_project_dir_abs('P1')") && uname -a"
drwxrwx---. 2254 awf2130 1002 59767 Dec  4 22:40 /engram/workstation/Zuker/CS-zuker
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
./cryosparcm call env |grep -i proxy
NO_PROXY=ouruni.edu
http_proxy=http://gw-srv-01.ouruni.edu:3128
https_proxy=http://gw-srv-01.ouruni.edu:3128
HTTPS_PROXY=http://gw-srv-01.ouruni.edu:3128
no_proxy=localhost,::1,127.0.0.1,cryoem8.ouruni.edu,.ouruni.edu
HTTP_PROXY=http://gw-srv-01.rc.ouruni.edu:3128
./cryosparcw call env | grep -i proxy
http_proxy=http://gw-srv-01.ouruni.edu:3128/
no_proxy=localhost,::1,127.0.0.1,ouruni.edu

Interesting. What about (using the same ssh string, but different command):

ssh ouruni@cryoem9 "id && uname -a"

If the cryosparcw connect --master parameter ended in ouruni.edu, I would have expected the proxy to be bypassed, but I may misunderstand the effect of the no_proxy variable. You might want to

  1. ask your IT support for suggestions
  2. quote the definition:
export no_proxy="localhost,::1,127.0.0.1,ouruni.edu"
  3. include the IP address of the master in the no_proxy definition (a combined example follows below)
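Combining the last two, something along these lines might be worth a try (W.X.Y.Z being the master's IP address, as used earlier in this thread):

export no_proxy="localhost,::1,127.0.0.1,W.X.Y.Z,ouruni.edu"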
uid=xxx(ouruni) gid=500(ouruni) groups=500(ouruni) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

So something like:
export no_proxy=localhost,::1,127.0.0.1,${CRYOSPARC_MASTER_HOSTNAME},.ouruni.edu

In both config.sh files, on the master and the worker?

FWIW, I was able to run cryosparcw connect using the FQDN after setting no_proxy.

I cannot be sure due to the obfuscation, but unless ouruni is awf2130 or a member of the 1002 group, ouruni cannot access the project directory on cryoem9, and jobs would fail to run on cryoem9.

Inclusion of ${CRYOSPARC_MASTER_HOSTNAME} in the no_proxy definition would be effective only if CRYOSPARC_MASTER_HOSTNAME were also defined, but CRYOSPARC_MASTER_HOSTNAME might not be defined in the worker environment. In any case, because you were already able to connect using the FQDN after setting no_proxy, I recommend no additional changes to the no_proxy definition.
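A quick way to check whether CRYOSPARC_MASTER_HOSTNAME is defined on the worker side, reusing the call wrapper used above:

/path/to/cryosparc_worker/bin/cryosparcw call env | grep CRYOSPARC_MASTER_HOSTNAME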

Yes, awf2130 = ouruni. What showed that the user would not be able to run on cryoem9? I can add that user to the 1002 group. Here is the actual user for full context:

[awf2130@cryoem8 cryosparc_master]$ id
uid=485959(awf2130) gid=500(user) groups=500(user),46004(habazi)
[awf2130@cryoem8 cryosparc_master]$ ssh awf2130@cryoem9 "id && uname -a"
uid=485959(awf2130) gid=500(awf2130) groups=500(awf2130) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I was missing confirmation that the fictional ouruni user owns the project directory, which you have just provided.

  1. Is the directory
    /home/workstation/Zuker/CS-zuker/J2056 also owned by awf2130?
  2. Please can you post the output of the command
ls -al /home/workstation/Zuker/CS-zuker/J2056/
  3. In get_scheduler_targets() output (see the command below), is the ssh_str now awf2130@cryoem9.ourdomain.edu for the cryoem9 worker? It is shown above as exx@cryoem9.ourdomain.edu.
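If helpful, the current scheduler targets, including each ssh_str, can be displayed on the master with:

cryosparcm cli "get_scheduler_targets()"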

It’s not /home, it’s /engram, and yes, it is owned by awf2130:

ls -al  /engram/workstation/Zuker/CS-zuker/J2056
total 752
drwxrwx---    3 awf2130 user   109 Dec  3 19:20 .
drwxrwx--- 2281 awf2130 exx  60322 Dec  5 21:02 ..
-rwxrwx---    1 awf2130 user    18 Dec  3 19:20 events.bson
drwxrwx---    2 awf2130 user     0 Dec  3 19:20 gridfs_data
-rwxrwx---    1 awf2130 user 23357 Dec  3 19:20 job.json
-rwxrwx---    1 awf2130 user 22008 Dec  3 19:20 job.log

Yes, I was going to use the exx user, but all the installations on the other workers and the master were done by awf2130. That user did not exist on cryoem9, so I created it with the same UID/GID and installed the worker there under awf2130.

Thanks @RobK. Please can you describe what is currently not (yet) working as expected?

Well, I thought having the worker at version 4.6.2 and the master at 4.6.0 was causing this issue with jobs not running, but the issue persists. What other debug information can I provide?

The job log shows:

Unable to forward this request at this time. This request could not be forwarded to the origin server or to any parent caches. Some possible problems are: Internet connection needed to access this domains origin servers may be down.  All configured parent caches may be currently unreachable.  The administrator may not allow this cache to make direct connections to origin servers.

cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryoem8.ouruni.edu:39002, code 500) Encounted error from JSONRPC function "system.describe" with params ()

Could that job have been started on a worker node that did not have the necessary no_proxy setting? On that particular worker node, what is the output of the commands

uname -a
/path/to/cryosparc_worker/bin/cryosparcw call curl http://cryoem8.ouruni.edu:39002
/path/to/cryosparc_worker/bin/cryosparcw env | grep -i proxy

Well, I thought I set it via the export command. I'll add it to config.sh and use the --update option.
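Something like this appended to cryosparc_worker/config.sh, I assume (domain as obfuscated in this thread):

export no_proxy="localhost,::1,127.0.0.1,.ourdomain.edu"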

When I log in, the variable is definitely not set:

 ./cryosparcw call curl http://cryoem8.ourdomain.edu:39002

<html><head>
<meta type="copyright" content="Copyright (C) 1996-2017 The Squid Software Foundation and contributors">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>ERROR: The requested URL could not be retrieved</title>
[snip...]

After running:

export no_proxy=localhost,::1,127.0.0.1,${CRYOSPARC_MASTER_HOSTNAME},.ourdomain.edu

$ ./cryosparcw call curl http://cryoem8.ourdomain.edu:39002
Hello World from cryosparc command core.

If you are referring to cryosparcw connect --update:
Another run of cryosparcw connect [..] --update should not be needed if you only changed
the contents of cryosparc_worker/config.sh and there were no other changes, such as a changed absolute path to the cryosparc_worker/ directory.