The FQDN resolves, which is why I wonder what is preventing the FQDN from working. Both servers' /etc/hosts files have the IP and FQDN.
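For reference, the kind of entries meant here would look something like this (the 10.0.0.x addresses are placeholders, not the actual IPs):
# /etc/hosts on both cryoem8 and cryoem9 (illustrative addresses only)
10.0.0.8   cryoem8.ourdomain.edu   cryoem8
10.0.0.9   cryoem9.ourdomain.edu   cryoem9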
Typo on my part; I fixed my post. Poor obfuscation.
I can confirm the user on the master (cryoem8) can ssh exx@cryoem9 without being prompted for a password.
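(For anyone reproducing this check: BatchMode makes ssh fail instead of falling back to a password prompt, so it is a clean test of key-based login:)
ssh -o BatchMode=yes exx@cryoem9 true && echo "passwordless ssh OK"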
All good:
'**worker_bin_path**': '/home/exx/cryosparc_worker/bin/cryosparcw'}]
ls /home/exx/cryosparc_worker/bin/cryosparcw
/home/exx/cryosparc_worker/bin/cryosparcw
This has to be the issue. The master, cryoem8, is running under a different account that does not exist on the worker, cryoem9. However, as indicated above, that user can still ssh without a password. I take it that does not matter?
I also do not know. Are you saying that sn4622115580 (which, by the way, looks more like a manufacturer-assigned name than a "stable", resolvable hostname) has an entry inside /etc/hosts of the form W.X.Y.Z cryoem8.ourdomain.edu, where W.X.Y.Z is the same IP address that, when used for the cryosparcw connect --master parameter, allowed a successful connection?
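A quick way to confirm what the resolver actually returns on each host (assuming the obfuscated FQDN):
getent hosts cryoem8.ourdomain.edu   # address the system resolver uses
grep cryoem8 /etc/hosts              # static entry, if any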
That other user may not have write access to, or may otherwise be unable to access, the shared project directory.
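One way to check this directly on the worker, assuming sudo access (/path/to/project is a placeholder for the shared project directory):
sudo -u otheruser test -r /path/to/project && echo readable
sudo -u otheruser test -w /path/to/project && echo writable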
This looks like interception of the request by an HTTP proxy on your network.
What is the output of the command (as myuser on sn4622115580):
env | grep -i -e proxy -e http -e request
?
Is sn4622115580 the same host as cryoem9?
Did you test this by running on the CryoSPARC master host (as myuser, or whoever owns CryoSPARC processes on the CryoSPARC master), replacing P99 with the actual id of the project to which J2059 belongs:
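(Presumably the full command looked something like the variant used later in this thread, with P99 as the placeholder:)
ssh exx@cryoem9 "ls -ld $(./cryosparcm cli "get_project_dir_abs('P99')") && uname -a"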
*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
File "/home/exx/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
res = func(*args, **kwargs)
File "/home/exx/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found
drwx------. 32 exx exx 4096 Dec 2 14:53 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
But with the other user:
ssh otheruser@cryoem9 "ls -ld $(/home/otheruser/cryosparc_master/bin/cryosparcm cli "get_project_dir_abs('J2056')") && uname -a"
*** (http://cryoem8.fitzpatrick.zi.columbia.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
File "/home/otheruser/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
res = func(*args, **kwargs)
File "/home/otheruser/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found
drwx------. 9 otheruser 1002 4096 Dec 3 10:15 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Edit: I see this thread, which mentions that setting `export NO_PROXY="127.0.0.1,localhost,sn4622115580"` might work around this. Does this need to happen on cryoem8, the master server, or just the worker?
get_project_dir_abs() requires a project id (starting with P), not a job id. Please can you try again?
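For example, if J2056 belongs to project P99 (placeholder id; substitute the real one):
cryosparcm cli "get_project_dir_abs('P99')"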
You may want to run your multi-host CryoSPARC instance under a consistent numeric userid (see the sketch after this list) to avoid:
- unnecessarily generous permissions on project directories
- additional problems that inconsistent file ownerships may cause down the road, such as during the management of backups, archives and data migrations.
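A minimal sketch of what that could look like, run as root on every CryoSPARC host (1001 and cryosparcuser are arbitrary examples; any unused uid/gid reused consistently across hosts would do):
# same uid/gid on master and every worker
groupadd --gid 1001 cryosparcuser
useradd --uid 1001 --gid 1001 --create-home cryosparcuser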
This may or may not help in your case, depending on the configuration of the network and/or computer. It seems you observed a proxy-related error when running a command on the worker.
ssh exx@cryoem9 "ls -ld $(./cryosparcm cli "get_project_dir_abs('J2056')") && uname -a"
*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
File "/home/ouruser/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
res = func(*args, **kwargs)
File "/home/ouruser/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found
drwx------. 32 exx exx 4096 Dec 2 14:53 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[ouruser@cryoem8 bin]$ ssh ouruser@cryoem9 "ls -ld $(./cryosparcm cli "get_project_dir_abs('J2056')") && uname -a"
*** (http://cryoem8.ourdomain.edu:39002, code 400) Encountered ServerError from JSONRPC function "get_project_dir_abs" with params ('J2056',):
ServerError: Error retrieving project dir for J2056 - project not found
Traceback (most recent call last):
File "/home/ouruser/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
res = func(*args, **kwargs)
File "/home/ouruser/cryosparc_master/cryosparc_command/command_core/__init__.py", line 8149, in get_project_dir_abs
assert project_doc, f"Error retrieving project dir for {project_uid} - project not found"
AssertionError: Error retrieving project dir for J2056 - project not found
drwx------. 9 ouruser 1002 4096 Dec 3 10:15 .
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
But just to show you that both users can read/write to J2056:
cryoem8 ~]$ ssh cryoem9 ls -l /engram/workstation/Zuker/CS-zuker/J2056
total 176
-rwxrwx---. 1 ouruser ouruser 18 Dec 3 19:20 events.bson
drwxrwx---. 2 ouruser ouruser 0 Dec 3 19:20 gridfs_data
-rwxrwx---. 1 ouruser ouruser 23357 Dec 3 19:20 job.json
-rwxrwx---. 1 ouruser ouruser 22008 Dec 3 19:20 job.log
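(ls shows ownership and mode, but an actual write test as each user removes any doubt; .cs_write_test is just a scratch file name:)
ssh exx@cryoem9 "touch /engram/workstation/Zuker/CS-zuker/J2056/.cs_write_test && rm /engram/workstation/Zuker/CS-zuker/J2056/.cs_write_test && echo write OK"
and the same with the other user in place of exx.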
I’m basing this on a colleague’s post, albeit theirs was a standalone setup.
Interesting. What about (using the same ssh string, but a different command):
ssh ouruni@cryoem9 "id && uname -a"
If the cryosparcw connect --master parameter ended in ouruni.edu, I would have expected the proxy to be bypassed, but I may misunderstand the effect of the no_proxy variable.
I cannot be sure due to the obfuscation, but unless ouruni is awf2130 or a member of the 1002 group, ouruni cannot access the project directory on cryoem9, and jobs would fail to run on cryoem9.
Inclusion of ${CRYOSPARC_MASTER_HOSTNAME} in the no_proxy definition would be effective only if CRYOSPARC_MASTER_HOSTNAME were also defined, but CRYOSPARC_MASTER_HOSTNAME might not be defined in the worker environment. In any case, I recommend no additional changes to the no_proxy definition.
Yes, awf2130 = ouruni. What showed that the user would not be able to run on cryoem9? I can add that user to the 1002 group. Here is the actual user for full context:
[awf2130@cryoem8 cryosparc_master]$ id
uid=485959(awf2130) gid=500(user) groups=500(user),46004(habazi)
[awf2130@cryoem8 cryosparc_master]$ ssh awf2130@cryoem9 "id && uname -a"
uid=485959(awf2130) gid=500(awf2130) groups=500(awf2130) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Linux sn4622115580 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
I was missing confirmation that the fictional ouruni user owns the project directory, which you just provided.
Is the directory /home/workstation/Zuker/CS-zuker/J2056 also owned by awf2130?
Please can you post the output of the command
ls -al /home/workstation/Zuker/CS-zuker/J2056/
In the get_scheduler_targets() output, is the ssh_str now awf2130@cryoem9.ourdomain.edu for the cryoem9 worker? It is shown above as exx@cryoem9.ourdomain.edu.
It’s not /home, it’s /engram, and yes, it is owned by awf2130:
ls -al /engram/workstation/Zuker/CS-zuker/J2056
total 752
drwxrwx--- 3 awf2130 user 109 Dec 3 19:20 .
drwxrwx--- 2281 awf2130 exx 60322 Dec 5 21:02 ..
-rwxrwx--- 1 awf2130 user 18 Dec 3 19:20 events.bson
drwxrwx--- 2 awf2130 user 0 Dec 3 19:20 gridfs_data
-rwxrwx--- 1 awf2130 user 23357 Dec 3 19:20 job.json
-rwxrwx--- 1 awf2130 user 22008 Dec 3 19:20 job.log
Yes, I was going to use the exx user, but all the installations on the other workers and the master were done by awf2130. That user did not exist on cryoem9, so I created it with the same UID/GID and installed the worker there under awf2130.
Well, I thought having the worker at version 4.6.2 and the master at 4.6.0 was causing this issue with jobs not running, but the issue persists. What other debug info can I provide?
The job log shows:
Unable to forward this request at this time. This request could not be forwarded to the origin server or to any parent caches. Some possible problems are: Internet connection needed to access this domains origin servers may be down. All configured parent caches may be currently unreachable. The administrator may not allow this cache to make direct connections to origin servers.
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryoem8.ouruni.edu:39002, code 500) Encounted error from JSONRPC function "system.describe" with params ()
Could that job have been started on a worker node that did not have the necessary no_proxy setting? On that particular worker node, what is the output of the commands:
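(presumably the same checks as earlier in this thread, i.e. something like:)
env | grep -i -e proxy -e http -e request
./cryosparcw call curl http://cryoem8.ourdomain.edu:39002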
Well, I thought I set it via the export command. I’ll add it to config.sh and use the --update option.
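For example, a line like this appended to cryosparc_worker/config.sh on the worker (the host list is illustrative; adjust it for your network, and note that some tools read the lowercase variable while others read the uppercase one):
export NO_PROXY="localhost,::1,127.0.0.1,cryoem8.ourdomain.edu,.ourdomain.edu"
export no_proxy="${NO_PROXY}"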
When I log in, the variable is definitely not set:
./cryosparcw call curl http://cryoem8.ourdomain.edu:39002
<html><head>
<meta type="copyright" content="Copyright (C) 1996-2017 The Squid Software Foundation and contributors">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>ERROR: The requested URL could not be retrieved</title>
[snip...]
After running:
export no_proxy=localhost,::1,127.0.0.1,${CRYOSPARC_MASTER_HOSTNAME},.ourdomain.edu
$ ./cryosparcw call curl http://cryoem8.ourdomain.edu:39002
Hello World from cryosparc command core.
If you are referring to cryosparcw connect --update:
Another run of cryosparcw connect [..] --update should not be needed if you only changed
the contents of cryosparc_worker/config.sh and there were no other changes, such as a changed absolute path to the cryosparc_worker/ directory.
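For completeness, if such a change had occurred, the re-connect would look something like this (hostnames and base port as used earlier in this thread; check cryosparcw connect --help for the exact flags on your version):
./bin/cryosparcw connect --worker cryoem9.ourdomain.edu --master cryoem8.ourdomain.edu --port 39000 --update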