After Upgrade to 4.7.1: Test jobs on remote worker nodes fail

Hello,

after upgrading my setup to v4.7.1 (and restarting), I have a problem getting test jobs to run on the remote worker nodes.

My setup consists of:

  • farcry (master and worker)
  • farcry2 (worker only)
  • farcry3 (worker only)

A worker test runs just fine on the host farcry (which is both master and worker), but it fails on the two remote workers:

(base) cryosparc@farcry:~$ cryosparcm test workers P27 --test gpu
Using project P27
Specifying gpu test
Running worker tests...
2025-11-12 08:50:49,173 log CRITICAL | Worker test results
2025-11-12 08:50:49,173 log CRITICAL | farcry
2025-11-12 08:50:49,173 log CRITICAL | ✓ GPU
2025-11-12 08:50:49,179 log CRITICAL | farcry2
2025-11-12 08:50:49,180 log CRITICAL | ✕ GPU
2025-11-12 08:50:49,180 log CRITICAL | Error:
2025-11-12 08:50:49,180 log CRITICAL | See P27 J33 for more information
2025-11-12 08:50:49,180 log CRITICAL | farcry3
2025-11-12 08:50:49,180 log CRITICAL | ✕ GPU
2025-11-12 08:50:49,180 log CRITICAL | Error:
2025-11-12 08:50:49,180 log CRITICAL | See P27 J32 for more information

The event logs for J33 and J32 read as follows:

(base) cryosparc@farcry:~$ cryosparcm eventlog P27 J32
[Wed, 12 Nov 2025 08:40:44 GMT] License is valid.
[Wed, 12 Nov 2025 08:40:44 GMT] Launching job on lane default target farcry3 ...
[Wed, 12 Nov 2025 08:40:44 GMT] Running job on remote worker node hostname farcry3
[Wed, 12 Nov 2025 08:50:49 GMT] **** Kill signal sent by unknown user ****


(base) cryosparc@farcry:~$ cryosparcm eventlog P27 J33
[Wed, 12 Nov 2025 08:40:42 GMT] License is valid.
[Wed, 12 Nov 2025 08:40:42 GMT] Launching job on lane default target farcry2 ...
[Wed, 12 Nov 2025 08:40:42 GMT] Running job on remote worker node hostname farcry2
[Wed, 12 Nov 2025 08:50:49 GMT] **** Kill signal sent by unknown user ****

I also tried the “--test ssd” option, with the same result, so I suppose there is a communication problem with the remote workers. Name resolution and key-based SSH login as user “cryosparc” seem to work fine in both directions between all nodes.
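For reference, this is roughly how I checked the SSH logins (a sketch using my hostnames; BatchMode makes any interactive prompt fail immediately instead of hanging):

# Run on the master as cryosparc; each remote worker should answer
# without any password prompt.
for host in farcry2 farcry3; do
    ssh -o BatchMode=yes "$host" hostname && echo "$host: ssh OK"
done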

The output of “cryosparcm test install”:

✓ Running as cryoSPARC owner
✓ Running on master node
✓ CryoSPARC is running
✓ Connected to command_core at http://farcry:61002
✓ CRYOSPARC_LICENSE_ID environment variable is set
✓ License has correct format
✓ Insecure mode is disabled
✓ License server set to "https://get.cryosparc.com"
✓ Connection to license server succeeded
✓ License server returned success status code 200
✓ License server returned valid JSON response
✓ License exists and is valid
✓ CryoSPARC is running v4.7.1+250814
✓ Running the latest version of CryoSPARC
✓ Patch update not required
✓ Admin user has been created
✓ GPU worker connected.

I have already successfully force-reinstalled the master and worker dependencies.
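For reference, the force reinstall was along these lines (a sketch; the worker path is taken from my install, and the worker command needs to be repeated on each worker node):

# Reinstall master dependencies, then worker dependencies.
cryosparcm forcedeps
/home/cryosparc/sparc/cryosparc_worker/bin/cryosparcw forcedeps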

The workers show up correctly via “get_scheduler_targets()”:

(base) cryosparc@farcry:~/sparc/cryosparc_worker$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/cryocache/',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11714887680, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11714887680, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 2, 'mem': 11714887680, 'name': 'NVIDIA GeForce GTX 1080 Ti'}],
  'hostname': 'farcry',
  'lane': 'default',
  'monitor_port': None,
  'name': 'farcry',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1, 2],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cryosparc@localhost',
  'title': 'Worker node farcry',
  'type': 'node',
  'worker_bin_path': '/home/cryosparc/sparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': '/cryocache/',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 2, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 3, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'}],
  'hostname': 'farcry2',
  'lane': 'default',
  'monitor_port': None,
  'name': 'farcry2',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1, 2, 3],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cryosparc@farcry2',
  'title': 'Worker node farcry2',
  'type': 'node',
  'worker_bin_path': '/home/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': '/cryocache/',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 2, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'},
           {'id': 3, 'mem': 11707809792, 'name': 'NVIDIA GeForce GTX 1080 Ti'}],
  'hostname': 'farcry3',
  'lane': 'default',
  'monitor_port': None,
  'name': 'farcry3',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1, 2, 3],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cryosparc@farcry3',
  'title': 'Worker node farcry3',
  'type': 'node',
  'worker_bin_path': '/home/cryosparc/cryosparc_worker/bin/cryosparcw'}]
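(I have pretty-printed the output above for readability. A sketch of one way to do that, assuming python3 is on the PATH; the CLI prints a Python literal rather than JSON, hence ast.literal_eval instead of json.loads:)

cryosparcm cli "get_scheduler_targets()" | python3 -c \
  "import ast, pprint, sys; pprint.pprint(ast.literal_eval(sys.stdin.read()))"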

I would greatly appreciate any advice on how to further narrow down the problem.

Thank you very much

Andreas

Welcome to the forum @Cryozupp. Please can you post the outputs of these commands, run on farcry as the cryosparc user:

whoami
uname -a
nvidia-smi -L
ssh farcry2 uname -a
ssh farcry2 nvidia-smi -L
ssh farcry2 df -Th $(cryosparcm cli "get_project_dir_abs('P27')")
ssh farcry2 ls -l $(cryosparcm cli "get_project_dir_abs('P27')") | tail -n 10
ssh farcry2 cat /home/cryosparc/cryosparc_worker/patch

Hi Wtempel, thank you very much for looking into the matter.

Here is the output:

(base) cryosparc@farcry:~$ whoami
cryosparc
(base) cryosparc@farcry:~$ uname -a
Linux farcry 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
(base) cryosparc@farcry:~$ ssh farcry2 uname -a
Linux farcry2 5.15.0-161-generic #171-Ubuntu SMP Sat Oct 11 08:17:01 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
(base) cryosparc@farcry:~$ ssh farcry2 nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-95ecc833-82be-1374-ce40-1c4c19a7b787)
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-21de8686-7c33-e055-36b6-5bc2470d431f)
GPU 2: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-35d2e464-faf1-46f4-d215-10475de16b89)
GPU 3: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-9f8c9593-2401-c1c1-7130-61b7f4f71944)
(base) cryosparc@farcry:~$ ssh farcry2 df -Th $(cryosparcm cli "get_project_dir_abs('P27')")
df: /home/cryosparc/testing/CS-workertest: No such file or directory
(base) cryosparc@farcry:~$ ssh farcry2 ls -l $(cryosparcm cli "get_project_dir_abs('P27')") | tail -n 10
ls: cannot access '/home/cryosparc/testing/CS-workertest': No such file or directory
(base) cryosparc@farcry:~$ ssh farcry2 cat /home/cryosparc/cryosparc_worker/patch
250814
(base) cryosparc@farcry:~$ 

The /home/cryosparc/testing/ directory exists only on farcry; it is present on neither farcry2 nor farcry3.

That would explain the failure. Project directories need to be shared between the master and all workers that run jobs within that project.
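Something like the following sketch (hostnames and project ID taken from this thread) can confirm whether each worker sees the project directory at the same absolute path:

# Resolve the project directory on the master, then test for it on each worker.
proj_dir=$(cryosparcm cli "get_project_dir_abs('P27')")
for host in farcry2 farcry3; do
    ssh "$host" test -d "$proj_dir" && echo "$host: present" || echo "$host: MISSING"
done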

Yes, thank you very much - that resolved the issue.

I had simply mixed up the directory names, so the project did not end up in the shared folder…
:melting_face:

Thanks so much
:folded_hands:
Andreas