Update from 4.2 to 4.5.1 failed

Hi there,
I just updated from 4.2 to 4.5.1, and after the restart my jobs stall during launch.

  1. I ran “cryosparcm test install” and everything looks fine,
  2. but running “cryosparcm test workers P30” does not look good.
  3. In addition, I include the “job.log” of P30 J97.

How shall I proceed?


Running installation tests…
✓ Running as cryoSPARC owner
✓ Running on master node
✓ CryoSPARC is running
✓ Connected to command_core at http://msr:61002
✓ CRYOSPARC_LICENSE_ID environment variable is set
✓ License has correct format
✓ Insecure mode is disabled
✓ License server set to "https://get.cryosparc.com"
✓ Connection to license server succeeded
✓ License server returned success status code 200
✓ License server returned valid JSON response
✓ License exists and is valid
✓ CryoSPARC is running v4.5.1
✓ Running the latest version of CryoSPARC
Could not get latest patch (status code 404)
✓ Patch update not required
✓ Admin user has been created
✓ GPU worker connected.


2) Output of “cryosparcm test workers P30”:
Running worker tests…
2024-05-29 18:03:32,630 log CRITICAL | Worker test results
2024-05-29 18:03:32,630 log CRITICAL | msr
2024-05-29 18:03:32,630 log CRITICAL | ✕ LAUNCH
2024-05-29 18:03:32,630 log CRITICAL | Error:
2024-05-29 18:03:32,630 log CRITICAL | See P30 J97 for more information
2024-05-29 18:03:32,630 log CRITICAL | :warning: SSD
2024-05-29 18:03:32,630 log CRITICAL | Did not run: Launch test failed
2024-05-29 18:03:32,630 log CRITICAL | :warning: GPU
2024-05-29 18:03:32,631 log CRITICAL | Did not run: Launch test failed


3) The “job.log” of P30 J97:
================= CRYOSPARCW ======= 2024-05-29 18:01:31.462829 =========
Project P30 Job J97
Master msr Port 61002

========= monitor process now starting main process at 2024-05-29 18:01:31.462869
MAINPROCESS PID 69217
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://msr:61002/api) HTTP Error 400 Bad Request; please check cryosparcm log command_core for additional information.
Response from server: b'\n \n Bad Request\n \n \n

Bad Request

\n Invalid Method 'Invalid HTTP method: 'post''\n \n\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 126, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/client.py", line 36, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 91, in __init__
    self._reload() # attempt connection immediately to gather methods
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 118, in _reload
    system = self._get_callable("system.describe")()
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandClient.Error(
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://msr:61002) Did not receive a JSON response from method "system.describe" with params ()
*** CommandClient: (http://msr:61002/api) HTTP Error 400 Bad Request; please check cryosparcm log command_core for additional information.
Response from server: b'\n \n Bad Request\n \n \n

Bad Request

\n Invalid Method 'Invalid HTTP method: 'post''\n \n\n'
*** CommandClient: (http://msr:61002/api) HTTP Error 400 Bad Request; please check cryosparcm log command_core for additional information.
Response from server: b'\n \n Bad Request\n \n \n

Bad Request

\n Invalid Method 'Invalid HTTP method: 'post''\n \n\n'
Process Process-1:
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://msr:61002/api) HTTP Error 400 Bad Request; please check cryosparcm log command_core for additional information.
Response from server: b'\n \n Bad Request\n \n \n

Bad Request

\n Invalid Method 'Invalid HTTP method: 'post''\n \n\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/run.py", line 32, in cryosparc_compute.run.main
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 126, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/client.py", line 36, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 91, in __init__
    self._reload() # attempt connection immediately to gather methods
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 118, in _reload
    system = self._get_callable("system.describe")()
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandClient.Error(
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://msr:61002) Did not receive a JSON response from method "system.describe" with params ()


Could you please post the outputs of these commands:

  1. On the master
    hostname -f 
    host $(hostname -f)
    cryosparcm cli "get_scheduler_targets()"
    
  2. On one of the CryoSPARC worker computers (the master, in case of a “Single Workstation” instance)
    host msr
    curl msr:61002
    cat /home/cryosparc/cryosparc_worker/version
    

Dear wtempel,

Regarding “cat /home/cryosparc/cryosparc_worker/version”: it looks like the worker was not updated, right?

hostname -f
msr

host $(hostname -f)
Host msr not found: 2(SERVFAIL)

cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: ‘/scr/cryosparc’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 1, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 2, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 3, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 4, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 5, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 6, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}, {‘id’: 7, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’}], ‘hostname’: ‘msr’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘msr’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], 
‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc@msr’, ‘title’: ‘Worker node msr’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/cryosparc/cryosparc_worker/bin/cryosparcw’}]

host msr
Host msr not found: 2(SERVFAIL)

curl msr:61002
Hello World from cryosparc command core.

cat /home/cryosparc/cryosparc_worker/version
v4.2.1
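
The mismatch visible here (the install test reports v4.5.1 for the running instance, while the worker's version file says v4.2.1) is the kind of thing a small script can catch before jobs are launched. The following is a minimal sketch, not part of CryoSPARC: the function name check_versions is made up for illustration, and it assumes that the master installation also has a plain-text version file like the worker's.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CryoSPARC command): compare two version files.
# Usage: check_versions MASTER_VERSION_FILE WORKER_VERSION_FILE
# Returns 0 if both files hold the same version string, 1 on a mismatch,
# and 2 if either file cannot be read.
check_versions() {
    local master_file="$1" worker_file="$2"
    local master_version worker_version

    # Strip whitespace and newlines so that "v4.5.1\n" compares equal to "v4.5.1".
    master_version="$(tr -d '[:space:]' < "$master_file")" || return 2
    worker_version="$(tr -d '[:space:]' < "$worker_file")" || return 2

    if [ "$master_version" = "$worker_version" ]; then
        echo "OK: master and worker are both at $master_version"
    else
        echo "MISMATCH: master $master_version, worker $worker_version"
        return 1
    fi
}
```

It could be called as check_versions /path/to/cryosparc_master/version /home/cryosparc/cryosparc_worker/version, where the master path is a placeholder to be replaced with the actual installation directory.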

Correct. But I am surprised the update failed under these circumstances. You can try the following in a fresh shell under the cryosparc account:

eval $( /home/cryosparc/cryosparc_worker/bin/cryosparcw env)
cd /home/cryosparc/cryosparc_worker/
curl -L https://get.cryosparc.com/download/worker-v4.5.1/$CRYOSPARC_LICENSE_ID -o cryosparc_worker.tar.gz
./bin/cryosparcw update

If you see "Successfully updated." after running these commands, you can

  1. exit the shell (to avoid inadvertently running commands inside the loaded CryoSPARC environment)
  2. try the worker test again
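
The check for the success message can also be scripted. This is a minimal sketch, assuming the output of the update command was saved to a log file; the function name update_succeeded and the file name update.log are hypothetical, and the only string it relies on is the "Successfully updated." line mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given log file contains the
# literal line fragment "Successfully updated." (fixed-string match, no regex).
update_succeeded() {
    grep -qF "Successfully updated." "$1"
}
```

For example, the update could be run as ./bin/cryosparcw update 2>&1 | tee update.log, followed by update_succeeded update.log && echo "worker updated". Capturing the output with tee keeps the full log available if the message is missing.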

“cryosparcm test workers P30” suggested running
“nvidia-smi -pm 1” as root.

A second
“cryosparcm test workers P30” prints the following:
Running worker tests…
2024-05-29 23:44:43,952 log CRITICAL | Worker test results
2024-05-29 23:44:43,952 log CRITICAL | msr
2024-05-29 23:44:43,952 log CRITICAL | ✓ LAUNCH
2024-05-29 23:44:43,952 log CRITICAL | ✓ SSD
2024-05-29 23:44:43,952 log CRITICAL | ✓ GPU
2024-05-29 23:44:43,958 log CRITICAL | :warning: NVIDIA GeForce RTX 2080 Ti @ 00000000:87:00.0: GPU Software Power Cap is Active

“GPU Software Power Cap is Active” is a warning, not an error. For details, see the nvidia-smi documentation:

nvidia-smi --help-query-gpu | grep -A 1  sw_power_cap

OK, thanks for the help.
Once the worker was updated as suggested, CryoSPARC is running fine again :slight_smile: