cryoSPARC master crashes

Hey,

Since I updated cryoSPARC to version 4, we have been having issues with the cryoSPARC master on multiple computers. Heavy CPU load, no matter whether it comes from a local cryoSPARC worker or from another program such as cryoDRGN or RELION, causes the master to crash.

I tried to restart the master with cryosparcm restart, which results in:

cryosparcm restart
CryoSPARC is running.
Stopping cryoSPARC
unix:///tmp/cryosparc-supervisor-72b9c14670698f15a83f7a9d3c37a37e.sock refused connection

Trying to stop the master with cryosparcm stop gives the same error:

cryosparcm stop
CryoSPARC is running.
Stopping cryoSPARC
unix:///tmp/cryosparc-supervisor-72b9c14670698f15a83f7a9d3c37a37e.sock refused connection

The only solution is to delete the temporary .sock file in the tmp folder.
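For reference, the cleanup looks roughly like the sketch below (the hash in the socket filename is installation-specific, and the guard only removes the socket when no cryoSPARC supervisord process is left, since deleting the socket of a live supervisor would orphan it):

```shell
# clean_stale_sock [DIR]: remove leftover cryoSPARC supervisor sockets
# (normally in /tmp) once no cryoSPARC supervisord process is running.
clean_stale_sock() {
    local dir="${1:-/tmp}"
    if pgrep -f "supervisord.*cryosparc" >/dev/null 2>&1; then
        echo "supervisord still running; not removing sockets" >&2
        return 1
    fi
    rm -f "$dir"/cryosparc-supervisor-*.sock
}
```

After a successful cleanup, cryosparcm start brings the master back up.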

I observed that I can even crash the master intentionally just by loading a couple of CPU cores on the system, without running any job in cryoSPARC (Threadripper PRO 7965WX with 512 GB RAM; Intel Xeon w5-3435X with 512 GB; all workstations run Ubuntu 22.04.4 LTS). Running jobs on a remote worker does not crash the master.
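For anyone wanting to reproduce this without any scientific software, a plain shell busy-loop generates the kind of CPU load described (a sketch; core count and duration are arbitrary):

```shell
# cpu_load N S: spawn N pure-CPU busy loops for S seconds, then kill them.
cpu_load() {
    local n="$1" s="$2" pids=()
    for _ in $(seq "$n"); do
        while :; do :; done &   # busy loop pinning one core
        pids+=($!)
    done
    sleep "$s"
    kill "${pids[@]}"
}
```

For example, cpu_load 8 60 would load 8 cores for a minute while you watch whether the master processes survive.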

The currently installed cryoSPARC version is 4.4.1.

I would be happy about help!

Best,
Ole

I had a similar problem: one of my machines was crashing when all four GPUs were running, and it turned out to be a bad power supply. Since the machine has redundant power supplies, it was easy to test by booting with only one or the other PSU plugged in.

The .sock file gets left behind whenever there’s a crash, that part seems like normal behavior.

Thanks for this hint. It would be odd if the power supplies of 7 different machines (5 of them prebuilt systems from Dell, of different generations, and not all located in the same building or on the same power circuit) developed issues at the same point in time. Another thing is that I don’t need to load the GPUs to crash the master; CPU load on a couple of cores is sometimes already enough.

@OleUns Please can you provide additional details:

  1. Under heavy CPU load, does the computer or the CryoSPARC application crash?
  2. After the crash, do the non-CryoSPARC applications that triggered the crash continue running?
  3. What is the output of the command sudo journalctl | grep -i oom?

Hey, yes of course.

  1. Only the cryoSPARC master/application crashes. You can, for example, still use the cryoSPARC worker if it’s attached to a remote master.
  2. In case another application crashes cryoSPARC, that application just continues running. However, it usually seems that load caused by a cryoSPARC worker on the same computer (it can be any job: 2D classification, heterogeneous refinement, or non-uniform refinement) is what most often causes the master to crash. It appears that the more CPU-hungry the job, the more likely the crash.
  3. There is no output (blank) for sudo journalctl | grep -i oom.

Do I understand your setup correctly:

The computer on which the CryoSPARC master processes crash

  1. also acts as a CryoSPARC worker for this CryoSPARC master
  2. also acts as a CryoSPARC worker for another CryoSPARC master that is running on another computer

?

If so, please can you post the outputs of these commands for each of the master computers:

cryosparcm status | grep -v LICENSE
cryosparcm cli "get_scheduler_targets()"

The computer on which the CryoSPARC master processes crash:

  1. also acts as a CryoSPARC worker for this CryoSPARC master

  2. also acts as a CryoSPARC master for other CryoSPARC workers.

cryosparcm status | grep -v LICENSE
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/em7/Software/cryosparc/cryosparc_master
Current cryoSPARC version: v4.4.1+240110
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 2403428, uptime 7:09:28
app_api                          RUNNING   pid 2403447, uptime 7:09:27
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 2403331, uptime 7:09:38
command_rtp                      RUNNING   pid 2403389, uptime 7:09:30
command_vis                      RUNNING   pid 2403363, uptime 7:09:31
database                         RUNNING   pid 2403220, uptime 7:09:41

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_MASTER_HOSTNAME="em7"
export CRYOSPARC_DB_PATH="/home/em7/Software/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/media/raid0/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25383469056, 'name': 'NVIDIA GeForce RTX 4090'}, {'id': 1, 'mem': 25386352640, 'name': 'NVIDIA GeForce RTX 4090'}], 'hostname': 'em7', 'lane': 'default', 'monitor_port': None, 'name': 'em7', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'em7@em7', 'title': 'Worker node em7', 'type': 'node', 'worker_bin_path': '/home/em7/Software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/media/hrishi/scratch2/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11536039936, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'em1', 'lane': 'em1', 'monitor_port': None, 'name': 'em1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}, 'ssh_str': 'hrishi@em1', 'title': 'Worker node em1', 'type': 'node', 'worker_bin_path': '/home/hrishi/Software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/media/scratch/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 16900292608, 'name': 'Quadro RTX 5000'}, {'id': 1, 'mem': 16891707392, 'name': 'Quadro RTX 5000'}], 'hostname': 'em4', 'lane': 'em4', 'monitor_port': None, 'name': 'em4', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'hrishi@em4', 'title': 'Worker node em4', 'type': 'node', 'worker_bin_path': '/home/hrishi/Software/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/media/scratch/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 50962300928, 'name': 'Quadro RTX 8000'}, {'id': 1, 'mem': 50962300928, 'name': 'Quadro RTX 8000'}], 'hostname': 'preprocess1', 'lane': 'preprocess1', 'monitor_port': None, 'name': 'preprocess1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}, 'ssh_str': 'hallberg@preprocess1', 'title': 'Worker node preprocess1', 'type': 'node', 'worker_bin_path': '/home/hallberg/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/media/em6/scratch/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 21116682240, 'name': 'NVIDIA RTX 4000 Ada Generation'}, {'id': 1, 'mem': 21125267456, 'name': 'NVIDIA RTX 4000 Ada Generation'}], 'hostname': 'em6-1', 'lane': 'em6-1', 'monitor_port': None, 'name': 'em6-1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'em6@em6-1', 'title': 'Worker node 
em6-1', 'type': 'node', 'worker_bin_path': '/home/em6/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/media/em5/scratch/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 21116682240, 'name': 'NVIDIA RTX 4000 Ada Generation'}, {'id': 1, 'mem': 21125267456, 'name': 'NVIDIA RTX 4000 Ada Generation'}], 'hostname': 'em5', 'lane': 'em5', 'monitor_port': None, 'name': 'em5', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'em5@em5', 'title': 'Worker node em5', 'type': 'node', 'worker_bin_path': '/home/em5/Software/cryosprac/cryosparc_worker/bin/cryosparcw'}]

Previously, most workstations were configured to run master and worker at the same time. In this configuration they were not connected to any worker or master on a different workstation. This caused the same unpredictable crashes on each of the workstations.

Another setup that crashed quite a while ago; in this case, master and worker are on the same computer:

cryosparcm status | grep -v LICENSE
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/supervisor/cryosparc/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 3909865, uptime 43 days, 5:56:05
app_api                          RUNNING   pid 3909929, uptime 43 days, 5:56:03
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 3909514, uptime 43 days, 5:56:15
command_rtp                      RUNNING   pid 3909721, uptime 43 days, 5:56:08
command_vis                      RUNNING   pid 3909663, uptime 43 days, 5:56:09
database                         RUNNING   pid 3909316, uptime 43 days, 5:56:18

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_MASTER_HOSTNAME="3DEM-Workstation"
export CRYOSPARC_DB_PATH="/home/supervisor/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/media/supervisor/DATA/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 12630294528, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 1, 'mem': 12630294528, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 2, 'mem': 12624723968, 'name': 'NVIDIA GeForce RTX 3080 Ti'}, {'id': 3, 'mem': 12630294528, 'name': 'NVIDIA GeForce RTX 3080 Ti'}], 'hostname': '3DEM-Workstation', 'lane': 'default', 'monitor_port': None, 'name': '3DEM-Workstation', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'supervisor@3DEM-Workstation', 'title': 'Worker node 3DEM-Workstation', 'type': 'node', 'worker_bin_path': '/home/supervisor/cryosparc/cryosparc_worker/bin/cryosparcw'}]

Thanks @OleUns for these clarifications.
What is the output of the command last reboot on em7?
Does the sudo journalctl command on em7 have any output?

No problem!
I get the following outputs:

last reboot
reboot   system boot  6.5.0-28-generic Wed May  1 13:15   still running
reboot   system boot  6.5.0-28-generic Wed May  1 12:45 - 12:59  (00:14)
reboot   system boot  6.5.0-28-generic Wed May  1 12:04 - 12:16  (00:11)
reboot   system boot  6.5.0-28-generic Wed May  1 11:58 - 12:16  (00:17)
reboot   system boot  6.5.0-28-generic Wed May  1 11:54 - 12:16  (00:22)
reboot   system boot  6.5.0-28-generic Wed May  1 11:48 - 11:53  (00:04)
reboot   system boot  6.5.0-28-generic Wed May  1 11:44 - 11:53  (00:09)
reboot   system boot  6.5.0-26-generic Mon Mar 25 17:46 - 10:05 (36+15:19)
reboot   system boot  6.5.0-26-generic Fri Mar 22 23:52 - 16:04 (2+16:11)
reboot   system boot  6.5.0-26-generic Fri Mar 22 22:57 - 23:51  (00:54)

wtmp begins Fri Mar 22 22:57:12 2024

and sudo journalctl does produce output; if it might be helpful for you, I can post it.

Hey @wtempel, the master crash occurred again. The master was running two local refinements in parallel, and two worker nodes on different workstations were running RBMC and heterogeneous refinement (both for >7 h). An additional difference is that cryoSPARC is now at version 4.5. I ran sudo journalctl | grep -i oom again and got the following output:

sudo journalctl | grep -i oom
[sudo] password for em7: 
maj 11 12:11:05 em7 systemd[1838]: vte-spawn-449fe2f4-9255-4970-9077-e2ddc08a40ff.scope: systemd-oomd killed 194 process(es) in this unit.
maj 11 12:11:05 em7 systemd-oomd[1041]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-449fe2f4-9255-4970-9077-e2ddc08a40ff.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 51.68% > 50.00% for > 20s with reclaim activity
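As a side note, the cgroup that systemd-oomd acted on can be pulled out of such journal lines with a small filter (a sketch matching the message format shown above; the wording may differ across systemd versions):

```shell
# oomd_cgroup: read journal lines on stdin and print the cgroup path
# from systemd-oomd "Killed ... due to memory pressure" messages.
oomd_cgroup() {
    sed -n 's|.*Killed \(/[^ ]*\) due to memory pressure.*|\1|p'
}
```

Typical usage would be: sudo journalctl -u systemd-oomd | oomd_cgroup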

Running cryosparcm status | grep -v LICENSE and cryosparcm cli "get_scheduler_targets()":

cryosparcm status | grep -v LICENSE
cryosparcm cli "get_scheduler_targets()"
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/em7/Software/cryosparc/cryosparc_master
Current cryoSPARC version: v4.5.0
----------------------------------------------------------------------------

CryoSPARC process status:

unix:///tmp/cryosparc-supervisor-72b9c14670698f15a83f7a9d3c37a37e.sock refused connection

----------------------------------------------------------------------------
An error ocurred while checking license status
Could not get license verification status. Are all CryoSPARC processes RUNNING?
/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py:135: UserWarning: *** CommandClient: (http://em7:39002/api) URL Error [Errno 111] Connection refused, attempt 1 of 3. Retrying in 30 seconds
  system = self._get_callable("system.describe")()
/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py:135: UserWarning: *** CommandClient: (http://em7:39002/api) URL Error [Errno 111] Connection refused, attempt 2 of 3. Retrying in 30 seconds
  system = self._get_callable("system.describe")()
Traceback (most recent call last):
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 105, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/home/em7/Software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 226, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://em7:39002/api, code 500) URL Error [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/em7/Software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/em7/Software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 57, in <module>
    cli = CommandClient(host=host, port=int(port))
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 38, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 97, in __init__
    self._reload()  # attempt connection immediately to gather methods
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 135, in _reload
    system = self._get_callable("system.describe")()
  File "/home/em7/Software/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 108, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://em7:39002, code 500) Encounted error from JSONRPC function "system.describe" with params ()

The output above is consistent with a disruptive termination of CryoSPARC’s supervisord process, which could occur if

  • the computer experienced a power failure or “cold” restart, or
  • a SIGKILL was sent to the supervisord process.

In the latter case, CryoSPARC server processes usually managed by supervisord may be left running and would need to be terminated manually (carefully avoiding SIGKILL signals, aka “kill -9”) before removing the stale *.sock file (guide).
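One possible shape for that manual cleanup (a sketch only; the process-name pattern is an example, and you should inspect pgrep’s output before killing anything):

```shell
# term_leftovers PATTERN: send SIGTERM (never SIGKILL) to processes whose
# command line matches PATTERN, wait briefly, and succeed only when none
# remain. E.g. leftover cryosparc_master or mongod processes.
term_leftovers() {
    local pat="$1" pid
    for pid in $(pgrep -f "$pat"); do
        kill -TERM "$pid" 2>/dev/null
    done
    sleep 1                        # give processes a moment to exit cleanly
    ! pgrep -f "$pat" >/dev/null   # succeed when nothing matches anymore
}
```

Only after this succeeds would one remove the stale /tmp/cryosparc-supervisor-*.sock file and run cryosparcm start. SIGTERM matters here because it lets mongod flush and shut down its database cleanly.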
It is possible that the supervisord process was abruptly killed by systemd-oomd. I am not sure whether systemd-oomd recorded the process IDs of the individual processes it killed. If such system log entries exist, you may want to compare them to the process IDs recorded in the CryoSPARC supervisord log:

cryosparcm log supervisord
To avoid disruption of CryoSPARC master services by processing jobs running on the same computer, you may want to consider installing the CryoSPARC master on a dedicated, comparatively lightweight server that does not need any GPU resources and would only run lightweight interactive jobs, but no GPU-accelerated jobs.
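If separating the master is not immediately practical: the journal excerpt earlier in the thread shows systemd-oomd enforcing a 50% memory-pressure limit on the user slice, so another option, purely illustrative and at your own risk, would be to relax that limit for user sessions via a systemd drop-in. The directive below exists in current systemd (see systemd.resource-control), but verify it against your systemd version, and the file path is only an example:

```ini
# /etc/systemd/system/user@.service.d/oomd-override.conf  (example path)
[Service]
ManagedOOMMemoryPressureLimit=80%
```

After adding the drop-in, run systemctl daemon-reload and log out and back in for it to take effect. Note this only raises the threshold; under genuine memory exhaustion the kernel OOM killer can still act.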