CryoSPARC crashing - sock file

Hi @wtempel ,

I tried splitting the particle stack into smaller groups and running 2D classification on each individually, but unfortunately the CryoSPARC session started buffering again. I assigned three 2D classification jobs (2.5 M particles each), each with 1 GPU, and ultimately received the sock connection error:

cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/bell/programs/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------

CryoSPARC process status:

unix:///tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock refused connection

----------------------------------------------------------------------------

So it seems I have not fully addressed this issue yet. Any suggestions on what I can try next?

Thanks again!

What are the outputs of these commands when the sock file refuses connection?

free -g
date
ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
date
ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock

Thanks for your reply, @wtempel.

Here is the requested output:

bell@ub22-04:~/useful-scripts$ date
Mon Apr 15 05:07:15 PM EDT 2024
bell@ub22-04:~/useful-scripts$ ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
bell@ub22-04:~/useful-scripts$ date
Mon Apr 15 05:07:50 PM EDT 2024
bell@ub22-04:~/useful-scripts$ ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock
srwx------ 1 bell bell 0 Apr 15 14:17 /tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock
srwx------ 1 bell bell 0 Apr 15 14:17 /tmp/mongodb-39001.sock
bell@ub22-04:~/useful-scripts$

The empty ps output suggests that the CryoSPARC-related processes have exited, but their socket files were (unexpectedly) left behind.
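If, as appears to be the case here, no CryoSPARC or MongoDB processes remain, a common workaround (sketched below; do confirm via ps that no cryosparc_ or mongod processes survive first, since removing the socket under a live mongod could cause database trouble) is to remove the stale socket files and start CryoSPARC again:

# only after ps confirms no cryosparc_ or mongod processes remain
rm /tmp/cryosparc-supervisor-*.sock /tmp/mongodb-*.sock
cryosparcm start

To find out why the processes exited in the first place, though: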
What is the output of the command

cryosparcm log supervisord | tail -n 20

?

@wtempel here is the output.

This is from while CryoSPARC is buffering and the sock connection issue is occurring:

bell@ub22-04:~$ cryosparcm log supervisord | tail -n 20
2024-04-15 10:11:08,026 INFO spawned: 'app' with pid 62991
2024-04-15 10:11:09,695 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 10:11:09,855 INFO spawned: 'app_api' with pid 63009
2024-04-15 10:11:11,206 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:02,685 INFO RPC interface 'supervisor' initialized
2024-04-15 14:17:02,685 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-15 14:17:02,687 INFO daemonizing the supervisord process
2024-04-15 14:17:02,687 INFO supervisord started with pid 69353
2024-04-15 14:17:07,480 INFO spawned: 'database' with pid 69467
2024-04-15 14:17:09,355 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:11,321 INFO spawned: 'command_core' with pid 69578
2024-04-15 14:17:17,012 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-04-15 14:17:17,773 INFO spawned: 'command_vis' with pid 69611
2024-04-15 14:17:18,775 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:18,923 INFO spawned: 'command_rtp' with pid 69639
2024-04-15 14:17:19,924 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:23,709 INFO spawned: 'app' with pid 69696
2024-04-15 14:17:25,379 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:25,539 INFO spawned: 'app_api' with pid 69714
2024-04-15 14:17:26,827 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

Thanks for posting the supervisord log.
Please can you post the outputs of these commands:

ps 69578 69467 69578
last reboot | head -n 3
dmesg -T | grep -i oom
dmesg -T | grep -e 69353 -e 69578

@wtempel

bell@ub22-04:~$ ps 69578 69467 69578
    PID TTY      STAT   TIME COMMAND
bell@ub22-04:~$ last reboot | head -n 3
reboot   system boot  6.2.0-39-generic Tue Apr  9 09:30   still running
reboot   system boot  6.2.0-39-generic Sun Apr  7 11:27   still running
reboot   system boot  6.2.0-39-generic Wed Mar 20 13:00 - 11:25 (17+22:24)
bell@ub22-04:~$ sudo dmesg -T | grep -i oom
bell@ub22-04:~$ sudo dmesg -T | grep -e 69353 -e 69578
bell@ub22-04:~$

Please can you confirm with the command
uptime -s

@wtempel sure:

bell@ub22-04:~$ uptime -s
2024-04-09 09:29:54

@carterwheat Unfortunately, I was not able to confirm the (only) hypothesis I had based on your problem description and the commands’ outputs that you so patiently provided.
The hypothesis went like this:

  1. CryoSPARC was started as normal.
  2. CryoSPARC processes were abruptly killed by some event (RAM exhaustion or other system load?); a mere TERM signal would have allowed the socket file to be cleaned up (see the sketch below).
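For illustration only (please do not run this against a working instance), the difference the signal makes could be sketched as follows; the pgrep pattern is a guess at how the supervisord command line looks on this install:

# hypothetical lookup of the CryoSPARC supervisord pid
SUPERVISORD_PID=$(pgrep -f "cryosparc.*supervisord" | head -n 1)

# graceful: supervisord stops its children and removes its .sock file
kill -TERM "$SUPERVISORD_PID"

# abrupt (what the hypothesis assumes happened): no cleanup runs,
# so /tmp/cryosparc-supervisor-*.sock would be left behind
# kill -KILL "$SUPERVISORD_PID"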

The kernel “OOM killer” seemed to me a good candidate for part 2, but there appear to be no supporting log records. Please let us know if you have any additional information that would point to an alternative cause, for example whether the CryoSPARC processes run inside a container or under a cluster workload manager.
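Since the kernel ring buffer that dmesg reads can wrap, one more place worth checking (assuming a persistent systemd journal and Ubuntu’s default log rotation) would be:

# kernel messages from the persistent journal, if one is kept
journalctl -k --since "2024-04-15" | grep -i -e oom -e "out of memory"

# older, rotated kernel logs on Ubuntu (zgrep also reads plain files)
zgrep -i -e oom -e "out of memory" /var/log/kern.log* 2>/dev/null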

@wtempel Thanks for all of your help. I will keep you updated if anything else comes up that may point us in the right direction.

-Carter