CryoSPARC crashing - sock file

Hi @wtempel ,

I tried splitting the particle stack into smaller groups and running 2D classification on each separately, but unfortunately the CryoSPARC session still became unresponsive (buffering). I queued three 2D classification jobs (~2.5 M particles each), each with 1 GPU, and ultimately received the sock connection error:

cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/bell/programs/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------

CryoSPARC process status:

unix:///tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock refused connection

----------------------------------------------------------------------------

So it seems I have not fully addressed this issue quite yet. Any suggestions on what I can try next?

Thanks again!

What are the outputs of these commands when the sock file refuses connection?

free -g
date
ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
date
ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock
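If it is more convenient to capture everything in one shot at the moment the problem occurs, the commands above can be wrapped in a small script. This is just a convenience sketch; the log file location is arbitrary:

#!/bin/bash
# Collect CryoSPARC diagnostics into a single timestamped log file.
LOG=/tmp/cryosparc_diag_$(date +%Y%m%d_%H%M%S).log
{
  echo "=== date ==="; date
  echo "=== free -g ==="; free -g
  echo "=== cryosparc/mongo processes ==="
  ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
  echo "=== socket files ==="
  ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock 2>/dev/null
} > "$LOG" 2>&1
echo "Diagnostics written to $LOG"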

Thanks for your reply @wtempel

Here is the requested output

bell@ub22-04:~/useful-scripts$ date
Mon Apr 15 05:07:15 PM EDT 2024
bell@ub22-04:~/useful-scripts$ ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
bell@ub22-04:~/useful-scripts$ date
Mon Apr 15 05:07:50 PM EDT 2024
bell@ub22-04:~/useful-scripts$ ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock
srwx------ 1 bell bell 0 Apr 15 14:17 /tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock
srwx------ 1 bell bell 0 Apr 15 14:17 /tmp/mongodb-39001.sock
bell@ub22-04:~/useful-scripts$

The empty ps output suggests that the CryoSPARC-related processes have exited, but the socket files were (unexpectedly) left behind.
What is the output of the command

cryosparcm log supervisord | tail -n 20

?
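For reference, if the processes are indeed gone but the socket files linger, the usual recovery (sketched below, assuming the default /tmp socket paths shown above) is to remove the stale files before starting CryoSPARC again:

# First confirm that no CryoSPARC or mongod processes are still running.
ps -eo pid,cmd | grep -e cryosparc_ -e mongo | grep -v grep   # should print nothing
# Then remove the stale socket files and restart.
rm /tmp/cryosparc-supervisor-*.sock /tmp/mongodb-*.sock
cryosparcm start
cryosparcm status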

@wtempel here is the output.

This is while CryoSPARC is buffering and the sock connection issues are occurring:

bell@ub22-04:~$ cryosparcm log supervisord | tail -n 20
2024-04-15 10:11:08,026 INFO spawned: 'app' with pid 62991
2024-04-15 10:11:09,695 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 10:11:09,855 INFO spawned: 'app_api' with pid 63009
2024-04-15 10:11:11,206 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:02,685 INFO RPC interface 'supervisor' initialized
2024-04-15 14:17:02,685 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-15 14:17:02,687 INFO daemonizing the supervisord process
2024-04-15 14:17:02,687 INFO supervisord started with pid 69353
2024-04-15 14:17:07,480 INFO spawned: 'database' with pid 69467
2024-04-15 14:17:09,355 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:11,321 INFO spawned: 'command_core' with pid 69578
2024-04-15 14:17:17,012 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-04-15 14:17:17,773 INFO spawned: 'command_vis' with pid 69611
2024-04-15 14:17:18,775 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:18,923 INFO spawned: 'command_rtp' with pid 69639
2024-04-15 14:17:19,924 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:23,709 INFO spawned: 'app' with pid 69696
2024-04-15 14:17:25,379 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:25,539 INFO spawned: 'app_api' with pid 69714
2024-04-15 14:17:26,827 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

Thanks for posting the supervisord log.
Please can you post the outputs of these commands

ps 69578 69467 69578
last reboot | head -n 3
dmesg -T | grep -i oom
dmesg -T | grep -e 69353 -e 69578

@wtempel

bell@ub22-04:~$ ps 69578 69467 69578
    PID TTY      STAT   TIME COMMAND
bell@ub22-04:~$ last reboot | head -n 3
reboot   system boot  6.2.0-39-generic Tue Apr  9 09:30   still running
reboot   system boot  6.2.0-39-generic Sun Apr  7 11:27   still running
reboot   system boot  6.2.0-39-generic Wed Mar 20 13:00 - 11:25 (17+22:24)
bell@ub22-04:~$ sudo dmesg -T | grep -i oom
bell@ub22-04:~$ sudo dmesg -T | grep -e 69353 -e 69578
bell@ub22-04:~$

Please can you confirm the system boot time with the command
uptime -s

@wtempel sure:

bell@ub22-04:~$ uptime -s
2024-04-09 09:29:54

@carterwheat Unfortunately, I was not able to confirm the (only) hypothesis I had formed, based on your problem description and the commands’ outputs that you so patiently provided.
The hypothesis went like this:

  1. CryoSPARC was started normally.
  2. CryoSPARC processes were abruptly killed by some event (RAM exhaustion or other system load?); a mere TERM signal would have allowed the socket file to be cleaned up.

The kernel “OOM killer” seemed to me a good candidate for step 2, but there appear to be no supporting log records. Please let us know if you have any additional information that would point to an alternative cause, such as whether the CryoSPARC processes are running inside a container or are subject to a cluster workload manager.
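In the meantime, it may help to keep a lightweight record of memory headroom so that a future occurrence can be correlated with system load. A minimal sketch (the interval and log path are arbitrary; it could equally be run from cron or inside tmux/screen):

# Append a timestamped memory snapshot every 5 minutes.
while true; do
  { date; free -g; echo; } >> /tmp/mem_watch.log
  sleep 300
done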

@wtempel Thanks for all of your help. I will keep you updated if anything else comes up that may point us in the right direction.

-Carter

Hi all,
I’m facing the same problem. I tried some of these suggestions, but none worked. Attached is the error I’m getting.

I’m running CryoSPARC on a local EXXACT workstation with Rocky Linux 9, if it helps. Any suggestions?

Welcome to the forum @marygh. Please can you post additional information:

  1. Please describe what you have tried, and the respective outcomes.
  2. Please post the outputs of these commands:
    grep -e HOST -e PORT /home/cryosparc_user/software/cryosparc/cryosparc_master/config.sh
    hostname -f
    host sn4622120602
    host $(hostname -f)
    ls -l /tmp/cryosparc*sock
    ps -eo pid,ppid,start,command | grep -e cryosparc_ -e mongo
    last reboot
    sudo journalctl | grep -i oom
    tail -n 60 /home/cryosparc_user/software/cryosparc/cryosparc_master/run/supervisord.log
    

This is the output of the commands (attached as a screenshot):

@marygh Please can you work with your lab IT support to

  • register your network adapter for a stable DHCP reservation
  • create a DNS record for the computer’s permanent hostname
  • configure the computer’s hostname to be consistent with the DNS entry from the previous step

After these steps, you may want to

  • define CRYOSPARC_MASTER_HOSTNAME (inside cryosparc_master/config.sh) with the newly assigned, permanent full hostname
  • perform a complete shutdown and restart of CryoSPARC (a minimal sketch of both steps follows below)
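A minimal sketch of those two steps, with a placeholder hostname:

# 1. In cryosparc_master/config.sh, set the permanent, fully qualified hostname
#    (the value below is a placeholder):
# export CRYOSPARC_MASTER_HOSTNAME="workstation.example.university.edu"

# 2. Completely shut down and restart CryoSPARC:
cryosparcm stop
ps -eo pid,cmd | grep -e cryosparc_ -e mongo | grep -v grep   # confirm nothing is left running
cryosparcm start
cryosparcm status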

Does CryoSPARC start up properly after these steps? If it does not, please post the outputs of these commands as text (instead of a screenshot):

grep -e HOST -e PORT /home/cryosparc_user/software/cryosparc/cryosparc_master/config.sh
hostname -f
host $(hostname -f)
cat /etc/hosts
ls -l /tmp/cryosparc*sock
ps -eo pid,ppid,start,command | grep -e cryosparc_ -e mongo

If CryoSPARC starts normally, you may have to reconfigure the worker component with the cryosparcw connect command and appropriate parameters; a sketch follows the command list below. To help us suggest appropriate parameters, please post the outputs of the following commands:

cryosparcm cli "get_scheduler_targets()"
hostname -f
cat /etc/hosts
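For orientation, once the hostname is stable a worker reconnect typically looks something like the sketch below; the worker path, base port, and SSD cache path here are placeholders rather than values taken from this thread (see the CryoSPARC installation guide for the full set of connect options):

# Run as the CryoSPARC user from the worker installation directory (path is a placeholder).
cd /home/cryosparc_user/software/cryosparc/cryosparc_worker
./bin/cryosparcw connect \
    --worker $(hostname -f) \
    --master $(hostname -f) \
    --port 39000 \
    --ssdpath /scratch/cryosparc_cache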

Thank you! I just need to mention that this workstation is not connected to the Wi-Fi.

Hi all,
I am observing the same thing as mentioned previously by other users. Briefly, I have a new Dell workstation (4x NVIDIA A4500, 48-core Intel Xeon W7, 256 GB DDR5 RAM, 5 TB SSD, 118 TB HDD). I currently have two CryoSPARC users and have opted for the ‘single workstation’ installation for each (which had previously worked well in my postdoc lab). However, I am routinely getting the ‘socket refused connection’ error when both users are running jobs. User 1 base port = 39000 and User 2 base port = 39020. We have intentionally only pushed the system to 50% capacity (GPU memory use is typically under 10 GB per card; RAM usually has >100 GB free at any given moment). User 1 has sudo access and User 2 does not.

I ran these commands from the comment thread; the commands and their combined output are below:

grep -e HOST -e PORT /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/config.sh
hostname -f
host COSE-EGREENE-LX.clients.ad.sfsu.edu
host $(hostname -f)
ls -l /tmp/cryosparc*sock
ps -eo pid,ppid,start,command | grep -e cryosparc_ -e mongo
last reboot
tail -n 60 /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/run/supervisord.log

export CRYOSPARC_MASTER_HOSTNAME="COSE-EGREENE-LX.clients.ad.sfsu.edu"
export CRYOSPARC_BASE_PORT=39020
COSE-EGREENE-LX.clients.ad.sfsu.edu
COSE-EGREENE-LX.clients.ad.sfsu.edu has address 130.212.214.209
COSE-EGREENE-LX.clients.ad.sfsu.edu has IPv6 address fe80::6c41:9d2b:e7a9:d5d6
COSE-EGREENE-LX.clients.ad.sfsu.edu has address 130.212.214.209
COSE-EGREENE-LX.clients.ad.sfsu.edu has IPv6 address fe80::6c41:9d2b:e7a9:d5d6
srwx------ 1 921270295@ad.sfsu.edu domain users@ad.sfsu.edu 0 Jul 15 11:42 /tmp/cryosparc-supervisor-714ae7c340d4df77be474f8627fd6c9c.sock
1120743 85729 11:42:12 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/supervisord.conf
1120879 1120743 11:42:16 mongod --auth --dbpath /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_database --port 39021 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1120994 1120743 11:42:20 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39022 cryosparc_command.command_core:start() -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/gunicorn.conf.py
1120995 1120994 11:42:20 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39022 cryosparc_command.command_core:start() -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/gunicorn.conf.py
1121024 1120743 11:42:26 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39023 -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/gunicorn.conf.py
1121039 1121024 11:42:26 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39023 -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/gunicorn.conf.py
1121048 1120743 11:42:27 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39025 -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/gunicorn.conf.py
1121060 1121048 11:42:27 python /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39025 -c /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/gunicorn.conf.py
1121094 1120743 11:42:31 /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
1122908 1120995 11:51:01 bash /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J29 --master_hostname COSE-EGREENE-LX.clients.ad.sfsu.edu --master_command_core_port 39022
1122918 1122908 11:51:01 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J29 --master_hostname COSE-EGREENE-LX.clients.ad.sfsu.edu --master_command_core_port 39022
1122921 1122918 11:51:01 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J29 --master_hostname COSE-EGREENE-LX.clients.ad.sfsu.edu --master_command_core_port 39022
1122945 1120995 11:51:04 bash /home/921270295@ad.sfsu.edu/.local/share/cryosparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J30 --master_hostname COSE-EGREENE-LX.clients.ad.sfsu.edu --master_command_core_port 39022
1122955 1122945 11:51:04 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J30 --master_hostname COSE-EGREENE-LX.clients.ad.sfsu.edu --master_command_core_port 39022
1122957 1122955 11:51:04 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J30 --master_hostname COSE-EGREENE-LX.clients.ad.sfsu.edu --master_command_core_port 39022
1125517 1124882 12:08:52 grep --color=auto -e cryosparc_ -e mongo

reboot system boot 6.5.0-41-generic Tue Jul 2 12:04 still running
reboot system boot 6.5.0-41-generic Mon Jul 1 13:38 still running
reboot system boot 6.5.0-41-generic Thu Jun 27 11:18 still running
reboot system boot 6.5.0-35-generic Mon Jun 10 10:32 still running
reboot system boot 6.5.0-35-generic Wed Jun 5 14:51 still running
reboot system boot 6.5.0-35-generic Wed Jun 5 14:22 - 14:31 (00:08)
reboot system boot 6.5.0-35-generic Wed Jun 5 09:52 - 14:31 (04:38)
reboot system boot 6.5.0-35-generic Mon Jun 3 18:00 - 14:31 (1+20:30)
reboot system boot 6.5.0-35-generic Mon Jun 3 14:06 - 17:54 (03:47)
reboot system boot 6.5.0-28-generic Wed May 1 13:29 - 17:54 (33+04:24)
reboot system boot 6.5.0-28-generic Thu Apr 25 08:53 - 13:27 (6+04:33)
reboot system boot 6.5.0-28-generic Wed Apr 24 09:22 - 08:49 (23:27)
reboot system boot 6.5.0-27-generic Tue Apr 16 11:27 - 09:09 (7+21:42)
reboot system boot 6.5.0-26-generic Wed Apr 3 12:53 - 08:47 (7+19:54)
reboot system boot 6.5.0-26-generic Wed Mar 27 17:40 - 12:40 (6+18:59)
reboot system boot 6.5.0-26-generic Thu Mar 21 10:21 - 14:46 (6+04:24)
reboot system boot 6.5.0-26-generic Wed Mar 20 14:58 - 15:05 (00:06)
reboot system boot 6.5.0-26-generic Wed Mar 20 13:52 - 13:54 (00:01)
reboot system boot 6.5.0-26-generic Wed Mar 20 13:05 - 13:50 (00:45)
reboot system boot 6.5.0-26-generic Wed Mar 20 12:59 - 13:04 (00:04)
reboot system boot 6.5.0-26-generic Wed Mar 20 12:36 - 12:58 (00:22)
reboot system boot 6.5.0-26-generic Tue Mar 19 15:19 - 12:35 (21:15)
wtmp begins Tue Mar 19 15:19:43 2024

2024-07-10 17:30:36,938 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-07-10 17:30:36,941 INFO daemonizing the supervisord process
2024-07-10 17:30:36,942 INFO supervisord started with pid 710428
2024-07-10 17:30:41,479 INFO spawned: 'database' with pid 710536
2024-07-10 17:30:42,674 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-10 17:30:44,904 INFO spawned: 'command_core' with pid 710644
2024-07-10 17:30:50,557 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-07-10 17:30:51,078 INFO spawned: 'command_vis' with pid 710677
2024-07-10 17:30:52,799 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-10 17:30:52,952 INFO spawned: 'command_rtp' with pid 710706
2024-07-10 17:30:54,609 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-10 17:30:55,969 INFO spawned: 'app' with pid 710720
2024-07-10 17:30:57,436 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-10 17:30:57,553 INFO spawned: 'app_api' with pid 710740
2024-07-10 17:30:59,356 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-12 13:56:34,780 INFO waiting for app to stop
2024-07-12 13:56:34,780 INFO waiting for app_api to stop
2024-07-12 13:56:34,780 INFO waiting for command_core to stop
2024-07-12 13:56:34,780 INFO waiting for command_rtp to stop
2024-07-12 13:56:34,780 INFO waiting for command_vis to stop
2024-07-12 13:56:34,780 INFO waiting for database to stop
2024-07-12 13:56:34,799 WARN stopped: app (terminated by SIGTERM)
2024-07-12 13:56:34,802 WARN stopped: app_api (terminated by SIGTERM)
2024-07-12 13:56:35,132 INFO stopped: database (exit status 0)
2024-07-12 13:56:35,486 INFO stopped: command_vis (exit status 0)
2024-07-12 13:56:35,487 INFO stopped: command_rtp (exit status 0)
2024-07-12 13:56:37,086 INFO waiting for command_core to stop
2024-07-12 13:56:37,186 INFO stopped: command_core (exit status 0)
2024-07-12 14:29:56,282 INFO RPC interface 'supervisor' initialized
2024-07-12 14:29:56,282 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-07-12 14:29:56,284 INFO daemonizing the supervisord process
2024-07-12 14:29:56,285 INFO supervisord started with pid 875815
2024-07-12 14:30:00,554 INFO spawned: 'database' with pid 875922
2024-07-12 14:30:02,090 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-12 14:30:03,728 INFO spawned: 'command_core' with pid 876030
2024-07-12 14:30:09,326 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-07-12 14:30:09,784 INFO spawned: 'command_vis' with pid 876100
2024-07-12 14:30:11,531 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-12 14:30:11,653 INFO spawned: 'command_rtp' with pid 876131
2024-07-12 14:30:13,274 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-12 14:30:14,728 INFO spawned: 'app' with pid 876183
2024-07-12 14:30:16,233 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-12 14:30:16,365 INFO spawned: 'app_api' with pid 876204
2024-07-12 14:30:17,367 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-15 11:42:12,511 INFO RPC interface 'supervisor' initialized
2024-07-15 11:42:12,511 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-07-15 11:42:12,513 INFO daemonizing the supervisord process
2024-07-15 11:42:12,514 INFO supervisord started with pid 1120743
2024-07-15 11:42:16,712 INFO spawned: 'database' with pid 1120879
2024-07-15 11:42:18,319 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-15 11:42:20,410 INFO spawned: 'command_core' with pid 1120994
2024-07-15 11:42:26,033 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-07-15 11:42:26,446 INFO spawned: 'command_vis' with pid 1121024
2024-07-15 11:42:27,446 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-15 11:42:27,571 INFO spawned: 'command_rtp' with pid 1121048
2024-07-15 11:42:29,116 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-15 11:42:30,185 INFO spawned: 'app' with pid 1121073
2024-07-15 11:42:31,660 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-07-15 11:42:31,767 INFO spawned: 'app_api' with pid 1121094
2024-07-15 11:42:33,727 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

We get the socket refusal after running homogeneous refinement jobs, which fail at different times. We are now trying to analyze just one dataset at a time but, of course, would like to increase throughput. Any advice is greatly appreciated! Thank you!

Is it connected to a wired network?

@egreene May I ask

  1. What are the CryoSPARC versions of the two instances?
  2. What is the exact text of the socket refusal message(s)?
  3. Where (in which log file or location in the UI) do you observe the message(s)?

No, it doesn’t have a network connection