Refused connection when cryosparc is running

Hi wtempel,

Thank you for the help.
Instead of upgrading from the old version, I made a fresh installation of CryoSPARC 4.4.0, but the same problem still occurred after CryoSPARC had been running for ~28 hours (I tried twice).

Here is the output from the procedure you suggested.

$ cryosparcm stop
CryoSPARC is running.
Stopping cryoSPARC
unix:///tmp/cryosparc-supervisor-5fccf1c670aab55f9d50ce55f18e4c54.sock refused connection

$ ps -w -U user1 -opid,ppid,start,cmd | grep -e cryosparc -e mongo | grep -v grep

(no output)

Then I deleted the sock file:

$ rm /tmp/cryosparc-supervisor-5fccf1c670aab55f9d50ce55f18e4c54.sock

$ cryosparcm start
Starting cryoSPARC System master process..
CryoSPARC is not already running.
configuring database
    configuration complete
database: started
checkdb success
command_core: started
    command_core connection succeeded
    command_core startup successful
command_vis: started
command_rtp: started
    command_rtp connection succeeded
    command_rtp startup successful
app: started
app_api: started
-----------------------------------------------------

CryoSPARC master started.
 From this machine, access CryoSPARC and CryoSPARC Live at
    http://localhost:61000

 From other machines on the network, access CryoSPARC and CryoSPARC Live at
    http://cryo:61000


Startup can take several minutes. Point your browser to the address
and refresh until you see the cryoSPARC web interface.



$ ps -weopid,ppid,start,cmd | grep -e cryosparc -e mongo | grep -v grep
  82204    2765 12:58:51 python /home/jz/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /home/jz/cryosparc/cryosparc_master/supervisord.conf
  82319   82204 12:58:57 mongod --auth --dbpath /home/jz/cryosparc/cryosparc_database --port 61001 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
  82430   82204 12:59:01 python -c import cryosparc_command.command_core as serv; serv.start(port=61002)
  82468   82204 12:59:08 python -c import cryosparc_command.command_vis as serv; serv.start(port=61003)
  82492   82204 12:59:09 python -c import cryosparc_command.command_rtp as serv; serv.start(port=61005)
  82556   82204 12:59:14 /home/jz/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
  82589   82430 12:59:18 bash /home/jz/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J33 --master_hostname cryo --master_command_core_port 61002
  82604   82589 12:59:18 python -c import cryosparc_compute.run as run; run.run() --project P1 --job J33 --master_hostname cryo --master_command_core_port 61002
  82606   82604 12:59:18 python -c import cryosparc_compute.run as run; run.run() --project P1 --job J33 --master_hostname cryo --master_command_core_port 61002
  82609   82430 12:59:20 bash /home/jz/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J34 --master_hostname cryo --master_command_core_port 61002
  82624   82609 12:59:20 python -c import cryosparc_compute.run as run; run.run() --project P1 --job J34 --master_hostname cryo --master_command_core_port 61002
  82626   82624 12:59:20 python -c import cryosparc_compute.run as run; run.run() --project P1 --job J34 --master_hostname cryo --master_command_core_port 61002


$ ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock
srwx------ 1 jz jz 0 Dec  6 12:58 /tmp/cryosparc-supervisor-5fccf1c670aab55f9d50ce55f18e4c54.sock
srwx------ 1 jz jz 0 Dec  6 12:58 /tmp/mongodb-61001.sock


$ free -g
               total        used        free      shared  buff/cache   available
Mem:             503          13          41           0         448         485
Swap:              1           0           1

I only run CryoSPARC on the workstation, so there should be enough RAM.

I checked the command_core log file and found some errors:

2023-12-06 13:18:27,427 run                  ERROR    | Encountered exception while running background task
2023-12-06 13:18:27,427 run                  ERROR    | Traceback (most recent call last):
2023-12-06 13:18:27,427 run                  ERROR    |   File "cryosparc_master/cryosparc_command/core.py", line 1115, in cryosparc_master.cryosparc_command.core.background_tasks_worker
2023-12-06 13:18:27,427 run                  ERROR    |   File "/home/jz/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 186, in wrapper
2023-12-06 13:18:27,427 run                  ERROR    |     return func(*args, **kwargs)
2023-12-06 13:18:27,427 run                  ERROR    |   File "/home/jz/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 232, in wrapper
2023-12-06 13:18:27,427 run                  ERROR    |     return func(*args, **kwargs)
2023-12-06 13:18:27,427 run                  ERROR    |   File "/home/jz/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 3924, in dump_job_database
2023-12-06 13:18:27,427 run                  ERROR    |     rc.dump_job_database(project_uid = project_uid, job_uid = job_uid, job_completed = job_completed, migration = migration, abs_export_dir = abs_export_dir, logger = logger)
2023-12-06 13:18:27,427 run                  ERROR    |   File "/home/jz/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 444, in dump_job_database
2023-12-06 13:18:27,427 run                  ERROR    |     file_object = gridfs.get(objectid.ObjectId(object_id)).read()
2023-12-06 13:18:27,427 run                  ERROR    |   File "/home/jz/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/gridfs/__init__.py", line 153, in get
2023-12-06 13:18:27,427 run                  ERROR    |     gout._ensure_file()
2023-12-06 13:18:27,427 run                  ERROR    |   File "/home/jz/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/gridfs/grid_file.py", line 484, in _ensure_file
2023-12-06 13:18:27,427 run                  ERROR    |     raise NoFile(
2023-12-06 13:18:27,427 run                  ERROR    | gridfs.errors.NoFile: no file in gridfs collection Collection(Database(MongoClient(host=['cryo:61001'], document_class=dict, tz_aware=False, connect=False, authsource='admin'), 'meteor'), 'fs.files') with _id ObjectId('656ffffcfe82afe930c2f357')

Thanks @cornpeasant for this info.

This command would have displayed processes owned by (hypothetical) Linux user user1, and needs to be modified to match your circumstances (details).
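
For example, if CryoSPARC runs under your current login, a possible variant (the username substitution is only a suggestion) would be:

    # list CryoSPARC and MongoDB processes owned by the account you are logged in as
    ps -w -U "$(whoami)" -opid,ppid,start,cmd | grep -e cryosparc -e mongo | grep -v grep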

Please can you similarly check the database log for errors.
(cryosparcm log database)
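
For example, something along these lines should surface recent error entries (run/database.log is the usual default location; adjust the path to your installation):

    # filter the MongoDB log for common error keywords and show the most recent hits
    grep -i -e error -e exception /home/jz/cryosparc/cryosparc_master/run/database.log | tail -n 40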

@cornpeasant Please can you confirm whether certain ports, derived from the configured CRYOSPARC_MASTER_HOSTNAME and CRYOSPARC_BASE_PORT, are accessible.
For example, what is the output of the command

curl cryo:61006

?
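
The neighbouring service ports can be probed the same way; for example (hostname and port offsets taken from the process list above, so adjust to your configuration):

    # probe the web app, command_core, command_vis, command_rtp and legacy Live ports
    for port in 61000 61002 61003 61005 61006; do
        echo "--- port $port ---"
        curl --max-time 5 cryo:$port
    done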

Thanks @cornpeasant. I cannot spot a reason for the UI failure in the screenshot. The next step would be an analysis of browser logs.
With CryoSPARC running, please

  1. enable browser debugging
  2. re-load the UI at http://localhost:61000
  3. email us the HAR network output

Hi wtempel,
I found that the issue was very likely due to high cache/buffer usage and extremely low available memory on the Linux system. I now clear the cache/buffer every hour using a script, and cryoSPARC can run smoothly so far.
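
A minimal sketch of this kind of hourly cache/buffer drop (assuming root access; this is an illustration rather than necessarily the exact script I use) looks like this:

    #!/bin/bash
    # drop_caches.sh - flush dirty pages to disk, then drop the page cache, dentries and inodes
    # (requires root; the install path /usr/local/bin/drop_caches.sh below is just an example)
    sync
    echo 3 > /proc/sys/vm/drop_caches

    # hourly entry in root's crontab (crontab -e as root):
    # 0 * * * * /usr/local/bin/drop_caches.sh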


Hi! May I ask if you are willing to share the script you use for clearing the cache/buffer? I am also experiencing a “refused connection” issue; it happened when I was transferring particles into SSD cache. Thank you!

What effect could an SSH or VPN tunnel possibly have?
I never used to have this problem. We recently installed a Synology NAS, and I have been facing this issue constantly ever since. Could there be a connection between the two?

Welcome to the forum @Ana .
Please can you post the outputs of the following commands:

cryosparcm env | grep -e HOSTNAME -e PORT
hostname -f
host $(hostname -f)
cat /etc/hosts

and confirm the error messages you observed and the commands that triggered them.

I have not seen any errors. These are the outputs:

cryosparc_user@sn4622119118:~$ cryosparcm env | grep -e HOSTNAME -e PORT
export "CRYOSPARC_MASTER_HOSTNAME=sn4622119118"
export "CRYOSPARC_COMMAND_VIS_PORT=39003"
export "CRYOSPARC_COMMAND_RTP_PORT=39005"
export "CRYOSPARC_HTTP_APP_PORT=39000"
export "CRYOSPARC_HOSTNAME_CHECK=sn4622119118"
export "CRYOSPARC_MONGO_PORT=39001"
export "CRYOSPARC_HTTP_LIVEAPP_LEGACY_PORT=39006"
export "CRYOSPARC_COMMAND_CORE_PORT=39002"
export "CRYOSPARC_BASE_PORT=39000"
export "CRYOSPARC_FORCE_HOSTNAME=false"
cryosparc_user@sn4622119118:~$ hostname -f
sn4622119118
cryosparc_user@sn4622119118:~$ host $(hostname -f)
sn4622119118 has address xx
sn4622119118 has address xx
sn4622119118 has IPv6 address xx
cryosparc_user@sn4622119118:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 sn4622119394

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Please can you provide details (symptoms, screenshots, triggering actions) for “this issue”.

This usually happens when I run either a 2D classification or a homogeneous refinement job; I haven’t noticed it happening with extraction.
The terminal gets closed and CryoSPARC gets disconnected; a loading sign appears on the screen, and when I refresh the page it says “unable to connect”.
If I try to restart CryoSPARC, I get the message “unix:///tmp/cryosparc-supervisor-206773da3c7c06e952eddaffaea9188d.sock refused connection”.
The only two ways I am able to fix it are either restarting the computer or removing the sock file and restarting CryoSPARC. When I log back into CryoSPARC, the job I was running before has an error message that reads:
“Job is unresponsive - no heartbeat received in 180 seconds.”

CryoSPARC master processes might be disrupted due to physical or configured thresholds on RAM usage.

  1. Is this a single workstation (combined master/worker on single host) CryoSPARC instance?
  2. What are the outputs of these commands (the first one requires admin access) on the CryoSPARC master host?
    sudo journalctl | grep -i oom 
    free -h
    nvidia-smi --query-gpu=index,name --format=csv
    cryosparcm log supervisord | tail -n 40
    

Yes, this is a single workstation.
Here are the outputs of the commands:

Another thing I noticed recently is that sometimes, when I try to run interactive jobs, I get this error message:
“Unable to queue P5 J327: ServerError: enqueue job error - P5 J327 is an interactive job and must be queued on the master node”
This never used to happen.

Thanks @Ana for posting the information.

Jul 11 12:51:32 sn4622119118 systemd-oomd[1771]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-1f03abf9-d45b-4872-8707-eded864790df.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 71.69% > 50.00% for > 20s with reclaim activity
Jul 11 12:51:32 sn4622119118 systemd[34133]: vte-spawn-1f03abf9-d45b-4872-8707-eded864790df.scope: systemd-oomd killed 184 process(es) in this unit.
Jul 11 13:25:10 sn4622119118 systemd-oomd[1771]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-ac9ed476-5ea3-426b-a9b0-03e39ecb7b79.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 78.28% > 50.00% for > 20s with reclaim activity
Jul 11 13:25:10 sn4622119118 systemd[34133]: vte-spawn-ac9ed476-5ea3-426b-a9b0-03e39ecb7b79.scope: systemd-oomd killed 154 process(es) in this unit.
Jul 11 13:39:52 sn4622119118 systemd-oomd[1771]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-1a967f77-46b7-4d25-8d88-8c31a0f98cda.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 73.26% > 50.00% for > 20s with reclaim activity
Jul 11 13:39:52 sn4622119118 systemd[34133]: vte-spawn-1a967f77-46b7-4d25-8d88-8c31a0f98cda.scope: systemd-oomd killed 71 process(es) in this unit.
[..]
Jul 11 15:53:23 sn4622119118 systemd-oomd[1966]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-74e968b6-b1af-4207-aa00-506291ef058e.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 78.39% > 50.00% for > 20s with reclaim activity
Jul 11 15:53:23 sn4622119118 systemd[2883]: vte-spawn-74e968b6-b1af-4207-aa00-506291ef058e.scope: systemd-oomd killed 59 process(es) in this unit.
Jul 11 16:43:51 sn4622119118 systemd-oomd[1966]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-b7606e22-0cc2-4019-b48b-9e768c56a254.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 75.31% > 50.00% for > 20s with reclaim activity
Jul 11 16:43:51 sn4622119118 systemd[2883]: vte-spawn-b7606e22-0cc2-4019-b48b-9e768c56a254.scope: systemd-oomd killed 23 process(es) in this unit.
Jul 12 16:06:18 sn4622119118 systemd-oomd[1966]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-55cfc509-e689-44dd-8eb8-3509459f29ab.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 54.87% > 50.00% for > 20s with reclaim activity
Jul 12 16:06:18 sn4622119118 systemd[2883]: vte-spawn-55cfc509-e689-44dd-8eb8-3509459f29ab.scope: systemd-oomd killed 198 process(es) in this unit.
Jul 12 16:19:15 sn4622119118 systemd-oomd[1966]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a54a40c7-8154-44d4-8ac0-3775e7fe3fb2.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 80.66% > 50.00% for > 20s with reclaim activity
Jul 12 16:19:15 sn4622119118 systemd[2883]: vte-spawn-a54a40c7-8154-44d4-8ac0-3775e7fe3fb2.scope: systemd-oomd killed 113 process(es) in this unit.
Jul 12 17:11:27 sn4622119118 systemd-oomd[1966]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-4cebc8d9-ea3a-4f7f-9b9f-d5ecd0cd2299.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 63.57% > 50.00% for > 20s with reclaim activity
Jul 12 17:11:27 sn4622119118 systemd[2883]: vte-spawn-4cebc8d9-ea3a-4f7f-9b9f-d5ecd0cd2299.scope: systemd-oomd killed 124 process(es) in this unit.
[..]
Jul 12 18:19:44 sn4622119118 systemd-oomd[1930]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f67df135-9fab-4b6e-9c3b-7c6b3502bd45.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 53.98% > 50.00% for > 20s with reclaim activity
Jul 12 18:19:44 sn4622119118 systemd[2781]: vte-spawn-f67df135-9fab-4b6e-9c3b-7c6b3502bd45.scope: systemd-oomd killed 69 process(es) in this unit.
[..]
Jul 12 19:33:23 sn4622119118 systemd-oomd[1928]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2ce4f9ab-9129-4267-b060-ea0626d0a2e5.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 84.29% > 50.00% for > 20s with reclaim activity
Jul 12 19:33:23 sn4622119118 systemd[2836]: vte-spawn-2ce4f9ab-9129-4267-b060-ea0626d0a2e5.scope: systemd-oomd killed 146 process(es) in this unit.
[..]
Jul 15 10:38:01 sn4622119118 systemd-oomd[1927]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2ab401dc-78bc-46e1-adec-fbaccf435459.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 77.04% > 50.00% for > 20s with reclaim activity
Jul 15 10:38:01 sn4622119118 systemd[2835]: vte-spawn-2ab401dc-78bc-46e1-adec-fbaccf435459.scope: systemd-oomd killed 145 process(es) in this unit.
Jul 15 17:10:23 sn4622119118 systemd-oomd[1927]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-c29da3e7-5fa7-4ea5-af5c-a814cfa080f4.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 68.23% > 50.00% for > 20s with reclaim activity
Jul 15 17:10:23 sn4622119118 systemd[2835]: vte-spawn-c29da3e7-5fa7-4ea5-af5c-a814cfa080f4.scope: systemd-oomd killed 94 process(es) in this unit.

Do these timestamps correlate in any way with the events you described?

Yes, these are the days and approximate times I’ve been facing this problem.

Based on the information provided:

I hypothesize that certain combinations of workloads trigger the (configurable?) conditions quoted in the log above (`> 50.00% for > 20s with reclaim activity`) for systemd-oomd to `kill` processes, including some CryoSPARC processes, in a way that prevents those processes from "cleaning up after themselves". This _could_ explain the presence of an orphaned

/tmp/cryosparc-supervisor-206773da3c7c06e952eddaffaea9188d.sock

file. Before removing the file, please confirm that the corresponding supervisord process is in fact no longer running (related discussion).
A web search indicates that some consider systemd-oomd aggressive and suggest a variety of interventions. May I suggest a consultation with your system administrator about a potential reconfiguration, and the avoidance of workload patterns that exhaust available system RAM.
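
For example, one way to confirm that the socket is orphaned and to inspect the systemd-oomd settings applied to your user slice (property and unit names shown are the systemd defaults; please confirm against your distribution) would be:

    # 1. confirm no supervisord/CryoSPARC process is still running before removing the socket
    ps -weopid,ppid,start,cmd | grep -e supervisord -e cryosparc -e mongo | grep -v grep

    # 2. inspect systemd-oomd state and the memory-pressure limit applied to your user service
    oomctl
    systemctl show user@$(id -u).service -p ManagedOOMMemoryPressure -p ManagedOOMMemoryPressureLimit

    # 3. with your system administrator, a drop-in (sudo systemctl edit user@.service) could
    #    relax or disable pressure-based killing, e.g. by setting ManagedOOMMemoryPressure=auto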


Hello, may I ask if you are willing to share the script you use for clearing the cache/buffer? I am also experiencing a “refused connection” issue with Patch Motion Correction and 2D classification.

Many thanks!