Patch Motion Correction Fails: Job is unresponsive - no heartbeat received in 180 seconds

Dear All,

I am using CryoSPARC v4.2.1 and I frequently run into the error “Job is unresponsive - no heartbeat received in 180 seconds.”
A Patch Motion Correction job runs for a while, then the Linux terminal in which I ran “cryosparcm start” closes on its own and the job fails. I have also noticed that the cryosparc-supervisor-*.sock file created right after CryoSPARC starts is not deleted automatically, and I always have to remove it manually.
Sometimes, when a job fails, restarting the system and then restarting the job works.
I would appreciate any suggestions to solve these issues.

Thank you,
Ali

@morteza Please can you:

  • describe your CryoSPARC instance:
    • single workstation (combined master/worker) or
    • connected workers or
    • connected cluster
  • email us the error report of an affected job
  1. Under what circumstances did you manually remove the cryosparc-supervisor-*.sock file? The file should be removed automatically by CryoSPARC when the instance is being stopped. It must not be removed manually unless it is ensured that all CryoSPARC processes have been terminated. To check for CryoSPARC processes, you may run, under the Linux account that owns the CryoSPARC processes:
    ps xww | grep -e cryosparc -e mongo
  2. Please post the output of
    cryosparcm status | grep MASTER_HOSTNAME
  3. Please post the output of
    cryosparcm cli "get_scheduler_targets()"
  4. Thank you for emailing us the CryoSPARC instance logs. As a follow-up, please can you post the output of
    cryosparcm joblog P15 J5
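
For convenience, items 1-3 above could be collected in one pass with a short shell sketch like the one below (run on the CryoSPARC master under the CryoSPARC Linux account; the output file name diagnostics.txt is my own choice, and the joblog command from item 4 is best run on its own so its output can be copied directly):

    # Sketch only: gather the requested information into one text file.
    {
        echo "== CryoSPARC / mongo processes =="
        ps xww | grep -e cryosparc -e mongo
        echo "== MASTER_HOSTNAME =="
        cryosparcm status | grep MASTER_HOSTNAME
        echo "== scheduler targets =="
        cryosparcm cli "get_scheduler_targets()"
    } > diagnostics.txt 2>&1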

Dear @wtempel

Thank you for your reply.

  1. I delete the cryosparc-supervisor-*.sock file because it is not removed automatically by CryoSPARC, and at the end of the run the CryoSPARC terminal closes and the job fails. When the terminal closes automatically, the CryoSPARC web interface turns white and does not show any content. At this point, when I type cryosparcm start, it says CryoSPARC is already running, so I have to restart CryoSPARC because it asks me to restart. Running cryosparcm restart shows:

CryoSPARC is running.
Stopping cryoSPARC
unix:///tmp/cryosparc-supervisor-*.sock refused connection

I need to manually remove the cryosparc-supervisor-*.sock file in order to restart CryoSPARC.
After manually removing the cryosparc-supervisor-*.sock file, cryosparcm restart works.

Now I removed the cryosparc-supervisor-*.sock file, restarted CryoSPARC, and then ran:
ps xww | grep -e cryosparc -e mongo

The output of this command is:
3912 ? Sl 0:00 /opt/google/chrome/chrome_crashpad_handler --monitor-self --monitor-self-annotation=ptype=crashpad-handler --database=/home/cryosparcusr/.config/google-chrome/Crash Reports --metrics-dir=/home/cryosparcusr/.config/google-chrome --url=https://clients2.google.com/cr/report --annotation=channel= --annotation=lsb-release=Ubuntu 22.04.1 LTS --annotation=plat=Linux --annotation=prod=Chrome_Linux --annotation=ver=112.0.5615.121 --initial-client-fd=5 --shared-client-connection
3916 ? Sl 0:00 /opt/google/chrome/chrome_crashpad_handler --no-periodic-tasks --monitor-self-annotation=ptype=crashpad-handler --database=/home/cryosparcusr/.config/google-chrome/Crash Reports --url=https://clients2.google.com/cr/report --annotation=channel= --annotation=lsb-release=Ubuntu 22.04.1 LTS --annotation=plat=Linux --annotation=prod=Chrome_Linux --annotation=ver=112.0.5615.121 --initial-client-fd=4 --shared-client-connection
93332 ? Ss 0:00 python /home/cryosparcusr/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /home/cryosparcusr/cryosparc/cryosparc_master/supervisord.conf
93439 ? Sl 0:03 mongod --auth --dbpath /home/cryosparcusr/cryosparc/cryosparc_database --port 39001 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4 --bind_ip_all
93543 ? Sl 0:17 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
93575 ? Sl 0:06 python -c import cryosparc_command.command_vis as serv; serv.start(port=39003)
93605 ? Sl 0:04 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39005)
93669 ? Sl 0:03 /home/cryosparcusr/cryosparc/cryosparc_master/cryosparc_app/api/nodejs/bin/node ./bundle/main.js
94195 pts/1 S+ 0:00 grep --color=auto -e cryosparc -e mongo

  2. Here is the output of the command:
    export CRYOSPARC_MASTER_HOSTNAME="Gambit"

  3. Here is the output of the command:
    [{'cache_path': '/mnt/cs_scratch/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11543379968, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'Gambit', 'lane': 'default', 'monitor_port': None, 'name': 'Gambit', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}, 'ssh_str': 'cryosparcusr@Gambit', 'title': 'Worker node Gambit', 'type': 'node', 'worker_bin_path': '/home/cryosparcusr/cryosparc/cryosparc_worker/bin/cryosparcw'}]

  4. I deleted P15 J5, but I ran J6, which was the same as J5, and it also failed. In this case I did not remove the .sock file manually. I emailed you the new CryoSPARC instance logs and job log.

Thank you,
Ali

Removal of the cryosparc-supervisor-*.sock file under these circumstances is premature and may disrupt future CryoSPARC operation. Before manually removing the file, you must also ensure that the corresponding supervisord process has been terminated.
cryosparcm stop is supposed to terminate the supervisord process, but sometimes fails. Always check with the ps command after running cryosparcm stop and, if necessary,
kill -TERM the supervisord process before manually removing the socket file.

These processes are expected (with different process IDs) after a successful CryoSPARC start or restart. If these processes are still present after cryosparcm stop, you can try
kill -TERM 93332 (replace the process ID with the actual ID of the supervisord process) and wait for 10 seconds; a shell sketch of this cleanup sequence follows the list below. Very likely:

  • the next run of the ps xww command will no longer show the supervisord and the other processes, which were children of the supervisord process
  • /tmp/cryosparc-supervisor-*.sock will no longer be present. Note that /tmp may contain multiple cryosparc-supervisor-*.sock files if multiple CryoSPARC instances are running on the server (which is allowed if certain rules are followed).
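
As an illustration only, here is a minimal shell sketch of the cleanup sequence described above. It assumes a single CryoSPARC instance on the machine and that it is run under the Linux account that owns the CryoSPARC processes; the pgrep pattern is an assumption based on the ps output posted earlier, not something cryosparcm itself provides:

    # Stop CryoSPARC; an orderly shutdown should remove the socket file on its own.
    cryosparcm stop

    # Check for leftover CryoSPARC-related processes (supervisord, mongod, command_* services).
    ps xww | grep -e cryosparc -e mongo

    # If the CryoSPARC supervisord process is still running, terminate it gracefully.
    SUPERVISORD_PID=$(pgrep -f 'cryosparc_master/supervisord.conf')
    if [ -n "$SUPERVISORD_PID" ]; then
        kill -TERM $SUPERVISORD_PID
        sleep 10
    fi

    # Confirm that both the processes and the socket file are gone before any manual cleanup.
    ps xww | grep -e cryosparc -e mongo
    ls /tmp/cryosparc-supervisor-*.sock 2>/dev/null || echo "no leftover socket file"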

Thanks. I received those.

@morteza I have a few more follow-up questions.
While CryoSPARC is up and running, please can you run these commands on Gambit:

host Gambit
host gambit
curl Gambit:39001

host Gambit output:
Gambit has address 127.0.1.1
host gambit output:
gambit has address 127.0.1.1
curl Gambit:39001 output:
It looks like you are trying to access MongoDB over HTTP on the native driver port.

Please can you clarify what you meant with the following terms:

  • end of the run
  • cryoSPARC terminal
  • terminal automatically closes

A side note: The logs you sent us helped us find a bug that could cause the generation of job error reports to fail. This bug will be fixed in an upcoming software release.

While the job is running, or when it is about to finish, the Linux terminal closes automatically and the CryoSPARC web interface turns white until I restart CryoSPARC. When I restart CryoSPARC (cryosparcm restart), the job that was running has failed.

Thank you for this information. I have a few follow-up questions.
What Linux distribution and version is running on Gambit?
Which commands were run in the terminal that closed automatically?
What kind of terminal app are you using?
Do you know why the terminal is closing automatically?

Linux distribution:
Linux Gambit 5.19.0-40-generic #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 31 16:00:14 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The terminal is the one that is open when I run cryosparcm start. After it runs for a while, it suddenly closes automatically and the job fails.
CryoSPARC does not remove the cryosparc-supervisor-*.sock file automatically whenever I start CryoSPARC and launch a job.

The terminal is GNOME Terminal, Version 3.44.0 for GNOME 42

I don’t know why our terminal is closing automatically.

Does “it” refer to cryosparcm start or to the terminal?

You may want to investigate the cause. The cause may provide clues on how to resolve the problems you experience when running CryoSPARC.

@wtempel thank you for your suggestions.
I have a question about the cryosparc-supervisor-*.sock file: why does CryoSPARC not remove it automatically?

The cryosparc-supervisor-*.sock file is expected to be removed automatically during an “orderly” shutdown of the CryoSPARC master services. In the present case, the shutdown is either not initiated at all or is disrupted.
The shutdown is typically initiated with the
cryosparcm stop command. Sometimes,
cryosparcm stop fails to terminate CryoSPARC master processes.
It is therefore a good idea to confirm successful shutdown and the absence of CryoSPARC-related supervisord and mongod processes with a command like

ps -u $USER -opid,ppid,cmd | grep -e cryosparc -e mongo

(This command must be run under the Linux account that usually runs CryoSPARC master processes.)
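
As a side note, that check could be wrapped in a small shell helper, sketched below under the assumption that it runs in the CryoSPARC owner's login shell; the function name is my own and is not part of cryosparcm:

    # Sketch: report whether any CryoSPARC-related processes remain for this user.
    cryosparc_master_stopped() {
        if ps -u "$USER" -opid,ppid,cmd | grep -e cryosparc -e mongo | grep -v grep; then
            echo "CryoSPARC-related processes are still running; do not remove the .sock file yet."
            return 1
        fi
        echo "No CryoSPARC-related processes found; shutdown appears complete."
        return 0
    }

    # Example usage after running cryosparcm stop:
    cryosparc_master_stopped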