Dear community, I am running some NU-refine jobs, and all of them end up failing with this final notice:
**** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
I would like to ask: what is the “Heartbeat Monitor”, and what is going on? Jobs queued two days ago ran well and ended normally; jobs queued one day ago all failed.
Any advice would be helpful.
wtempel
September 6, 2023, 3:12pm
#2
There are various possible causes for this signal being sent.
How long did the jobs run before the signal was sent?
Does the problem occur on the latest version of CryoSPARC (4.3.1)?
Were there any server or network reconfigurations?
Hi, thanks for your reply.
I have queued many jobs; the shortest ran for about 10 minutes before it was killed, and the longest for about 10 hours.
After I encountered this, I updated to the latest version, and the kill signal still appears after queuing.
I am not quite sure about this, but I don't think it was reconfigured.
However, I tried adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh earlier today and queued 4 of my jobs. It seems the kill signal was not sent during this time, and 3 of the jobs completed after 5-10 hours of running (the last one is still running and seems to be working normally).
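(For reference, what I did was roughly the following; this assumes a standard standalone install where config.sh sits under the cryosparc_master directory, and that the master has to be restarted for config.sh changes to take effect:)
# Append the heartbeat interval override to the master configuration
# (adjust the path to your cryosparc_master location).
echo 'export CRYOSPARC_HEARTBEAT_SECONDS=600' >> /path/to/cryosparc_master/config.sh
# Restart the master so the new setting is picked up.
cryosparcm restart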
Is this information helpful for solving the problem?
wtempel
September 7, 2023, 8:43pm
#4
MiaoXiaoPu1121:
I tried adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh earlier today and queued 4 of my jobs. It seems the kill signal was not sent during this time, and 3 of the jobs completed after 5-10 hours of running (the last one is still running and seems to be working normally).
It is good you found a way to run your jobs to completion. Knowing that increasing CRYOSPARC_HEARTBEAT_SECONDS
had this effect suggests several possibilities:
The worker/job is not sending heartbeats for some reason. You can find a history of sent heartbeats in the job log (Metadata Log). Were there gaps in the heartbeats being sent regularly? Was the worker under heavy load when such gaps (possibly) occurred? (A rough way to scan for gaps is sketched after this list.)
Or: The worker did send regular heartbeats, but the master either
did not receive them (network issues?)
processed incoming heartbeats with a delay or not at all (due to heavy load?)
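One rough way to check the first possibility is to scan the job log for unusually long intervals between heartbeat lines. A sketch only, assuming GNU awk/date and the heartbeat line format shown in job logs (the 30-second threshold is arbitrary, and job.log stands in for the actual log file path):
# Print any interval between consecutive heartbeats longer than 30 s.
grep "sending heartbeat at" job.log | awk '{
    gsub(/\..*$/, "", $6)                  # drop fractional seconds from the time field
    cmd = "date -d \"" $5 " " $6 "\" +%s"  # convert the timestamp to epoch seconds
    cmd | getline t; close(cmd)
    if (prev && t - prev > 30)
        print "gap of", t - prev, "s before", $5, $6
    prev = t
}'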
Hi, thanks for your reply.
I checked the job log, and I think the heartbeats were sent regularly until the kill signal was issued. Could you please take another look?
gpufft: creating new cufft plan (plan id 19 pid 193243)
gpu_id 3
ndims 2
dims 420 420 0
inembed 420 422 0
istride 1
idist 177240
onembed 420 211 0
ostride 1
odist 88620
batch 54
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2023-09-06 01:38:11.454672
========= sending heartbeat at 2023-09-06 01:38:21.489652
========= sending heartbeat at 2023-09-06 01:38:31.527659
========= sending heartbeat at 2023-09-06 01:38:41.568657
========= sending heartbeat at 2023-09-06 01:38:51.607653
========= sending heartbeat at 2023-09-06 01:39:01.645653
========= sending heartbeat at 2023-09-06 01:39:11.683656
========= sending heartbeat at 2023-09-06 01:39:22.086654
========= sending heartbeat at 2023-09-06 01:39:32.125660
========= sending heartbeat at 2023-09-06 01:40:13.905664
========= sending heartbeat at 2023-09-06 01:40:23.947648
========= sending heartbeat at 2023-09-06 01:40:33.984647
========= sending heartbeat at 2023-09-06 01:40:44.027649
========= sending heartbeat at 2023-09-06 01:40:54.071650
========= sending heartbeat at 2023-09-06 01:41:04.117806
========= sending heartbeat at 2023-09-06 01:41:14.152627
========= sending heartbeat at 2023-09-06 01:41:24.186634
========= sending heartbeat at 2023-09-06 01:41:34.214624
========= sending heartbeat at 2023-09-06 01:41:44.248629
========= sending heartbeat at 2023-09-06 01:41:54.283628
========= sending heartbeat at 2023-09-06 01:42:04.327638
========= sending heartbeat at 2023-09-06 01:42:14.362637
========= sending heartbeat at 2023-09-06 01:42:53.036643
========= sending heartbeat at 2023-09-06 01:43:03.073638
========= sending heartbeat at 2023-09-06 01:43:13.106628
========= sending heartbeat at 2023-09-06 01:43:23.150635
========= sending heartbeat at 2023-09-06 01:43:33.186631
========= sending heartbeat at 2023-09-06 01:44:04.993830
========= sending heartbeat at 2023-09-06 01:44:15.051623
========= sending heartbeat at 2023-09-06 01:44:25.088628
========= sending heartbeat at 2023-09-06 01:44:35.130627
========= sending heartbeat at 2023-09-06 01:44:45.246629
========= sending heartbeat at 2023-09-06 01:44:55.288632
========= sending heartbeat at 2023-09-06 01:46:00.262630
========= sending heartbeat at 2023-09-06 01:46:10.307636
========= sending heartbeat at 2023-09-06 01:46:20.347628
========= sending heartbeat at 2023-09-06 01:46:55.069650
========= sending heartbeat at 2023-09-06 01:47:05.109656
========= sending heartbeat at 2023-09-06 01:47:15.168941
========= sending heartbeat at 2023-09-06 01:47:35.201662
========= sending heartbeat at 2023-09-06 01:47:45.246650
========= sending heartbeat at 2023-09-06 01:47:55.280660
========= sending heartbeat at 2023-09-06 01:48:05.316183
========= sending heartbeat at 2023-09-06 01:48:15.347657
========= sending heartbeat at 2023-09-06 01:48:25.387629
========= sending heartbeat at 2023-09-06 01:48:40.813628
========= sending heartbeat at 2023-09-06 01:48:50.853635
========= sending heartbeat at 2023-09-06 01:49:00.906633
========= sending heartbeat at 2023-09-06 01:49:10.940625
========= sending heartbeat at 2023-09-06 01:49:43.186626
========= sending heartbeat at 2023-09-06 01:49:53.226624
========= sending heartbeat at 2023-09-06 01:50:03.265632
========= sending heartbeat at 2023-09-06 01:50:13.301631
========= sending heartbeat at 2023-09-06 01:50:23.347625
========= sending heartbeat at 2023-09-06 01:50:33.392669
========= sending heartbeat at 2023-09-06 01:50:43.447655
========= sending heartbeat at 2023-09-06 01:50:53.484658
========= sending heartbeat at 2023-09-06 01:51:03.522646
========= sending heartbeat at 2023-09-06 01:51:18.144658
========= sending heartbeat at 2023-09-06 01:51:28.186667
========= sending heartbeat at 2023-09-06 01:51:38.244654
========= sending heartbeat at 2023-09-06 01:51:48.282653
========= sending heartbeat at 2023-09-06 01:51:58.314646
========= sending heartbeat at 2023-09-06 01:52:08.346634
========= sending heartbeat at 2023-09-06 01:52:18.384632
========= sending heartbeat at 2023-09-06 01:52:28.422630
========= sending heartbeat at 2023-09-06 01:52:38.453630
========= sending heartbeat at 2023-09-06 01:52:48.486635
========= sending heartbeat at 2023-09-06 01:52:58.523649
========= sending heartbeat at 2023-09-06 01:53:08.554644
========= sending heartbeat at 2023-09-06 01:53:18.586635
========= sending heartbeat at 2023-09-06 01:53:29.827621
========= sending heartbeat at 2023-09-06 01:53:39.859637
========= sending heartbeat at 2023-09-06 01:53:49.889640
========= sending heartbeat at 2023-09-06 01:54:18.321626
========= sending heartbeat at 2023-09-06 01:54:28.633638
========= sending heartbeat at 2023-09-06 01:54:38.673630
========= sending heartbeat at 2023-09-06 01:54:49.973633
========= sending heartbeat at 2023-09-06 01:55:00.024636
========= sending heartbeat at 2023-09-06 01:55:19.229638
========= sending heartbeat at 2023-09-06 01:55:29.265627
========= sending heartbeat at 2023-09-06 01:55:39.306630
========= sending heartbeat at 2023-09-06 01:55:49.350632
========= sending heartbeat at 2023-09-06 01:55:59.387622
========= sending heartbeat at 2023-09-06 01:56:27.262638
========= sending heartbeat at 2023-09-06 01:56:37.306633
========= sending heartbeat at 2023-09-06 01:56:47.347637
========= sending heartbeat at 2023-09-06 01:56:57.385771
========= sending heartbeat at 2023-09-06 01:57:21.692054
========= sending heartbeat at 2023-09-06 01:57:31.729636
========= sending heartbeat at 2023-09-06 01:57:41.768628
========= sending heartbeat at 2023-09-06 01:57:51.806627
========= sending heartbeat at 2023-09-06 01:58:01.846632
========= sending heartbeat at 2023-09-06 01:58:11.885629
========= sending heartbeat at 2023-09-06 01:58:21.923622
========= sending heartbeat at 2023-09-06 01:58:31.955639
========= sending heartbeat at 2023-09-06 01:58:41.986654
========= sending heartbeat at 2023-09-06 01:58:52.028648
========= sending heartbeat at 2023-09-06 01:59:02.065661
========= sending heartbeat at 2023-09-06 01:59:12.106664
========= sending heartbeat at 2023-09-06 01:59:22.148660
========= sending heartbeat at 2023-09-06 01:59:32.182661
========= sending heartbeat at 2023-09-06 01:59:42.227656
========= sending heartbeat at 2023-09-06 01:59:52.260667
========= sending heartbeat at 2023-09-06 02:00:02.294666
========= sending heartbeat at 2023-09-06 02:00:12.332654
========= sending heartbeat at 2023-09-06 02:00:22.370657
========= sending heartbeat at 2023-09-06 02:00:51.399648
========= sending heartbeat at 2023-09-06 02:01:01.432653
========= sending heartbeat at 2023-09-06 02:01:11.464662
========= sending heartbeat at 2023-09-06 02:02:23.477674
========= sending heartbeat at 2023-09-06 02:02:55.428655
========= sending heartbeat at 2023-09-06 02:03:05.471672
========= sending heartbeat at 2023-09-06 02:03:15.507651
========= sending heartbeat at 2023-09-06 02:03:25.546656
========= sending heartbeat at 2023-09-06 02:03:35.583647
========= sending heartbeat at 2023-09-06 02:03:45.614656
========= sending heartbeat at 2023-09-06 02:03:55.652654
========= sending heartbeat at 2023-09-06 02:04:05.688650
========= sending heartbeat at 2023-09-06 02:04:15.750655
========= sending heartbeat at 2023-09-06 02:04:25.786655
========= sending heartbeat at 2023-09-06 02:04:35.824653
========= sending heartbeat at 2023-09-06 02:04:45.862661
========= sending heartbeat at 2023-09-06 02:04:55.898657
========= sending heartbeat at 2023-09-06 02:05:34.330657
========= sending heartbeat at 2023-09-06 02:06:16.118655
========= sending heartbeat at 2023-09-06 02:06:26.168651
========= sending heartbeat at 2023-09-06 02:06:36.212653
========= sending heartbeat at 2023-09-06 02:06:46.250657
========= sending heartbeat at 2023-09-06 02:06:56.288659
========= sending heartbeat at 2023-09-06 02:07:06.334705
========= sending heartbeat at 2023-09-06 02:07:16.367634
========= sending heartbeat at 2023-09-06 02:07:26.405634
========= sending heartbeat at 2023-09-06 02:07:36.443637
========= sending heartbeat at 2023-09-06 02:07:46.474636
========= sending heartbeat at 2023-09-06 02:07:56.507634
========= sending heartbeat at 2023-09-06 02:08:06.545639
========= sending heartbeat at 2023-09-06 02:08:16.584634
========= sending heartbeat at 2023-09-06 02:08:26.624655
========= sending heartbeat at 2023-09-06 02:08:36.662666
========= sending heartbeat at 2023-09-06 02:08:46.693656
========= sending heartbeat at 2023-09-06 02:08:56.727663
========= sending heartbeat at 2023-09-06 02:09:06.768659
========= sending heartbeat at 2023-09-06 02:09:16.827656
========= sending heartbeat at 2023-09-06 02:09:26.865998
========= sending heartbeat at 2023-09-06 02:09:58.551656
========= sending heartbeat at 2023-09-06 02:10:08.585665
========= sending heartbeat at 2023-09-06 02:10:18.617662
========= sending heartbeat at 2023-09-06 02:10:28.663663
========= sending heartbeat at 2023-09-06 02:10:38.702659
========= sending heartbeat at 2023-09-06 02:10:48.747651
========= sending heartbeat at 2023-09-06 02:10:58.776653
========= sending heartbeat at 2023-09-06 02:11:08.812649
========= sending heartbeat at 2023-09-06 02:11:18.850655
========= sending heartbeat at 2023-09-06 02:11:28.894650
========= sending heartbeat at 2023-09-06 02:11:38.927654
========= sending heartbeat at 2023-09-06 02:11:48.966633
========= sending heartbeat at 2023-09-06 02:11:59.005638
========= sending heartbeat at 2023-09-06 02:12:09.046663
========= sending heartbeat at 2023-09-06 02:12:19.085655
========= sending heartbeat at 2023-09-06 02:12:29.123656
========= sending heartbeat at 2023-09-06 02:13:03.276631
========= sending heartbeat at 2023-09-06 02:13:13.327629
========= sending heartbeat at 2023-09-06 02:13:23.371636
========= sending heartbeat at 2023-09-06 02:13:33.403636
========= sending heartbeat at 2023-09-06 02:13:43.437648
========= sending heartbeat at 2023-09-06 02:13:53.474659
========= sending heartbeat at 2023-09-06 02:14:35.621653
========= sending heartbeat at 2023-09-06 02:14:45.654657
========= sending heartbeat at 2023-09-06 02:15:14.143655
========= sending heartbeat at 2023-09-06 02:15:24.187641
========= sending heartbeat at 2023-09-06 02:15:34.225633
========= sending heartbeat at 2023-09-06 02:15:44.262630
========= sending heartbeat at 2023-09-06 02:15:54.293625
========= sending heartbeat at 2023-09-06 02:16:04.328623
========= sending heartbeat at 2023-09-06 02:16:14.367628
========= sending heartbeat at 2023-09-06 02:16:24.408627
========= sending heartbeat at 2023-09-06 02:16:34.446639
========= sending heartbeat at 2023-09-06 02:16:44.486641
========= sending heartbeat at 2023-09-06 02:16:54.526640
========= sending heartbeat at 2023-09-06 02:17:04.574659
========= sending heartbeat at 2023-09-06 02:17:14.606656
========= sending heartbeat at 2023-09-06 02:17:24.652661
========= sending heartbeat at 2023-09-06 02:17:34.690655
========= sending heartbeat at 2023-09-06 02:17:44.726649
========= sending heartbeat at 2023-09-06 02:17:54.786661
========= sending heartbeat at 2023-09-06 02:18:04.824666
========= sending heartbeat at 2023-09-06 02:18:14.862660
========= sending heartbeat at 2023-09-06 02:18:24.906657
========= sending heartbeat at 2023-09-06 02:18:34.952659
========= sending heartbeat at 2023-09-06 02:18:44.987586
========= sending heartbeat at 2023-09-06 02:18:55.025663
========= sending heartbeat at 2023-09-06 02:19:05.071656
========= sending heartbeat at 2023-09-06 02:19:15.112664
/home/yuexin/cryosparc/cryosparc_worker/bin/cryosparcw: line 165: 193242 Terminated python -c "import cryosparc_compute.run as run; run.run()" "$@"
I am not sure how to check the master’s condition; could you give me some instructions on this?
wtempel
September 8, 2023, 2:07pm
#6
I should have asked earlier: are the NU refinement jobs running on the same computer where the cryosparc_master processes run (that is, where you typically run cryosparcm commands), or on a separate worker computer?
Could you please post the output of the commands
free -g
grep -B 4 "model name" /proc/cpuinfo | tail -n 5
How many users are concurrently logged on to this CryoSPARC installation? Are there significant non-CryoSPARC workloads running on the master?
Hi, we are running these jobs on the same computer; it is not a cluster but a single workstation.
The output of the commands you suggested is:
[yuexin@yxtaiyi CS-h2aub]$ free -g
total used free shared buff/cache available
Mem: 502 29 9 5 463 466
Swap: 3 3 0
[yuexin@yxtaiyi CS-h2aub]$ grep -B 4 "model name" /proc/cpuinfo | tail -n 5
processor : 63
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Three users can log on to this CryoSPARC instance, but currently only I am using CryoSPARC; the other users are not using it at the moment.
And yes, it seems another job is running on this workstation: an MD job using GROMACS.
GROMACS is running on several CPUs, but no GPU.
Here is the %Cpu(s) line from the top command, in case it provides more information:
%Cpu(s): 99.3 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
wtempel
September 11, 2023, 12:25am
#9
MiaoXiaoPu1121:
%Cpu(s): 99.3 us,
Maybe GROMACS is using all of the CPUs, leaving no resources for CryoSPARC to process heartbeats in a timely manner?
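If that turns out to be the case, one option is to leave some CPU headroom for the CryoSPARC master processes, for example by capping the GROMACS thread count or lowering its scheduling priority. A sketch only, assuming a threaded (thread-MPI) gmx mdrun run; the thread count and the md_run file name are placeholders:
# Cap GROMACS at a fixed number of threads instead of all 64 cores.
gmx mdrun -nt 48 -deffnm md_run
# Or lower its CPU priority so CryoSPARC processes are not starved.
nice -n 19 gmx mdrun -deffnm md_run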
Yes, you may be right; we will keep monitoring these processes. Thanks for your help and advice!
We unfortunately also encountered this problem. We newly installed CryoSPARC on a single workstation and it initially ran smoothly. Today some jobs ended unexpectedly, and I noticed that the CryoSPARC Python processes were killed and the cryosparcm processes had stopped.
The message **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
was found in the failed jobs. In job.log, there was no kill signal:
gpufft: creating new cufft plan (plan id 3 pid 23647)
gpu_id 1
ndims 2
dims 360 360 0
inembed 360 360 0
istride 1
idist 129600
onembed 360 360 0
ostride 1
odist 129600
batch 161
type C2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2023-09-29 19:31:59.604551
========= sending heartbeat at 2023-09-29 19:32:09.619710
========= sending heartbeat at 2023-09-29 19:32:19.629312
========= sending heartbeat at 2023-09-29 19:32:29.645932
========= sending heartbeat at 2023-09-29 19:32:39.662549
========= sending heartbeat at 2023-09-29 19:32:49.679726
========= sending heartbeat at 2023-09-29 19:32:59.693998
========= sending heartbeat at 2023-09-29 19:33:09.711632
========= sending heartbeat at 2023-09-29 19:33:19.726038
========= sending heartbeat at 2023-09-29 19:33:29.737936
========= sending heartbeat at 2023-09-29 19:33:39.754463
========= sending heartbeat at 2023-09-29 19:33:49.771771
========= sending heartbeat at 2023-09-29 19:33:59.790043
Only I am using CryoSPARC, and no other CPU-consuming program was running alongside it. The CPU was also not fully occupied.
Adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh does not solve the problem.
Is there any way to troubleshoot this?
wtempel
September 29, 2023, 2:21pm
#12
@kpsleung Could you please provide some additional information:
job type
CryoSPARC version and patch
RAM size on the workstation (free -g)
for the job whose log excerpt you posted above, the time killed_at
any relevant information from the Linux system logs that coincides with or slightly precedes the killed_at time (one way to gather this is sketched below)
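For the killed_at time and the matching system log entries, a rough sketch (P99/J199 and the time window are placeholders; get_job is assumed here to accept field names as extra arguments, as in other cryosparcm cli examples):
# Query the job's killed_at field via the cryosparcm cli.
cryosparcm cli "get_job('P99', 'J199', 'killed_at')"
# Pull system log entries bracketing that time (adjust the window to your killed_at).
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"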
I was running multiple job types at the same time, e.g. 4 NU refinement jobs, or 2D classification + 3D classification + NU refinement.
This is the output from free -g:
total used free shared buff/cache available
Mem: 251 63 3 0 185 185
Swap: 0 0 0
Apparently, judging from the log, systemd-oomd killed CryoSPARC due to its heavy RAM usage:
(base) wcyl@wcyl-WS-C621E-SAGE-Series:/var/log$ journalctl --since "2023-09-30 11:41" --until "2023-09-30 11:43"
Sep 30 11:41:07 wcyl-WS-C621E-SAGE-Series systemd-oomd[1363]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 65.12% > 50.00% for > 20s with reclaim activity
Sep 30 11:41:07 wcyl-WS-C621E-SAGE-Series systemd[3160]: vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope: systemd-oomd killed 308 process(es) in this unit.
Sep 30 11:41:08 wcyl-WS-C621E-SAGE-Series systemd[3160]: vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope: Consumed 11h 11min 56.300s CPU time.
I thought this was due to no swap space being assigned, so I added 8 GB. However, the problem persists. Is 8 GB of swap space enough for a system with 256 GB of RAM?
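For reference, a typical way to create and enable such a swap file looks like this (8 GB here, matching what I added; /swapfile is just the conventional path):
# Create and enable an 8 GB swap file.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it persistent across reboots.
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab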
As the log also indicated that memory pressure exceeded the 50% limit, I edited
/usr/lib/systemd/system/user@.service.d/10-oomd-user-service-defaults.conf
to change ManagedOOMMemoryPressureLimit to 95%, rather than disabling the OOM killer entirely as some search results suggest. I am now monitoring whether the problem is still there.
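What I changed is roughly equivalent to the drop-in override below; putting it under /etc/systemd rather than editing the file in /usr/lib directly is generally preferred, since package updates can overwrite /usr/lib (the override file name is just an example), and systemd has to be reloaded afterwards:
# Add an override instead of editing the packaged defaults file.
sudo mkdir -p /etc/systemd/system/user@.service.d
printf '[Service]\nManagedOOMMemoryPressureLimit=95%%\n' | \
    sudo tee /etc/systemd/system/user@.service.d/99-oomd-override.conf
# Reload systemd so the new limit takes effect (a re-login or reboot
# may be needed for existing user sessions).
sudo systemctl daemon-reload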
Thank you for your help!