Dear community, I am running some NU-refine jobs, and all of them end up failing with this final notice:
**** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
I would like to ask: what is the “Heartbeat Monitor”, and what is going on? Jobs queued two days ago ran well and ended normally; jobs queued one day ago all failed.
Any advice would be helpful.
wtempel
September 6, 2023, 3:12pm
#2
There are various possible causes for this signal being sent.
How long did the jobs run before the signal was sent?
Does the problem occur on the latest version of CryoSPARC (4.3.1)?
Were there any server or network reconfigurations?
Hi, thanks for your reply.
I have queued many jobs; the shortest ran for about 10 minutes before it was killed, and the longest for about 10 hours.
After I encountered this, I updated to the latest version, and the kill signal still appears after queuing.
I am not quite sure about this, but I don't think it was reconfigured.
However, I tried adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh earlier today and queued 4 of my jobs. It seems the kill signal was not sent during this time, and 3 of the jobs completed after 5-10 hours of running (the last one is still running and seems to be working normally).
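(For reference, what I did was roughly the following; this assumes a standard standalone install where config.sh sits under the cryosparc_master directory, and that the master has to be restarted for config.sh changes to take effect:)
# Append the heartbeat interval override to the master configuration
# (adjust the path to your cryosparc_master location).
echo 'export CRYOSPARC_HEARTBEAT_SECONDS=600' >> /path/to/cryosparc_master/config.sh
# Restart the master so the new setting is picked up.
cryosparcm restart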
Is this information helpful for solving the problem?
wtempel
September 7, 2023, 8:43pm
#4
MiaoXiaoPu1121:
I tried adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh earlier today and queued 4 of my jobs. It seems the kill signal was not sent during this time, and 3 of the jobs completed after 5-10 hours of running (the last one is still running and seems to be working normally).
It is good you found a way to run your jobs to completion. Knowing that increasing CRYOSPARC_HEARTBEAT_SECONDS
had this effect suggests several possibilities:
The worker/job is not sending heartbeats for some reason. You can find a history of sent heartbeats in the job log (Metadata Log). Were there gaps in the heartbeats being sent regularly? Was the worker under heavy load when such gaps (possibly) occurred? (A rough way to scan for gaps is sketched after this list.)
Or: The worker did send regular heartbeats, but the master either
did not receive them (network issues?)
processed incoming heartbeats with a delay or not at all (due to heavy load?)
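One rough way to check the first possibility is to scan the job log for unusually long intervals between heartbeat lines. A sketch only, assuming GNU awk/date and the heartbeat line format shown in job logs (the 30-second threshold is arbitrary, and job.log stands in for the actual log file path):
# Print any interval between consecutive heartbeats longer than 30 s.
grep "sending heartbeat at" job.log | awk '{
    gsub(/\..*$/, "", $6)                  # drop fractional seconds from the time field
    cmd = "date -d \"" $5 " " $6 "\" +%s"  # convert the timestamp to epoch seconds
    cmd | getline t; close(cmd)
    if (prev && t - prev > 30)
        print "gap of", t - prev, "s before", $5, $6
    prev = t
}'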
Hi, thanks for your reply.
I checked the job log, and I think the heartbeats were sent regularly until the kill signal was issued. Could you please take another look?
gpufft: creating new cufft plan (plan id 19 pid 193243)
gpu_id 3
ndims 2
dims 420 420 0
inembed 420 422 0
istride 1
idist 177240
onembed 420 211 0
ostride 1
odist 88620
batch 54
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2023-09-06 01:38:11.454672
========= sending heartbeat at 2023-09-06 01:38:21.489652
========= sending heartbeat at 2023-09-06 01:38:31.527659
========= sending heartbeat at 2023-09-06 01:38:41.568657
========= sending heartbeat at 2023-09-06 01:38:51.607653
========= sending heartbeat at 2023-09-06 01:39:01.645653
========= sending heartbeat at 2023-09-06 01:39:11.683656
========= sending heartbeat at 2023-09-06 01:39:22.086654
========= sending heartbeat at 2023-09-06 01:39:32.125660
========= sending heartbeat at 2023-09-06 01:40:13.905664
========= sending heartbeat at 2023-09-06 01:40:23.947648
========= sending heartbeat at 2023-09-06 01:40:33.984647
========= sending heartbeat at 2023-09-06 01:40:44.027649
========= sending heartbeat at 2023-09-06 01:40:54.071650
========= sending heartbeat at 2023-09-06 01:41:04.117806
========= sending heartbeat at 2023-09-06 01:41:14.152627
========= sending heartbeat at 2023-09-06 01:41:24.186634
========= sending heartbeat at 2023-09-06 01:41:34.214624
========= sending heartbeat at 2023-09-06 01:41:44.248629
========= sending heartbeat at 2023-09-06 01:41:54.283628
========= sending heartbeat at 2023-09-06 01:42:04.327638
========= sending heartbeat at 2023-09-06 01:42:14.362637
========= sending heartbeat at 2023-09-06 01:42:53.036643
========= sending heartbeat at 2023-09-06 01:43:03.073638
========= sending heartbeat at 2023-09-06 01:43:13.106628
========= sending heartbeat at 2023-09-06 01:43:23.150635
========= sending heartbeat at 2023-09-06 01:43:33.186631
========= sending heartbeat at 2023-09-06 01:44:04.993830
========= sending heartbeat at 2023-09-06 01:44:15.051623
========= sending heartbeat at 2023-09-06 01:44:25.088628
========= sending heartbeat at 2023-09-06 01:44:35.130627
========= sending heartbeat at 2023-09-06 01:44:45.246629
========= sending heartbeat at 2023-09-06 01:44:55.288632
========= sending heartbeat at 2023-09-06 01:46:00.262630
========= sending heartbeat at 2023-09-06 01:46:10.307636
========= sending heartbeat at 2023-09-06 01:46:20.347628
========= sending heartbeat at 2023-09-06 01:46:55.069650
========= sending heartbeat at 2023-09-06 01:47:05.109656
========= sending heartbeat at 2023-09-06 01:47:15.168941
========= sending heartbeat at 2023-09-06 01:47:35.201662
========= sending heartbeat at 2023-09-06 01:47:45.246650
========= sending heartbeat at 2023-09-06 01:47:55.280660
========= sending heartbeat at 2023-09-06 01:48:05.316183
========= sending heartbeat at 2023-09-06 01:48:15.347657
========= sending heartbeat at 2023-09-06 01:48:25.387629
========= sending heartbeat at 2023-09-06 01:48:40.813628
========= sending heartbeat at 2023-09-06 01:48:50.853635
========= sending heartbeat at 2023-09-06 01:49:00.906633
========= sending heartbeat at 2023-09-06 01:49:10.940625
========= sending heartbeat at 2023-09-06 01:49:43.186626
========= sending heartbeat at 2023-09-06 01:49:53.226624
========= sending heartbeat at 2023-09-06 01:50:03.265632
========= sending heartbeat at 2023-09-06 01:50:13.301631
========= sending heartbeat at 2023-09-06 01:50:23.347625
========= sending heartbeat at 2023-09-06 01:50:33.392669
========= sending heartbeat at 2023-09-06 01:50:43.447655
========= sending heartbeat at 2023-09-06 01:50:53.484658
========= sending heartbeat at 2023-09-06 01:51:03.522646
========= sending heartbeat at 2023-09-06 01:51:18.144658
========= sending heartbeat at 2023-09-06 01:51:28.186667
========= sending heartbeat at 2023-09-06 01:51:38.244654
========= sending heartbeat at 2023-09-06 01:51:48.282653
========= sending heartbeat at 2023-09-06 01:51:58.314646
========= sending heartbeat at 2023-09-06 01:52:08.346634
========= sending heartbeat at 2023-09-06 01:52:18.384632
========= sending heartbeat at 2023-09-06 01:52:28.422630
========= sending heartbeat at 2023-09-06 01:52:38.453630
========= sending heartbeat at 2023-09-06 01:52:48.486635
========= sending heartbeat at 2023-09-06 01:52:58.523649
========= sending heartbeat at 2023-09-06 01:53:08.554644
========= sending heartbeat at 2023-09-06 01:53:18.586635
========= sending heartbeat at 2023-09-06 01:53:29.827621
========= sending heartbeat at 2023-09-06 01:53:39.859637
========= sending heartbeat at 2023-09-06 01:53:49.889640
========= sending heartbeat at 2023-09-06 01:54:18.321626
========= sending heartbeat at 2023-09-06 01:54:28.633638
========= sending heartbeat at 2023-09-06 01:54:38.673630
========= sending heartbeat at 2023-09-06 01:54:49.973633
========= sending heartbeat at 2023-09-06 01:55:00.024636
========= sending heartbeat at 2023-09-06 01:55:19.229638
========= sending heartbeat at 2023-09-06 01:55:29.265627
========= sending heartbeat at 2023-09-06 01:55:39.306630
========= sending heartbeat at 2023-09-06 01:55:49.350632
========= sending heartbeat at 2023-09-06 01:55:59.387622
========= sending heartbeat at 2023-09-06 01:56:27.262638
========= sending heartbeat at 2023-09-06 01:56:37.306633
========= sending heartbeat at 2023-09-06 01:56:47.347637
========= sending heartbeat at 2023-09-06 01:56:57.385771
========= sending heartbeat at 2023-09-06 01:57:21.692054
========= sending heartbeat at 2023-09-06 01:57:31.729636
========= sending heartbeat at 2023-09-06 01:57:41.768628
========= sending heartbeat at 2023-09-06 01:57:51.806627
========= sending heartbeat at 2023-09-06 01:58:01.846632
========= sending heartbeat at 2023-09-06 01:58:11.885629
========= sending heartbeat at 2023-09-06 01:58:21.923622
========= sending heartbeat at 2023-09-06 01:58:31.955639
========= sending heartbeat at 2023-09-06 01:58:41.986654
========= sending heartbeat at 2023-09-06 01:58:52.028648
========= sending heartbeat at 2023-09-06 01:59:02.065661
========= sending heartbeat at 2023-09-06 01:59:12.106664
========= sending heartbeat at 2023-09-06 01:59:22.148660
========= sending heartbeat at 2023-09-06 01:59:32.182661
========= sending heartbeat at 2023-09-06 01:59:42.227656
========= sending heartbeat at 2023-09-06 01:59:52.260667
========= sending heartbeat at 2023-09-06 02:00:02.294666
========= sending heartbeat at 2023-09-06 02:00:12.332654
========= sending heartbeat at 2023-09-06 02:00:22.370657
========= sending heartbeat at 2023-09-06 02:00:51.399648
========= sending heartbeat at 2023-09-06 02:01:01.432653
========= sending heartbeat at 2023-09-06 02:01:11.464662
========= sending heartbeat at 2023-09-06 02:02:23.477674
========= sending heartbeat at 2023-09-06 02:02:55.428655
========= sending heartbeat at 2023-09-06 02:03:05.471672
========= sending heartbeat at 2023-09-06 02:03:15.507651
========= sending heartbeat at 2023-09-06 02:03:25.546656
========= sending heartbeat at 2023-09-06 02:03:35.583647
========= sending heartbeat at 2023-09-06 02:03:45.614656
========= sending heartbeat at 2023-09-06 02:03:55.652654
========= sending heartbeat at 2023-09-06 02:04:05.688650
========= sending heartbeat at 2023-09-06 02:04:15.750655
========= sending heartbeat at 2023-09-06 02:04:25.786655
========= sending heartbeat at 2023-09-06 02:04:35.824653
========= sending heartbeat at 2023-09-06 02:04:45.862661
========= sending heartbeat at 2023-09-06 02:04:55.898657
========= sending heartbeat at 2023-09-06 02:05:34.330657
========= sending heartbeat at 2023-09-06 02:06:16.118655
========= sending heartbeat at 2023-09-06 02:06:26.168651
========= sending heartbeat at 2023-09-06 02:06:36.212653
========= sending heartbeat at 2023-09-06 02:06:46.250657
========= sending heartbeat at 2023-09-06 02:06:56.288659
========= sending heartbeat at 2023-09-06 02:07:06.334705
========= sending heartbeat at 2023-09-06 02:07:16.367634
========= sending heartbeat at 2023-09-06 02:07:26.405634
========= sending heartbeat at 2023-09-06 02:07:36.443637
========= sending heartbeat at 2023-09-06 02:07:46.474636
========= sending heartbeat at 2023-09-06 02:07:56.507634
========= sending heartbeat at 2023-09-06 02:08:06.545639
========= sending heartbeat at 2023-09-06 02:08:16.584634
========= sending heartbeat at 2023-09-06 02:08:26.624655
========= sending heartbeat at 2023-09-06 02:08:36.662666
========= sending heartbeat at 2023-09-06 02:08:46.693656
========= sending heartbeat at 2023-09-06 02:08:56.727663
========= sending heartbeat at 2023-09-06 02:09:06.768659
========= sending heartbeat at 2023-09-06 02:09:16.827656
========= sending heartbeat at 2023-09-06 02:09:26.865998
========= sending heartbeat at 2023-09-06 02:09:58.551656
========= sending heartbeat at 2023-09-06 02:10:08.585665
========= sending heartbeat at 2023-09-06 02:10:18.617662
========= sending heartbeat at 2023-09-06 02:10:28.663663
========= sending heartbeat at 2023-09-06 02:10:38.702659
========= sending heartbeat at 2023-09-06 02:10:48.747651
========= sending heartbeat at 2023-09-06 02:10:58.776653
========= sending heartbeat at 2023-09-06 02:11:08.812649
========= sending heartbeat at 2023-09-06 02:11:18.850655
========= sending heartbeat at 2023-09-06 02:11:28.894650
========= sending heartbeat at 2023-09-06 02:11:38.927654
========= sending heartbeat at 2023-09-06 02:11:48.966633
========= sending heartbeat at 2023-09-06 02:11:59.005638
========= sending heartbeat at 2023-09-06 02:12:09.046663
========= sending heartbeat at 2023-09-06 02:12:19.085655
========= sending heartbeat at 2023-09-06 02:12:29.123656
========= sending heartbeat at 2023-09-06 02:13:03.276631
========= sending heartbeat at 2023-09-06 02:13:13.327629
========= sending heartbeat at 2023-09-06 02:13:23.371636
========= sending heartbeat at 2023-09-06 02:13:33.403636
========= sending heartbeat at 2023-09-06 02:13:43.437648
========= sending heartbeat at 2023-09-06 02:13:53.474659
========= sending heartbeat at 2023-09-06 02:14:35.621653
========= sending heartbeat at 2023-09-06 02:14:45.654657
========= sending heartbeat at 2023-09-06 02:15:14.143655
========= sending heartbeat at 2023-09-06 02:15:24.187641
========= sending heartbeat at 2023-09-06 02:15:34.225633
========= sending heartbeat at 2023-09-06 02:15:44.262630
========= sending heartbeat at 2023-09-06 02:15:54.293625
========= sending heartbeat at 2023-09-06 02:16:04.328623
========= sending heartbeat at 2023-09-06 02:16:14.367628
========= sending heartbeat at 2023-09-06 02:16:24.408627
========= sending heartbeat at 2023-09-06 02:16:34.446639
========= sending heartbeat at 2023-09-06 02:16:44.486641
========= sending heartbeat at 2023-09-06 02:16:54.526640
========= sending heartbeat at 2023-09-06 02:17:04.574659
========= sending heartbeat at 2023-09-06 02:17:14.606656
========= sending heartbeat at 2023-09-06 02:17:24.652661
========= sending heartbeat at 2023-09-06 02:17:34.690655
========= sending heartbeat at 2023-09-06 02:17:44.726649
========= sending heartbeat at 2023-09-06 02:17:54.786661
========= sending heartbeat at 2023-09-06 02:18:04.824666
========= sending heartbeat at 2023-09-06 02:18:14.862660
========= sending heartbeat at 2023-09-06 02:18:24.906657
========= sending heartbeat at 2023-09-06 02:18:34.952659
========= sending heartbeat at 2023-09-06 02:18:44.987586
========= sending heartbeat at 2023-09-06 02:18:55.025663
========= sending heartbeat at 2023-09-06 02:19:05.071656
========= sending heartbeat at 2023-09-06 02:19:15.112664
/home/yuexin/cryosparc/cryosparc_worker/bin/cryosparcw: line 165: 193242 Terminated python -c "import cryosparc_compute.run as run; run.run()" "$@"
I am not sure how to check the master’s condition; could you give me some instructions on this?
wtempel
September 8, 2023, 2:07pm
#6
I should have asked earlier: are the NU refinement jobs running on the same computer where the cryosparc_master processes run (that is, where you typically run cryosparcm commands), or on a separate worker computer?
Could you please post the output of the commands
free -g
grep -B 4 "model name" /proc/cpuinfo | tail -n 5
How many users are concurrently logged on to this CryoSPARC installation? Are there significant non-CryoSPARC workloads running on the master?
Hi, we are running these jobs on the same computer; it is not a cluster but a single workstation.
The output of the commands you suggested is:
[yuexin@yxtaiyi CS-h2aub]$ free -g
total used free shared buff/cache available
Mem: 502 29 9 5 463 466
Swap: 3 3 0
[yuexin@yxtaiyi CS-h2aub]$ grep -B 4 "model name" /proc/cpuinfo | tail -n 5
processor : 63
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Three users can log on to this CryoSPARC instance, but currently only I am using CryoSPARC; the other users are not using it at the moment.
And yes, it seems another job is running on this workstation: an MD job using GROMACS.
GROMACS is running on several CPUs, but no GPU.
Here is the %Cpu(s) line from the top command, in case it provides more information:
%Cpu(s): 99.3 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
wtempel
September 11, 2023, 12:25am
#9
MiaoXiaoPu1121:
%Cpu(s): 99.3 us,
Maybe GROMACS is using all of the CPUs, leaving no resources for CryoSPARC to process heartbeats in a timely manner?
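If that turns out to be the case, one option is to leave some CPU headroom for the CryoSPARC master processes, for example by capping the GROMACS thread count or lowering its scheduling priority. A sketch only, assuming a threaded (thread-MPI) gmx mdrun run; the thread count and the md_run file name are placeholders:
# Cap GROMACS at a fixed number of threads instead of all 64 cores.
gmx mdrun -nt 48 -deffnm md_run
# Or lower its CPU priority so CryoSPARC processes are not starved.
nice -n 19 gmx mdrun -deffnm md_run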
Yes, you may be right; we will keep monitoring these processes. Thanks for your help and advice!
We unfortunately also encountered this problem. We newly installed CryoSPARC on a single workstation and it initially ran smoothly. Today some jobs ended unexpectedly, and I noticed that the CryoSPARC Python processes were killed and the cryosparcm processes had stopped.
The message **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
was found in the failed jobs. In job.log, there was no kill signal:
gpufft: creating new cufft plan (plan id 3 pid 23647)
gpu_id 1
ndims 2
dims 360 360 0
inembed 360 360 0
istride 1
idist 129600
onembed 360 360 0
ostride 1
odist 129600
batch 161
type C2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2023-09-29 19:31:59.604551
========= sending heartbeat at 2023-09-29 19:32:09.619710
========= sending heartbeat at 2023-09-29 19:32:19.629312
========= sending heartbeat at 2023-09-29 19:32:29.645932
========= sending heartbeat at 2023-09-29 19:32:39.662549
========= sending heartbeat at 2023-09-29 19:32:49.679726
========= sending heartbeat at 2023-09-29 19:32:59.693998
========= sending heartbeat at 2023-09-29 19:33:09.711632
========= sending heartbeat at 2023-09-29 19:33:19.726038
========= sending heartbeat at 2023-09-29 19:33:29.737936
========= sending heartbeat at 2023-09-29 19:33:39.754463
========= sending heartbeat at 2023-09-29 19:33:49.771771
========= sending heartbeat at 2023-09-29 19:33:59.790043
Only I am using CryoSPARC, and no other CPU-consuming program was running alongside it. The CPU was also not fully occupied.
Adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh does not solve the problem.
Is there any way to troubleshoot this?
wtempel
September 29, 2023, 2:21pm
#12
@kpsleung Could you please provide some additional information:
job type
CryoSPARC version and patch
RAM size on the workstation (free -g)
for the job whose log excerpt you posted above, the time killed_at
any relevant information from the Linux system logs that coincides with or slightly precedes the killed_at time (one way to gather this is sketched below)
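For the killed_at time and the matching system log entries, a rough sketch (P99/J199 and the time window are placeholders; get_job is assumed here to accept field names as extra arguments, as in other cryosparcm cli examples):
# Query the job's killed_at field via the cryosparcm cli.
cryosparcm cli "get_job('P99', 'J199', 'killed_at')"
# Pull system log entries bracketing that time (adjust the window to your killed_at).
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"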
I was running multiple job types at the same time, e.g. 4 NU refinement jobs, or 2D classification + 3D classification + NU refinement.
This is the output from free -g:
total used free shared buff/cache available
Mem: 251 63 3 0 185 185
Swap: 0 0 0
Apparently, judging from the log, systemd-oomd killed CryoSPARC due to its heavy RAM usage:
(base) wcyl@wcyl-WS-C621E-SAGE-Series:/var/log$ journalctl --since "2023-09-30 11:41" --until "2023-09-30 11:43"
Sep 30 11:41:07 wcyl-WS-C621E-SAGE-Series systemd-oomd[1363]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 65.12% > 50.00% for > 20s with reclaim activity
Sep 30 11:41:07 wcyl-WS-C621E-SAGE-Series systemd[3160]: vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope: systemd-oomd killed 308 process(es) in this unit.
Sep 30 11:41:08 wcyl-WS-C621E-SAGE-Series systemd[3160]: vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope: Consumed 11h 11min 56.300s CPU time.
I thought this was due to no swap space being assigned, so I added 8 GB. However, the problem persists. Is 8 GB of swap space enough for a system with 256 GB of RAM?
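For reference, a typical way to create and enable such a swap file looks like this (8 GB here, matching what I added; /swapfile is just the conventional path):
# Create and enable an 8 GB swap file.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it persistent across reboots.
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab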
As the log also indicated that memory pressure exceeded the 50% limit, I edited
/usr/lib/systemd/system/user@.service.d/10-oomd-user-service-defaults.conf
to change ManagedOOMMemoryPressureLimit to 95%, rather than disabling the OOM killer entirely as some search results suggest. I am now monitoring whether the problem is still there.
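What I changed is roughly equivalent to the drop-in override below; putting it under /etc/systemd rather than editing the file in /usr/lib directly is generally preferred, since package updates can overwrite /usr/lib (the override file name is just an example), and systemd has to be reloaded afterwards:
# Add an override instead of editing the packaged defaults file.
sudo mkdir -p /etc/systemd/system/user@.service.d
printf '[Service]\nManagedOOMMemoryPressureLimit=95%%\n' | \
    sudo tee /etc/systemd/system/user@.service.d/99-oomd-override.conf
# Reload systemd so the new limit takes effect (a re-login or reboot
# may be needed for existing user sessions).
sudo systemctl daemon-reload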
Thank you for your help!