I have queued many jobs; the shortest lasted about 10 minutes before it was killed, and the longest about 10 hours.
After I encountered this, I updated to the latest version, and the kill signal still appears after queuing.
I am not quite sure about this, but I don't think it was reconfigured.
However, earlier today I tried adding
export CRYOSPARC_HEARTBEAT_SECONDS=600
to /cryosparc_master/config.sh and queued 4 of my jobs. The kill signal does not seem to have been sent during this time, and 3 of these jobs completed after running for 5~10 hours (the last one is still running and seems to be working normally).
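Concretely, the change amounts to something like this (path per my install; the restart is there so cryosparc_master re-reads config.sh):

# line appended to /cryosparc_master/config.sh
export CRYOSPARC_HEARTBEAT_SECONDS=600
# restart CryoSPARC so the new value is picked up
cryosparcm restart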
Will this information be helpful for solving the problem?
It is good you found a way to run your jobs to completion. Knowing that increasing CRYOSPARC_HEARTBEAT_SECONDS had this effect suggests several possibilities:
The worker/job is not sending heartbeats for some reason. You can find a history of sent heartbeats in the job log (MetadataLog). Were there gaps in heartbeats being sent regularly? Was the worker under heavy load when such gaps (possibly) occurred? A rough way to check for gaps is sketched after this list.
Or: The worker did send regular heartbeats, but the master either
did not receive them (network issues?)
processed incoming heartbeats with a delay or not at all (due to heavy load?)
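To check for such gaps, one could extract the heartbeat timestamps from the job log and print the intervals between consecutive ones, for example (a sketch; assumes GNU awk and heartbeat lines of the form "========= sending heartbeat at <date> <time>"):

# List heartbeat intervals in seconds; gaps much larger than the regular
# interval would point at the worker-side explanation.
grep 'sending heartbeat at' job.log |
gawk '{ ts = $5 " " substr($6, 1, 8);               # "2023-09-29 19:31:59"
        t  = mktime(gensub(/[-:]/, " ", "g", ts));  # convert to epoch seconds
        if (prev) printf "%s %s  gap: %d s\n", $5, $6, t - prev;
        prev = t }'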
I should have asked earlier: are the NU refinement jobs running on the same computer where the cryosparc_master processes run (or where you typically run cryosparcm commands), or are the refinement jobs running on a separate worker computer?
Unfortunately, we also encountered this problem. We newly installed cryoSPARC on a single workstation and it initially ran smoothly. Today some jobs ended unexpectedly, and I noticed that the cryoSPARC Python processes had been killed and the cryosparcm process had stopped.
The message **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) **** was found in the failed jobs. In job.log, there was no kill signal:
gpufft: creating new cufft plan (plan id 3 pid 23647)
gpu_id 1
ndims 2
dims 360 360 0
inembed 360 360 0
istride 1
idist 129600
onembed 360 360 0
ostride 1
odist 129600
batch 161
type C2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2023-09-29 19:31:59.604551
========= sending heartbeat at 2023-09-29 19:32:09.619710
========= sending heartbeat at 2023-09-29 19:32:19.629312
========= sending heartbeat at 2023-09-29 19:32:29.645932
========= sending heartbeat at 2023-09-29 19:32:39.662549
========= sending heartbeat at 2023-09-29 19:32:49.679726
========= sending heartbeat at 2023-09-29 19:32:59.693998
========= sending heartbeat at 2023-09-29 19:33:09.711632
========= sending heartbeat at 2023-09-29 19:33:19.726038
========= sending heartbeat at 2023-09-29 19:33:29.737936
========= sending heartbeat at 2023-09-29 19:33:39.754463
========= sending heartbeat at 2023-09-29 19:33:49.771771
========= sending heartbeat at 2023-09-29 19:33:59.790043
I am the only person using cryoSPARC, and no other CPU-intensive program was running alongside it. The CPU was also not fully occupied.
Adding export CRYOSPARC_HEARTBEAT_SECONDS=600 to /cryosparc_master/config.sh does not solve the problem.
I was running multiple job types at the same time, e.g. 4x NU jobs or 2dclass+3dclass+NU.
This is the output from free -g:
               total        used        free      shared  buff/cache   available
Mem:             251          63           3           0         185         185
Swap:              0           0           0
From the log, it appears that systemd-oomd killed cryoSPARC because of its heavy RAM usage:
(base) wcyl@wcyl-WS-C621E-SAGE-Series:/var/log$ journalctl --since "2023-09-30 11:41" --until "2023-09-30 11:43"
Sep 30 11:41:07 wcyl-WS-C621E-SAGE-Series systemd-oomd[1363]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 65.12% > 50.00% for > 20s with reclaim activity
Sep 30 11:41:07 wcyl-WS-C621E-SAGE-Series systemd[3160]: vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope: systemd-oomd killed 308 process(es) in this unit.
Sep 30 11:41:08 wcyl-WS-C621E-SAGE-Series systemd[3160]: vte-spawn-778ab6c8-3644-4348-ba33-9a082024578c.scope: Consumed 11h 11min 56.300s CPU time.
I thought this was because no swap space was assigned, so I added 8 GB of swap. However, the problem persists. Is 8 GB of swap space enough for a 256 GB RAM system?
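For reference, adding a swap file amounts to roughly the following (a sketch, assuming an 8 GB file at /swapfile, run as root; on some filesystems dd is needed instead of fallocate):

fallocate -l 8G /swapfile                         # reserve the space
chmod 600 /swapfile                               # swap files must not be world-readable
mkswap /swapfile                                  # format it as swap
swapon /swapfile                                  # enable it immediately
echo '/swapfile none swap sw 0 0' >> /etc/fstab   # keep it across reboots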
Since the log suggested that memory pressure exceeded the 50% limit, I edited /usr/lib/systemd/system/user@.service.d/10-oomd-user-service-defaults.conf to change ManagedOOMMemoryPressureLimit to 95%, rather than disabling the OOM killer entirely as some search results suggested. I am now monitoring whether the problem is still there.
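For anyone else trying this, an equivalent change can be made as a drop-in override so it survives systemd package updates; roughly (the file name 99-oomd-override.conf is arbitrary):

# create an override instead of editing the file under /usr/lib directly
mkdir -p /etc/systemd/system/user@.service.d
cat > /etc/systemd/system/user@.service.d/99-oomd-override.conf <<'EOF'
[Service]
ManagedOOMMemoryPressureLimit=95%
EOF
systemctl daemon-reload   # re-read unit files; a re-login or reboot ensures the
                          # new limit applies to the already-running user session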
I heard that a computer using swap space becomes virtually unusably slow. Therefore I would avoid swapping altogether.
We discussed possible workarounds within our team (other than, possibly, “you may need more RAM if, with these workarounds, you run up against the Nyquist limit and need higher-resolution information”).
One can reduce the RAM requirements of refinement and classification jobs by explicitly downsampling particles.
Parameter-based memory “savings” without an explicit downsampling job are available in 2D Classification (a larger value for Maximum resolution (A)) and 3D Classification (a larger value for Target resolution (A)). You may have to experiment with the effect of increasing these values, as the box size is reduced only “internally”.
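As a rough illustration of why downsampling helps (illustrative arithmetic only; actual RAM use depends on the job type and many other factors): the footprint of a float32 particle image scales with the square of the box size, so halving the box cuts the per-particle size by a factor of 4.

# per-particle size of a float32 2D image at two example box sizes (illustrative only)
awk 'BEGIN { for (box = 440; box >= 220; box -= 220)
               printf "box %3d px: %.2f MB per particle (float32)\n", box, box*box*4/1e6 }'
# box 440 px: 0.77 MB per particle (float32)
# box 220 px: 0.19 MB per particle (float32)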