Dear all,
Maybe you have experienced the same problem and know a solution. Usually during Motion Correction, large 2D Classification, and NU Refinement jobs, CryoSPARC stops at seemingly random points: there is no heartbeat for 180 seconds and the job fails with an error. There is always the same socket file blocking CryoSPARC. Removing it and restarting CryoSPARC/the PC helps, but the problem comes back at random points, and more and more often over the last months.
We have the latest CryoSPARC build, v4.4.1+240110, on Ubuntu 22.04.3 LTS, with an Intel Core i9-7940 3.10 GHz, 2x NVIDIA RTX 2080 Ti, 1x Quadro P4000, and 128 GB RAM.
Any advice is highly valuable, and we will provide any additional information if needed.
Thank you for your help!
Update 17/04/24: I have also noticed that only 60 GB RAM is shown as available after restarting CryoSPARC alone, whereas after restarting the PC the normal 120 GB RAM is available. And once the socket has reappeared and been removed, it shows 60 GB again. We will check whether this is related to RAM.
Welcome to the forum @ArtemS.
This description makes me wonder if
- a job or jobs with a large aggregate memory footprint exhausts the system RAM.
- the kernel’s “OOM killer” terminates CryoSPARC-related processes, bypassing CryoSPARC process and socket management.
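For a quick look at the out-of-memory situation, the standard systemd utilities below may also be informative (a minimal sketch, assuming the systemd-oomd userspace OOM daemon that Ubuntu 22.04 enables by default; skip this if it is not running on your system):
systemctl status systemd-oomd   # is the userspace OOM killer active?
oomctl dump                     # show the cgroups systemd-oomd monitors and their current memory pressure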
Is this without any CryoSPARC job running? Please can you run these commands after restarting the computer and starting CryoSPARC:
cat /sys/kernel/mm/transparent_hugepage/enabled
free -g
ps -eo pid,ppid,start,vsz,rsz,cmd | grep -e cryosparc_ -e mongo
sudo journalctl | grep -i oom
Please also let us know if this CryoSPARC instance is a “standalone” combined master/worker.
Dear @wtempel, thank you so much for your reply!
It indeed looks like we have an OOM killer at work:
apr 16 14:40:54 sparc systemd-oomd[2428]: Killed /user.slice/user-1002.slice/user@1002.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-bf5ce946-1eda-4208-aec0-31257e820a16.scope due to memory pressure for /user.slice/user-1002.slice/user@1002.service being 77.18% > 50.00% for > 20s with reclaim activity
apr 16 14:40:54 sparc systemd[3930]: vte-spawn-bf5ce946-1eda-4208-aec0-31257e820a16.scope: systemd-oomd killed 149 process(es) in this unit.
apr 16 17:58:44 sparc systemd-oomd[2428]: Killed /user.slice/user-1002.slice/user@1002.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-1f85f781-b9ed-4a56-a8ae-0953983f200c.scope due to memory pressure for /user.slice/user-1002.slice/user@1002.service being 74.02% > 50.00% for > 20s with reclaim activity
apr 16 17:58:44 sparc systemd[3930]: vte-spawn-1f85f781-b9ed-4a56-a8ae-0953983f200c.scope: systemd-oomd killed 131 process(es) in this unit.
apr 16 18:01:36 sparc systemd[1]: Stopping Userspace Out-Of-Memory (OOM) Killer...
apr 16 18:01:36 sparc systemd[1]: systemd-oomd.service: Deactivated successfully.
apr 16 18:01:36 sparc systemd[1]: Stopped Userspace Out-Of-Memory (OOM) Killer.
apr 16 18:01:36 sparc systemd[1]: systemd-oomd.service: Consumed 24.347s CPU time.
apr 16 18:04:56 sparc systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer...
apr 16 18:04:56 sparc systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
apr 16 20:12:16 sparc systemd-oomd[2302]: Killed /user.slice/user-1002.slice/user@1002.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-ca2fdfa9-ddb4-4593-a31a-3129fd58ec77.scope due to memory pressure for /user.slice/user-1002.slice/user@1002.service being 65.92% > 50.00% for > 20s with reclaim activity
apr 16 20:12:16 sparc systemd[3860]: vte-spawn-ca2fdfa9-ddb4-4593-a31a-3129fd58ec77.scope: systemd-oomd killed 135 process(es) in this unit.
apr 17 11:47:11 sparc systemd-oomd[2302]: Killed /user.slice/user-1002.slice/user@1002.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-6b6ebd8f-6ec3-437a-ab59-a11122d79cc1.scope due to memory pressure for /user.slice/user-1002.slice/user@1002.service being 71.21% > 50.00% for > 20s with reclaim activity
apr 17 11:47:11 sparc systemd[3860]: vte-spawn-6b6ebd8f-6ec3-437a-ab59-a11122d79cc1.scope: systemd-oomd killed 80 process(es) in this unit.
apr 17 11:58:38 sparc systemd[1]: Stopping Userspace Out-Of-Memory (OOM) Killer...
apr 17 11:58:38 sparc systemd[1]: systemd-oomd.service: Deactivated successfully.
apr 17 11:58:38 sparc systemd[1]: Stopped Userspace Out-Of-Memory (OOM) Killer.
apr 17 11:58:38 sparc systemd[1]: systemd-oomd.service: Consumed 1min 47.760s CPU time.
apr 17 12:02:35 sparc systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer...
apr 17 12:02:35 sparc systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
We have not tested it; I have only seen it in CryoSPARC job messages that indicate the amount of used and available RAM.
We run CryoSPARC as a “standalone” combined master/worker on the same PC.
We have also run a MemTest86 test on our RAM; unfortunately, it found no errors. But we will probably still buy new RAM. We are also thinking of testing each half of the RAM sticks separately, since 64 GB should be enough for our jobs (maximum box size around 420-450 pixels, ~300k particles).
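For a rough back-of-the-envelope sense of the raw data sizes involved (only an illustration, assuming 4-byte single-precision values; actual per-job RAM use in CryoSPARC depends on the job type, number of classes, and caching):
echo '300000 * 450 * 450 * 4' | bc | numfmt --to=si   # total size of the full particle stack, ~243 GB
echo '450 * 450 * 450 * 4' | bc | numfmt --to=si      # size of a single 450^3 volume, ~365 MB
Since the full stack is far larger than the installed RAM in any case, whether 64 GB is enough will depend mostly on how much each job holds in memory at once rather than on the total dataset size.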
A word of caution: before removing a CryoSPARC-related socket file, always confirm that all corresponding processes have exited.
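For example, on a standalone instance something like the sketch below (the exact socket path depends on the installation, so treat the last step as a placeholder and use the path reported in your error message):
cryosparcm stop
ps -eo pid,ppid,cmd | grep -e cryosparc_ -e mongo   # should return nothing CryoSPARC- or mongod-related (apart from the grep itself)
# only then remove the stale socket file, e.g. a /tmp/cryosparc-supervisor-*.sock file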