Hi,
we do have an issue when running refinements on our cluster. It seems like the job seems to need more memory than it specified in the submission script and then dies due to cluster regulations.
[4375679.214795] python invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[4375679.214801] python cpuset=step_batch mems_allowed=0-1
[4375679.214805] CPU: 8 PID: 230566 Comm: python Tainted: P OE ------------ 3.10.0-693.17.1.el7.x86_64 #1
[4375679.214807] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0009.C0004.101920170742 10/19/2017
[4375679.214809] Call Trace:
[4375679.214820] [<ffffffff816a6071>] dump_stack+0x19/0x1b
[4375679.214823] [<ffffffff816a1466>] dump_header+0x90/0x229
[4375679.214829] [<ffffffff811f599e>] ? mem_cgroup_reclaim+0x4e/0x120
[4375679.214836] [<ffffffff81187dc6>] ? find_lock_task_mm+0x56/0xc0
[4375679.214838] [<ffffffff811f36a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[4375679.214842] [<ffffffff81188274>] oom_kill_process+0x254/0x3d0
[4375679.214845] [<ffffffff811f73c6>] mem_cgroup_oom_synchronize+0x546/0x570
[4375679.214848] [<ffffffff811f6840>] ? mem_cgroup_charge_common+0xc0/0xc0
[4375679.214851] [<ffffffff81188b04>] pagefault_out_of_memory+0x14/0x90
[4375679.214856] [<ffffffff8169f82e>] mm_fault_error+0x68/0x12b
[4375679.214862] [<ffffffff816b3a21>] __do_page_fault+0x391/0x450
[4375679.214866] [<ffffffff816b3b15>] do_page_fault+0x35/0x90
[4375679.214869] [<ffffffff816af8f8>] page_fault+0x28/0x30
[4375679.214872] Task in /slurm/uid_12043/job_57823/step_batch killed as a result of limit of /slurm/uid_12043/job_57823
[4375679.214875] memory: usage 24914164kB, limit 24914164kB, failcnt 155151515
[4375679.214877] memory+swap: usage 24914164kB, limit 9007199254740988kB, failcnt 0
[4375679.214878] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4375679.214879] Memory cgroup stats for /slurm/uid_12043/job_57823: cache:0KB rss:328KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:328KB inactive_file:0KB active_file:0KB unevictable:0KB
[4375679.214894] Memory cgroup stats for /slurm/uid_12043/job_57823/step_batch: cache:4096KB rss:24909740KB rss_huge:2048KB mapped_file:4096KB swap:0KB inactive_anon:4096KB active_anon:24909700KB inactive_file:0KB active_file:0KB unevictable:0KB
[4375679.214904] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[4375679.214981] [230525] 0 230525 82681 1134 65 0 -1000 slurmstepd
[4375679.214984] [230555] 12043 230555 28286 380 10 0 0 bash
[4375679.214986] [230556] 12043 230556 28319 404 10 0 0 bash
[4375679.214988] [230562] 12043 230562 117251 22667 116 0 0 python
[4375679.214991] [230566] 12043 230566 10342015 6242537 12437 0 0 python
[4375679.215006] Memory cgroup out of memory: Kill process 271422 (python) score 1004 or sacrifice child
[4375679.215010] Killed process 230566 (python) total-vm:41368060kB, anon-rss:24830160kB, file-rss:135892kB, shmem-rss:4096kB
what can we do about this?