Memory issues on cluster

Hi,

we do have an issue when running refinements on our cluster. It seems like the job seems to need more memory than it specified in the submission script and then dies due to cluster regulations.

[4375679.214795] python invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[4375679.214801] python cpuset=step_batch mems_allowed=0-1
[4375679.214805] CPU: 8 PID: 230566 Comm: python Tainted: P           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
[4375679.214807] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0009.C0004.101920170742 10/19/2017
[4375679.214809] Call Trace:
[4375679.214820]  [<ffffffff816a6071>] dump_stack+0x19/0x1b
[4375679.214823]  [<ffffffff816a1466>] dump_header+0x90/0x229
[4375679.214829]  [<ffffffff811f599e>] ? mem_cgroup_reclaim+0x4e/0x120
[4375679.214836]  [<ffffffff81187dc6>] ? find_lock_task_mm+0x56/0xc0
[4375679.214838]  [<ffffffff811f36a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[4375679.214842]  [<ffffffff81188274>] oom_kill_process+0x254/0x3d0
[4375679.214845]  [<ffffffff811f73c6>] mem_cgroup_oom_synchronize+0x546/0x570
[4375679.214848]  [<ffffffff811f6840>] ? mem_cgroup_charge_common+0xc0/0xc0
[4375679.214851]  [<ffffffff81188b04>] pagefault_out_of_memory+0x14/0x90
[4375679.214856]  [<ffffffff8169f82e>] mm_fault_error+0x68/0x12b
[4375679.214862]  [<ffffffff816b3a21>] __do_page_fault+0x391/0x450
[4375679.214866]  [<ffffffff816b3b15>] do_page_fault+0x35/0x90
[4375679.214869]  [<ffffffff816af8f8>] page_fault+0x28/0x30
[4375679.214872] Task in /slurm/uid_12043/job_57823/step_batch killed as a result of limit of /slurm/uid_12043/job_57823
[4375679.214875] memory: usage 24914164kB, limit 24914164kB, failcnt 155151515
[4375679.214877] memory+swap: usage 24914164kB, limit 9007199254740988kB, failcnt 0
[4375679.214878] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4375679.214879] Memory cgroup stats for /slurm/uid_12043/job_57823: cache:0KB rss:328KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:328KB inactive_file:0KB active_file:0KB unevictable:0KB
[4375679.214894] Memory cgroup stats for /slurm/uid_12043/job_57823/step_batch: cache:4096KB rss:24909740KB rss_huge:2048KB mapped_file:4096KB swap:0KB inactive_anon:4096KB active_anon:24909700KB inactive_file:0KB active_file:0KB unevictable:0KB
[4375679.214904] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4375679.214981] [230525]     0 230525    82681     1134      65        0         -1000 slurmstepd
[4375679.214984] [230555] 12043 230555    28286      380      10        0             0 bash
[4375679.214986] [230556] 12043 230556    28319      404      10        0             0 bash
[4375679.214988] [230562] 12043 230562   117251    22667     116        0             0 python
[4375679.214991] [230566] 12043 230566 10342015  6242537   12437        0             0 python
[4375679.215006] Memory cgroup out of memory: Kill process 271422 (python) score 1004 or sacrifice child
[4375679.215010] Killed process 230566 (python) total-vm:41368060kB, anon-rss:24830160kB, file-rss:135892kB, shmem-rss:4096kB

what can we do about this?

We now hardcoded the memory usage to 32 GB and it run indeed further. The slurm output tells us it indeed consumes more mory the the specified 24 GB. However now we got stock with a new error:

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1490, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1072, in cryosparc2_compute.engine.engine.process.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 392, in cryosparc2_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
ValueError: invalid entry in index array

Any thoughts?

can anyone help with this?

Hi @david.haselbach, can you confirm which type of refinement job this is? is it a “legacy refinement” or a “new refinement” (i.e. with CTF refinement, in v2.12+) ? Also can you tell us the GPU model and CUDA version that was running on the node where the invalid entry in index array error occurred?

It’s definitely true that some of the newer job types use more memory than they should (i.e. more than is requested from SLURM). We are working on optimizing the memory usage to fit back within the requested amounts.

it was a legacy refinement.
The index error happend on node which has 8x NVIDIA GP100GL [Tesla P100 PCIe 12GB] cards.
Our crysoparc worker is compiled against Cuda 9.2.88.

Hi @david.haselbach,

Is there a chance the particles going into this refinement job came from Topaz in the latest cryoSPARC versions? This issue may be related: Topaz 2D Class problem

Hi @apunjani

no it was regular autopicking.

Best,

David

Hi - We’re having the same issue with particles that were picked with the traditional template picker and doing refinement with the new homogeneous refinement. Our slurm job seems to fail after about iteration 5 with an out of memory message, but no sign that the system actually ran out of memory. We are using 2x Tesla V100 GPUs. Is there something in the sbatch script for slurm that we need to tweak?

@hansenbry the CPU RAM usage of new homogeneous refinement has been substantially reduced in v2.13 (out today) so could you try this and see if that helps?
It’s likely though that it’s a good idea to increase the CPU memory requirement specified in the sbatch script as David has done, since depending on parameters, the jobs do sometimes need more RAM than the default value… unfortunately we haven’t yet had a chance to go through all the jobs and pre-compute the amount of CPU RAM that will be needed before the job runs.

Hi we again run into memory issues and have hardcoded even higher memory now. Is there a possibility to have the memory as advanced options such that it can be provided by the user. This would help us a lot.

Best,

David

Hi @david.haselbach,
We can consider this, but would e.g. your users typically know how much memory to request for a given job/input params/data?

I guess most user’s wouldn’t know exactly, but at least there is the possibility to find it out via trial and error. We really have a number of refinements that just die with the automatically set memory and just run through when we hardcode the memory in the submission script. And changing of this can only done by our administrator which leads to quite some time lack, sometimes.