Memory issues on cluster

david.haselbach · January 9, 2020, 2:50pm

Hi,

We have an issue when running refinements on our cluster. The job seems to need more memory than is specified in the submission script, and it is then killed because of the cluster's memory limits.

[4375679.214795] python invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[4375679.214801] python cpuset=step_batch mems_allowed=0-1
[4375679.214805] CPU: 8 PID: 230566 Comm: python Tainted: P           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
[4375679.214807] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0009.C0004.101920170742 10/19/2017
[4375679.214809] Call Trace:
[4375679.214820]  [<ffffffff816a6071>] dump_stack+0x19/0x1b
[4375679.214823]  [<ffffffff816a1466>] dump_header+0x90/0x229
[4375679.214829]  [<ffffffff811f599e>] ? mem_cgroup_reclaim+0x4e/0x120
[4375679.214836]  [<ffffffff81187dc6>] ? find_lock_task_mm+0x56/0xc0
[4375679.214838]  [<ffffffff811f36a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[4375679.214842]  [<ffffffff81188274>] oom_kill_process+0x254/0x3d0
[4375679.214845]  [<ffffffff811f73c6>] mem_cgroup_oom_synchronize+0x546/0x570
[4375679.214848]  [<ffffffff811f6840>] ? mem_cgroup_charge_common+0xc0/0xc0
[4375679.214851]  [<ffffffff81188b04>] pagefault_out_of_memory+0x14/0x90
[4375679.214856]  [<ffffffff8169f82e>] mm_fault_error+0x68/0x12b
[4375679.214862]  [<ffffffff816b3a21>] __do_page_fault+0x391/0x450
[4375679.214866]  [<ffffffff816b3b15>] do_page_fault+0x35/0x90
[4375679.214869]  [<ffffffff816af8f8>] page_fault+0x28/0x30
[4375679.214872] Task in /slurm/uid_12043/job_57823/step_batch killed as a result of limit of /slurm/uid_12043/job_57823
[4375679.214875] memory: usage 24914164kB, limit 24914164kB, failcnt 155151515
[4375679.214877] memory+swap: usage 24914164kB, limit 9007199254740988kB, failcnt 0
[4375679.214878] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4375679.214879] Memory cgroup stats for /slurm/uid_12043/job_57823: cache:0KB rss:328KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:328KB inactive_file:0KB active_file:0KB unevictable:0KB
[4375679.214894] Memory cgroup stats for /slurm/uid_12043/job_57823/step_batch: cache:4096KB rss:24909740KB rss_huge:2048KB mapped_file:4096KB swap:0KB inactive_anon:4096KB active_anon:24909700KB inactive_file:0KB active_file:0KB unevictable:0KB
[4375679.214904] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4375679.214981] [230525]     0 230525    82681     1134      65        0         -1000 slurmstepd
[4375679.214984] [230555] 12043 230555    28286      380      10        0             0 bash
[4375679.214986] [230556] 12043 230556    28319      404      10        0             0 bash
[4375679.214988] [230562] 12043 230562   117251    22667     116        0             0 python
[4375679.214991] [230566] 12043 230566 10342015  6242537   12437        0             0 python
[4375679.215006] Memory cgroup out of memory: Kill process 271422 (python) score 1004 or sacrifice child
[4375679.215010] Killed process 230566 (python) total-vm:41368060kB, anon-rss:24830160kB, file-rss:135892kB, shmem-rss:4096kB

What can we do about this?
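
For reference, the cgroup limit reported in the log above can be converted to more familiar units. This is a minimal Python sketch, assuming the kernel's kB values are units of 1024 bytes; the function name is made up for illustration.

```python
import re

def cgroup_usage_and_limit_gib(log_text):
    """Return (usage, limit) in GiB from the first 'memory: usage ... limit ...' line."""
    # The 'memory+swap:' line is skipped because the pattern requires 'memory: '.
    match = re.search(r"memory: usage (\d+)kB, limit (\d+)kB", log_text)
    if match is None:
        raise ValueError("no cgroup memory line found")
    usage_kb, limit_kb = (int(g) for g in match.groups())
    kib_per_gib = 1024 ** 2  # assumption: kernel 'kB' means 1024 bytes
    return usage_kb / kib_per_gib, limit_kb / kib_per_gib

line = "[4375679.214875] memory: usage 24914164kB, limit 24914164kB, failcnt 155151515"
usage, limit = cgroup_usage_and_limit_gib(line)
print(f"usage {usage:.2f} GiB, limit {limit:.2f} GiB")
```

Usage equals the limit here, which is what one would expect when the job hits the memory ceiling of its Slurm cgroup rather than exhausting the node's physical RAM.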

david.haselbach · January 10, 2020, 8:26am

We have now hardcoded the memory to 32 GB, and the job does indeed run further. The Slurm output tells us that it does consume more memory than the specified 24 GB. However, we are now stuck on a new error:

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1490, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1072, in cryosparc2_compute.engine.engine.process.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 392, in cryosparc2_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
ValueError: invalid entry in index array

Any thoughts?

david.haselbach · January 14, 2020, 7:09am

Can anyone help with this?

apunjani · January 15, 2020, 7:49pm

Hi @david.haselbach, can you confirm which type of refinement job this is? Is it a “legacy refinement” or a “new refinement” (i.e. with CTF refinement, in v2.12+)? Also, can you tell us the GPU model and CUDA version on the node where the “invalid entry in index array” error occurred?

It’s definitely true that some of the newer job types use more memory than they should (i.e. more than is requested from SLURM). We are working on optimizing the memory usage to fit back within the requested amounts.

david.haselbach · January 21, 2020, 6:45am

It was a legacy refinement.
The index error happened on a node that has 8x NVIDIA GP100GL [Tesla P100 PCIe 12GB] cards.
Our cryoSPARC worker is compiled against CUDA 9.2.88.

apunjani · January 22, 2020, 8:13pm

Hi @david.haselbach,

Is there a chance the particles going into this refinement job came from Topaz in the latest cryoSPARC versions? This issue may be related: Topaz 2D Class problem

david.haselbach · January 23, 2020, 8:13am

Hi @apunjani

No, it was regular autopicking.

Best,

David

hansenbry · January 24, 2020, 9:53pm

Hi - We’re having the same issue with particles that were picked with the traditional template picker and doing refinement with the new homogeneous refinement. Our slurm job seems to fail after about iteration 5 with an out of memory message, but no sign that the system actually ran out of memory. We are using 2x Tesla V100 GPUs. Is there something in the sbatch script for slurm that we need to tweak?

apunjani · January 29, 2020, 8:32pm

@hansenbry the CPU RAM usage of new homogeneous refinement has been substantially reduced in v2.13 (out today) so could you try this and see if that helps?
It’s still likely a good idea to increase the CPU memory requirement specified in the sbatch script, as David has done, since depending on parameters the jobs do sometimes need more RAM than the default value… Unfortunately we haven’t yet had a chance to go through all the jobs and pre-compute the amount of CPU RAM that will be needed before the job runs.

david.haselbach · May 4, 2020, 6:14pm

Hi, we have again run into memory issues and have now hardcoded an even higher memory value. Would it be possible to expose the memory as an advanced option so that it can be set by the user? This would help us a lot.

Best,

David

apunjani · May 11, 2020, 3:18pm

Hi @david.haselbach,
We can consider this, but would e.g. your users typically know how much memory to request for a given job/input params/data?

david.haselbach · May 11, 2020, 4:00pm

I guess most users wouldn’t know exactly, but at least they could find out by trial and error. We really have a number of refinements that just die with the automatically set memory and run through fine when we hardcode the memory in the submission script. And this can only be changed by our administrator, which sometimes leads to quite a time lag.

KiSchnelle · April 26, 2022, 11:15am

Has the memory calculation been reworked in the meantime?

Because I also regularly run into this problem when using bigger box sizes. I already have the default ram_gb multiplied by 2, but for really large box sizes (500-1000) even a factor of 16 was sometimes not enough.

I mean, I can change it myself easily, and typically I just add a project_uid if-statement when that’s the case, but then small jobs like pickers and extraction also request that much RAM.
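
One way to avoid giving small jobs the same inflated request is to scale ram_gb per job type instead of per project. The lookup below is only a sketch: the job type strings and factors are placeholders chosen for illustration, and whether the job type is available inside the submission script template depends on the cryoSPARC version.

```python
# Hypothetical multipliers per job type; tune them from observed peak usage.
RAM_MULTIPLIER = {
    "homo_refine_new": 8.0,
    "new_local_refine": 4.0,
}
DEFAULT_MULTIPLIER = 1.0  # pickers, extraction, etc. keep the default estimate

def requested_mem_mb(ram_gb, job_type):
    """Scale the estimated ram_gb by a job-type factor and return MB for --mem."""
    factor = RAM_MULTIPLIER.get(job_type, DEFAULT_MULTIPLIER)
    return int(ram_gb * factor * 1000)

print(requested_mem_mb(24.0, "homo_refine_new"))
print(requested_mem_mb(8.0, "some_picker_job"))
```

Only the job types listed in the table get a larger request; everything else falls through to the default estimate.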

Have you found a good workaround?

ebirn · April 28, 2022, 1:14pm

We observe the same issue on some jobs (it must be parameter/input dependent):
cryoSPARC estimates mem_gb as 24.0, and the job runs out of memory. When we submit with much larger resources, we see a peak memory consumption of ca. 36 GB. The job type is “new_local_refine”.

wtempel · April 28, 2022, 2:31pm

@ebirn Please can you post non-default job parameters for this job and the box size of your particles?

sascha · April 30, 2022, 10:00am

I’m using cs on the cluster maintained by @ebirn.

For the “Local Refinement (New!)” jobs that run into the issue mentioned above, I am using the default settings, with only particles, a map and a mask as input. The particle box size is 560 px and the particle number is 42k. This requires a total of 68 GB of RAM.
For comparison, the same job with the same particles requires 42 GB of RAM with a box size of 480 px.

The rather big particle box size is required since it otherwise runs into the Nyquist limit.

The initial particles were picked by the template picker and extracted in cs.
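
The two measurements in this post (42 GB at 480 px, 68 GB at 560 px) give a rough idea of how RAM grows with box size. The sketch below fits a simple power law through those two points. This is an assumption made for illustration, not a statement about how cryoSPARC allocates memory, and two data points cannot confirm the model.

```python
import math

def scaling_exponent(box_a, ram_a, box_b, ram_b):
    """Exponent k such that ram is proportional to box**k through both points."""
    return math.log(ram_b / ram_a) / math.log(box_b / box_a)

def extrapolate_ram(box, ref_box, ref_ram, exponent):
    """Extrapolate RAM for another box size under the same power-law assumption."""
    return ref_ram * (box / ref_box) ** exponent

k = scaling_exponent(480, 42.0, 560, 68.0)
print(f"empirical exponent: {k:.2f}")
print(f"extrapolated RAM at 640 px: {extrapolate_ram(640, 560, 68.0, k):.0f} GB")
```

The exponent comes out slightly above 3, i.e. roughly cubic in box size, which would explain why a fixed estimate tuned for typical boxes falls short quickly as boxes grow.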

KiSchnelle · May 2, 2022, 11:35am

For us for example:

Homo Refinement New
Settings: all default
Box-size: 882
NrParticles: 300k
cryoSPARC RAM: 0,1,2 of 512 GB, so I guess 24 GB
Slurm MaxRSS: 164890628K (~157 GB)
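
To compare such MaxRSS values with the ram_gb estimate, the suffixed number can be converted to GiB. A small sketch, assuming the K/M/G/T suffixes denote powers of 1024 and that an unsuffixed value is plain bytes; the helper name is made up for illustration.

```python
UNIT_BYTES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def maxrss_to_gib(value):
    """Convert a MaxRSS string such as '164890628K' to GiB."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNIT_BYTES:
        n_bytes = float(value[:-1]) * UNIT_BYTES[suffix]
    else:
        n_bytes = float(value)  # assumption: no suffix means bytes
    return n_bytes / 1024 ** 3

peak = maxrss_to_gib("164890628K")
print(f"peak {peak:.1f} GiB, about {peak / 24:.1f}x the 24 GB estimate")
```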

ebirn · May 23, 2022, 9:35am

@wtempel see the example by @sascha and maybe @david.haselbach can also provide more information

wtempel · May 24, 2022, 1:32pm

The job type-specific RAM usage estimates are for what we consider “typical” use cases.
For larger than “typical” cases, assuming the actual availability of required RAM resources:

slurm must be configured to allow such jobs
a dedicated “large_mem” cluster lane should be added to your cryoSPARC instance (cryosparcm cluster connect) with a suitably multiplied #SBATCH --mem= parameter inside cluster_script.sh, like in this example. Adding a lane instead of replacing the existing lane has the advantage that cryoSPARC jobs with smaller (“typical”) memory requirements won’t have to “wait” for the availability of large memory resources.
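
As an illustration of the "suitably multiplied" --mem parameter, the sketch below rewrites the memory line of an existing cluster_script.sh template to produce a variant for a large-memory lane. It assumes the template requests memory through a Jinja-style ram_gb expression on its `#SBATCH --mem=` line; the sample template and the helper name are hypothetical, so check them against your own script.

```python
import re

def scale_mem_line(script_text, factor):
    """Multiply ram_gb inside the '#SBATCH --mem=' line of a script template."""
    def _scale(match):
        return match.group(0).replace("ram_gb", f"ram_gb*{factor}", 1)
    new_text, count = re.subn(
        r"^#SBATCH --mem=.*ram_gb.*$", _scale, script_text, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError("no '#SBATCH --mem=' line using ram_gb found")
    return new_text

# Hypothetical excerpt of an existing template, for illustration only.
template = """#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
{{ run_cmd }}
"""
print(scale_mem_line(template, 4))
```

The rewritten script would then be saved next to a cluster_info.json that carries a different lane name and registered with cryosparcm cluster connect, so that the original lane stays available for jobs with typical memory needs.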

bsobol · September 15, 2022, 12:16pm