Extract job gets stuck

david.haselbach · November 22, 2019, 8:05am

Hi, I have a problem with an extract job.
I have been trying it for a while now, and it always gets stuck at random positions in the extraction procedure. It does not continue, but it still sends the heartbeat. Looking closer, it seems to run out of memory, but I am wondering why. Is there a memory leak? How much memory does an extract job require?

Here is some log output:

[510076.339136] Task in /slurm/uid_12043/job_2064245/step_batch/task_0 killed as a result of limit of /slurm/uid_12043/job_2064245/step_batch
[510076.342843] memory: usage 6681180kB, limit 7372800kB, failcnt 16044041
[510076.344818] memory+swap: usage 6681180kB, limit 9007199254740988kB, failcnt 0
[510076.346953] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[510076.348794] Memory cgroup stats for /slurm/uid_12043/job_2064245/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[510076.354501] Memory cgroup stats for /slurm/uid_12043/job_2064245/step_batch/task_0: cache:26016KB rss:6339016KB rss_huge:2621440KB mapped_file:16960KB swap:0KB inactive_anon:16424KB active_anon:6342420KB inactive_file:7556KB active_file:1956KB unevictable:0KB
[510076.361480] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[510076.363888] [ 1125] 12043 1125 28298 382 11 0 0 bash
[510076.366210] [ 1126] 12043 1126 28331 405 11 0 0 bash
[510076.368538] [ 1132] 12043 1132 132184 20135 126 0 0 python
[510076.370898] [ 1135] 12043 1135 1265934 796451 1788 0 0 python
[510076.373238] [ 1184] 12043 1184 7543374 647471 1581 0 0 python
[510076.375588] Memory cgroup out of memory: Kill process 1254 (python) score 433 or sacrifice child
[510076.378129] Killed process 1184 (python), UID 12043, total-vm:30173496kB, anon-rss:2491160kB, file-rss:102944kB, shmem-rss:8220kB
[510918.881746] slurm.epilog.cl (3928): drop_caches: 3

Best,

David

ArturB · November 27, 2019, 9:00am

Hi,

For me it was the same thing, but only when using multiple GPUs.
With one GPU involved in the process (the default value) it works every single time.

best
ab

david.haselbach · November 28, 2019, 12:19pm

Thanks for the advice, but I was already using only one GPU.

david.haselbach · November 29, 2019, 6:10am

But I have now restarted the job with CPU only, and it runs through.

stephan · December 12, 2019, 7:34pm

Hi @david.haselbach,

It looks like your cluster scheduler enforces a hard cap on the memory a task can use, equal to the amount the task requested. It is possible that our memory estimates for the extraction job are incorrect. You can change the memory requested by the extract from micrographs job by setting the memory_per_gpu variable inside cryosparc2_worker/cryosparc2_compute/jobs/extraction/build.py@builder_extract_micrographs_multi() to a value larger than 4000MB. Maybe try setting it to 8000MB; provided your cluster can satisfy the request, that may keep the cluster from killing your job.
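
To compare the cap the scheduler actually enforced with any value you set for memory_per_gpu, it can help to convert the cgroup figures from the kernel log in the first post into MB. The following is only a minimal sketch: the helper name parse_cgroup_memory is made up for illustration, and it assumes the kB values in the log are 1024-byte units.

```python
import re

# Line copied verbatim from the kernel log in the first post.
LOG_LINE = "[510076.342843] memory: usage 6681180kB, limit 7372800kB, failcnt 16044041"


def parse_cgroup_memory(line):
    """Return (usage_mb, limit_mb) from a cgroup 'memory: usage ..., limit ...' line.

    Hypothetical helper, not part of cryoSPARC or Slurm.
    Assumes the kB figures are 1024-byte units.
    """
    match = re.search(r"memory: usage (\d+)kB, limit (\d+)kB", line)
    if match is None:
        raise ValueError("not a cgroup memory usage line")
    usage_kb, limit_kb = (int(group) for group in match.groups())
    return usage_kb / 1024, limit_kb / 1024


usage_mb, limit_mb = parse_cgroup_memory(LOG_LINE)
print(f"usage {usage_mb:.0f} MB of {limit_mb:.0f} MB limit")
# prints: usage 6525 MB of 7200 MB limit
```

For the log above this reports a limit of 7200 MB with about 6525 MB in use when the process was killed, so a larger request would only help if the scheduler raises the cap accordingly.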