Job process terminated abnormally during Homogeneous Refinement

Hello,

My Homogeneous Refinement job keeps failing shortly after iteration 001 begins (roughly 2 hours after the job starts). The only information I get about the error is the following:

[CPU: 25.0 MB Avail: 28.54 GB] ====== Job process terminated abnormally.

Info:

Current cryoSPARC version: v4.4.1

NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2

Linux esc309516 5.15.0-102-generic #112~20.04.1-Ubuntu SMP Thu Mar 14 14:28:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

              total        used        free      shared  buff/cache   available
Mem:             31           3          26           0           0          27
Swap:             1           0           1

Your help would be much appreciated

Thank you!

Welcome to the forum @NMCCANN2.
Please can you provide additional details:

  • particle box size
  • how many particles
  • output of the commands
    nvidia-smi --query-gpu=index,name --format=csv
    cryosparcm joblog P99 J123 | tail -n 30
    sudo dmesg -T | grep -i oom
    
    where P99 and J123 should be replaced with the actual project and job IDs, respectively.

Hi @wtempel, thanks for your response.

  • Particle box size = 512 px
  • 105,649 particles

index, name
0, NVIDIA RTX A5000

========= sending heartbeat at 2024-04-10 15:26:04.646111
========= sending heartbeat at 2024-04-10 15:26:14.788967
========= sending heartbeat at 2024-04-10 15:26:24.958891
========= sending heartbeat at 2024-04-10 15:26:35.058158
========= sending heartbeat at 2024-04-10 15:26:45.227041
========= sending heartbeat at 2024-04-10 15:26:55.399605
========= sending heartbeat at 2024-04-10 15:27:05.520274
========= sending heartbeat at 2024-04-10 15:27:15.618699
========= sending heartbeat at 2024-04-10 15:27:25.821621
========= sending heartbeat at 2024-04-10 15:27:35.954142
========= sending heartbeat at 2024-04-10 15:27:46.174951
========= sending heartbeat at 2024-04-10 15:27:56.301866
========= sending heartbeat at 2024-04-10 15:28:06.459722
========= sending heartbeat at 2024-04-10 15:28:16.623816
========= sending heartbeat at 2024-04-10 15:28:26.852826
========= sending heartbeat at 2024-04-10 15:28:36.927826
========= sending heartbeat at 2024-04-10 15:28:47.045277
========= sending heartbeat at 2024-04-10 15:28:57.097423
========= sending heartbeat at 2024-04-10 15:29:07.190983
========= sending heartbeat at 2024-04-10 15:29:17.400361
========= sending heartbeat at 2024-04-10 15:29:27.519494
========= sending heartbeat at 2024-04-10 15:29:37.625820
========= sending heartbeat at 2024-04-10 15:29:47.757256
========= sending heartbeat at 2024-04-10 15:29:57.787177
========= sending heartbeat at 2024-04-10 15:30:07.802660
========= sending heartbeat at 2024-04-10 15:30:17.812576
========= sending heartbeat at 2024-04-10 15:30:27.827730
========= sending heartbeat at 2024-04-10 15:30:37.842889
========= main process now complete at 2024-04-10 15:30:44.728494.
========= monitor process now complete at 2024-04-10 15:30:45.126978

[Tue Apr 9 16:54:40 2024] WTJourn.Flusher invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Tue Apr 9 16:54:40 2024] oom_kill_process.cold+0xb/0x10
[Tue Apr 9 16:54:40 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Tue Apr 9 16:54:40 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-5.scope,task=python,pid=3466,uid=1001
[Tue Apr 9 16:54:40 2024] Out of memory: Killed process 3466 (python) total-vm:55687656kB, anon-rss:28414772kB, file-rss:65244kB, shmem-rss:2048kB, UID:1001 pgtables:61056kB oom_score_adj:0
[Tue Apr 9 20:01:02 2024] cron invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Tue Apr 9 20:01:02 2024] oom_kill_process.cold+0xb/0x10
[Tue Apr 9 20:01:02 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Tue Apr 9 20:01:02 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-12.scope,task=python,pid=5256,uid=1001
[Tue Apr 9 20:01:02 2024] Out of memory: Killed process 5256 (python) total-vm:55668288kB, anon-rss:28887752kB, file-rss:67904kB, shmem-rss:3552kB, UID:1001 pgtables:61340kB oom_score_adj:0
[Wed Apr 10 12:58:56 2024] avahi-daemon invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Wed Apr 10 12:58:56 2024] oom_kill_process.cold+0xb/0x10
[Wed Apr 10 12:58:56 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Apr 10 12:58:56 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-44.scope,task=python,pid=9454,uid=1001
[Wed Apr 10 12:58:56 2024] Out of memory: Killed process 9454 (python) total-vm:55680528kB, anon-rss:29109204kB, file-rss:67472kB, shmem-rss:5084kB, UID:1001 pgtables:61380kB oom_score_adj:0
[Wed Apr 10 15:30:42 2024] Timer invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=100
[Wed Apr 10 15:30:42 2024] oom_kill_process.cold+0xb/0x10
[Wed Apr 10 15:30:42 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Apr 10 15:30:42 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-50.scope,task=python,pid=11056,uid=1001
[Wed Apr 10 15:30:42 2024] Out of memory: Killed process 11056 (python) total-vm:55668556kB, anon-rss:29041684kB, file-rss:66640kB, shmem-rss:2448kB, UID:1001 pgtables:61048kB oom_score_adj:0

The dmesg output above shows the kernel's OOM killer terminating python processes after they grew to roughly 28-29 GB of resident memory, on a machine with 31 GB of RAM and only 1 GB of swap. There may simply be too little system RAM on this computer for this particular refinement job.
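
As a rough back-of-the-envelope check (a sketch only, not cryoSPARC's actual memory accounting, and the 2x padding factor below is an assumption rather than a documented value), the host-memory footprint of a refinement grows with the cube of the box size, which is why a 512 px box is so much heavier than a smaller one:

    # Rule-of-thumb size of one cubic single-precision volume in host memory.
    # Real peak usage depends on padding, FFT work buffers, and how many
    # volumes (half-maps, masks, working copies) are held at once.
    def volume_gib(box_px, bytes_per_voxel=4):
        return box_px ** 3 * bytes_per_voxel / 2 ** 30

    for box in (256, 400, 512):
        plain = volume_gib(box)
        padded = volume_gib(2 * box)  # assumed ~2x zero-padding; not a confirmed cryoSPARC value
        print(f"box {box:>3} px: {plain:4.2f} GiB per volume, "
              f"{padded:5.2f} GiB if padded to {2 * box} px")

    # box 256 px: 0.06 GiB per volume,  0.50 GiB if padded to 512 px
    # box 400 px: 0.24 GiB per volume,  1.91 GiB if padded to 800 px
    # box 512 px: 0.50 GiB per volume,  4.00 GiB if padded to 1024 px

A refinement holds several such arrays at once, so a peak in the tens of GB at 512 px, like the ~28 GB resident sizes in the OOM records above, is not surprising.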

Our group has run homogeneous refinements on this computer before without seeing this issue. Could it be something about my dataset that is causing it?

Hello @wtempel

I was able to resolve this issue by re-extracting the particles with a smaller box size in the Extract From Micrographs job.
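
For anyone else who needs to shrink their box for the same reason: a common rule of thumb (general single-particle practice, not a cryoSPARC-specific requirement; the particle diameter and pixel size below are made-up placeholders, not values from this dataset) is to keep the box at roughly 1.5-2x the particle diameter so that delocalized CTF signal still fits inside it:

    import math

    particle_diameter_A = 200.0   # placeholder particle diameter in Angstroms
    pixel_size_A = 0.85           # placeholder pixel size in Angstroms/pixel

    # Rule of thumb: box edge about 1.5-2x the particle diameter.
    min_box = math.ceil(1.5 * particle_diameter_A / pixel_size_A)
    roomy_box = math.ceil(2.0 * particle_diameter_A / pixel_size_A)

    print(f"tight box   ~{min_box} px")    # ~353 px for these placeholders
    print(f"roomier box ~{roomy_box} px")  # ~471 px

    # Host-memory footprint relative to the original 512 px box (scales as box^3):
    for box in (min_box, roomy_box):
        print(f"{box} px -> ~{(box / 512) ** 3:.0%} of the 512 px footprint")

Re-extracting with a smaller box at the same pixel size leaves the Nyquist limit unchanged; it only trims the field of view around each particle, which is what brings the refinement's memory use down.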

Thanks for your help