====== Job process terminated abnormally in Non-Uniform Refinement

Hi,
I am running a final Non-Uniform Refinement in version 3.1 with a box size of 460 pixels (0.453 Å/pixel). The job ends with the following error. I guess it's a memory issue. I tried reducing the GPU batch size of images to 1, but it still fails. Is there any way to overcome this problem?
Thanks

[CPU: 3.62 GB] Using Max Alignment Radius 6.948 (30.000A)
[CPU: 3.62 GB] Auto batchsize: 4625 in each split
[CPU: 5.12 GB] – THR 0 BATCH 3 NUM 1158 TOTAL 30.483477 ELAPSED 132.19477 –
[CPU: 8.40 GB] Processed 9250.000 images in 133.928s.
[CPU: 9.17 GB] Computing FSCs…
[CPU: 10.62 GB] Done in 100.312s
[CPU: 10.62 GB] Using Filter Radius 35.878 (5.810A) | Previous: 6.948 (30.000A)
[CPU: 13.53 GB] Running local cross validation for A …
[CPU: 24.3 MB] ====== Job process terminated abnormally.
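
The `[CPU: ...]` prefix on each event-log line reports the job's CPU memory at that point, so the largest reading before the termination message hints at how much RAM the job needed. Below is a minimal Python sketch that pulls those readings out of the lines pasted above. The function name `cpu_readings_gb` and the MB-to-GB factor of 1/1024 are assumptions made for illustration, not part of cryoSPARC.

```python
import re

# Event-log lines copied verbatim from the post above.
EVENT_LOG = """\
[CPU: 3.62 GB] Using Max Alignment Radius 6.948 (30.000A)
[CPU: 3.62 GB] Auto batchsize: 4625 in each split
[CPU: 5.12 GB] – THR 0 BATCH 3 NUM 1158 TOTAL 30.483477 ELAPSED 132.19477 –
[CPU: 8.40 GB] Processed 9250.000 images in 133.928s.
[CPU: 9.17 GB] Computing FSCs…
[CPU: 10.62 GB] Done in 100.312s
[CPU: 10.62 GB] Using Filter Radius 35.878 (5.810A) | Previous: 6.948 (30.000A)
[CPU: 13.53 GB] Running local cross validation for A …
[CPU: 24.3 MB] ====== Job process terminated abnormally.
"""

# Assumption: 1 GB = 1024 MB; the log itself does not say which convention it uses.
UNIT_TO_GB = {"MB": 1 / 1024, "GB": 1.0}
CPU_PREFIX = re.compile(r"\[CPU:\s*([\d.]+)\s*(MB|GB)\]")


def cpu_readings_gb(log_text):
    """Return a list of (memory_in_GB, original_line) for each line with a [CPU: ...] prefix."""
    readings = []
    for line in log_text.splitlines():
        match = CPU_PREFIX.search(line)
        if match:
            value, unit = float(match.group(1)), match.group(2)
            readings.append((value * UNIT_TO_GB[unit], line))
    return readings


readings = cpu_readings_gb(EVENT_LOG)
peak_gb, peak_line = max(readings, key=lambda reading: reading[0])
print(f"Peak CPU memory reported: {peak_gb:.2f} GB")
print(f"Reported on line: {peak_line}")
```

The peak is only the last value logged before the process died; the real usage at the moment of termination may have been higher.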

Hi @structure, it looks like the job may have been killed by another user or process, or killed by the cluster scheduler because it ran out of memory. How much system RAM do you have and what kind of GPUs are you using?
The job log may also have more information.
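
To answer the system RAM question on a Linux node, total and available memory can be read from `/proc/meminfo`. This is a small Python sketch, assuming a Linux host; the helper names `parse_meminfo`, `kb_to_gib` and `report` are made up for this example.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are sizes in kB; a few (for example HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def kb_to_gib(kb):
    """Convert a kB value from /proc/meminfo to GiB."""
    return kb / (1024 ** 2)


def report(path="/proc/meminfo"):
    """Print total and available RAM if the file exists (Linux only)."""
    meminfo_file = Path(path)
    if not meminfo_file.exists():
        print(f"{path} not found; this sketch only works on Linux.")
        return
    fields = parse_meminfo(meminfo_file.read_text())
    for key in ("MemTotal", "MemAvailable"):
        if key in fields:
            print(f"{key}: {kb_to_gib(fields[key]):.1f} GiB")


if __name__ == "__main__":
    report()
```

On a cluster, run it on the compute node where the job was scheduled, not on the login node, since the two can have very different amounts of RAM.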

Hi, @spunjani,

The above job was running on a node with the following configuration: 4 RTX 6000s (24 GB), CPU features: AVX, AVX2, AVX512. I tried both homogeneous and non-uniform refinement, and both fail with a similar message.

Log file ends like this:
========= sending heartbeat

========= main process now complete.

========= monitor process now complete.