3D Flex refinement "Job process terminated abnormally"

Hi,

I was running 3D Flex refinement on a smaller dataset (box size 400, ~9K particles), and it ran through. I would now like to try an unbinned dataset with more particles (box size 506, ~16K particles). At first we could not run it because of a memory issue; after we allocated more GPU memory, the job ran but was terminated partway through with the following message:


RUNNING THE L-BFGS-B CODE

       * * *

Machine precision = 2.220D-16
N = 129554216 M = 10
This problem is unconstrained.

At X0 0 variables are exactly at the bounds

At iterate 0 f= 2.05504D+09 |proj g|= 2.99145D+03

python[53182]: segfault at 7f5a39578df0 ip 00007f7470618ff8 sp 00007ffd8f2724d0 error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f7470609000+17000]

Looking forward to any suggestions!

Thanks!
Kai

Hello @ksheng. Please can you email us the job report for this job.

Using the cluster JOBID (see Event Log), can you find out if the job was killed by the cluster resource manager and, if so, why?
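
If your cluster runs SLURM (guessing from the #SBATCH options that come up later in this thread; the JOBID is a placeholder), the accounting records should show how the job ended, for example:

sacct -j <JOBID> --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem,Elapsed
# State=OUT_OF_MEMORY or CANCELLED would point at the resource manager;
# State=FAILED with a non-zero ExitCode is more consistent with a crash inside the job.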

I don’t think it was killed by the manager. We need to allocate the job to specific nodes; that is why it was held at first, before we freed up space.

This may be worth confirming against the cluster's job history records/logs.
I also wonder whether increasing the #SBATCH --mem= option from the current configuration (how? 1, 2) might help.
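
For illustration only, the relevant line in the submission script might look like the following; the 512G figure is made up, and in a cryoSPARC cluster script template this value is usually generated from a template variable rather than hard-coded:

#SBATCH --mem=512G   # hypothetical value; must not exceed the physical RAM of the target node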

We tried a node with 256 GB RAM and an RTX A6000 with 46 GB of GPU memory. It failed with a Python error (possibly because it was running on an AMD EPYC CPU?).

Here is the error message:
/var/log/messages-20230903:Aug 30 11:05:15 nodeb319 kernel: python[28860]: segfault at 7f3ff9ca0df0 ip 00007f5a2b5e8ff8 sp 00007ffcfff3a410 error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f5a2b5d9000+17000]

Your cluster manager may kill a job if that job exceeds the RAM limit specified by the #SBATCH --mem= option inside the cluster script, which may be much smaller than the node's physical RAM. This may not be the root cause, but I would like to see this possibility eliminated first.
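
If the seff utility (a SLURM contrib tool, not installed everywhere) happens to be available, it also gives a quick requested-versus-used summary for a finished job; the JOBID is again a placeholder:

seff <JOBID>
# "Memory Utilized" at or near the requested amount would support the
# out-of-memory explanation; a value well below it would argue against it.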

The name of the library in that kernel message, _lbfgsb (SciPy's compiled L-BFGS-B extension), suggests a connection to the job. I will check with our team about this lead. That said, the timestamp of the kernel message does not match the timestamp of the abnormal job termination as closely as I would expect, even allowing for the timestamps possibly using different time zone conventions.

Sorry for the confusion; we ran this job and similar jobs a couple of times, and they all led to the same situation.

A brief update: the Structura team has previously seen segmentation faults in 3D Flex like the one you described. Unfortunately, we have not yet established a pattern that would help us identify the root cause, which may be related to the platform on which a job runs. Do you have access to other compute nodes, perhaps with a different x86-64 CPU model or a different OS (or OS version), where you could try the job?
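
If it would help to compare nodes quickly before committing to a full 3D Flex run, here is a minimal sketch of a throwaway SLURM test job that exercises SciPy's compiled L-BFGS-B extension (the job name, resource requests and interpreter path are placeholders, and 3D Flex's own optimization problem is far larger, so a clean run here would not rule the problem out):

#!/bin/bash
#SBATCH --job-name=lbfgsb_smoke_test   # hypothetical throwaway job
#SBATCH --mem=8G
#SBATCH --time=00:05:00

# Run a small unconstrained problem through the same SciPy L-BFGS-B extension
# (_lbfgsb) that appears in the segfault messages. If possible, use the Python
# interpreter from the cryoSPARC worker environment.
python - <<'EOF'
import numpy as np
from scipy.optimize import minimize

# Simple quadratic with its minimum at x = 1 in every coordinate.
def f(x):
    return float(np.sum((x - 1.0) ** 2))

def grad(x):
    return 2.0 * (x - 1.0)

x0 = np.zeros(100000)
res = minimize(f, x0, jac=grad, method="L-BFGS-B")
print("success:", res.success, "f(x*) =", res.fun)
EOF
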
Also ensure that your cluster script template does not restrict RAM usage more tightly than you intended.