I ran 3D Flex refinement on a smaller dataset (box size 400, 9K particles), and it ran to completion. I would now like to try an unbinned dataset with more particles (box size 506, 16K particles). At first the job could not run because of a memory issue. After we added more memory to the GPU, it ran but was terminated partway through, giving us the following message:
We tried a node with 256 GB RAM and an RTX A6000 with 46 GB of GPU memory. It failed with a Python error (probably because it was running on an AMD EPYC CPU?).
Here is the error message:
/var/log/messages-20230903:Aug 30 11:05:15 nodeb319 kernel: python: segfault at 7f3ff9ca0df0 ip 00007f5a2b5e8ff8 sp 00007ffcfff3a410 error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f5a2b5d9000+17000]
Your cluster manager may kill a job if that job exceeds the (possibly much smaller) RAM size specified by the #SBATCH --mem= option inside the cluster script. This may not be the root cause, but I would like to see this possibility eliminated first.
The name of the library suggests a connection to the job. I will check with our team about this lead. The timestamp of this kernel message does not match the timestamp of the abnormal job termination as closely as I would expect, even allowing for possibly different time zone conventions of the timestamps.
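For reference, the fields of that kernel line can be decoded with a few lines of Python. This is only a minimal sketch: the helper name `decode_segfault` is made up for illustration, and it assumes the usual x86-64 Linux format, in which the bracketed pair is the library's mapping base and size, and the hexadecimal error code is the page-fault bit mask (bit 0: protection violation, bit 1: write access, bit 2: user mode). The file name itself matches SciPy's L-BFGS-B extension module.

```python
import re

# Kernel line copied from the post above (log prefix omitted).
LINE = (
    "python: segfault at 7f3ff9ca0df0 ip 00007f5a2b5e8ff8 sp 00007ffcfff3a410 "
    "error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f5a2b5d9000+17000]"
)

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into readable fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    return {
        "library": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": hex(ip - base),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "cause": "protection violation" if err & 1 else "page not present",
    }


info = decode_segfault(LINE)
for key, value in info.items():
    print(f"{key}: {value}")
```

The offset is only useful for comparing reports from different nodes (the same offset would suggest the same faulting instruction, assuming the same library build); it does not identify the root cause by itself.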
A brief update: The Structura team has seen segmentation faults in 3DFlex like the one you described before. Unfortunately, we have not yet established a pattern that would help us identify the root cause, which may be related to the platform on which a job is running. Do you have access to other compute nodes with, perhaps, a different x86-64 CPU model or a different OS (or OS version) where you could try the job?
Also ensure that your cluster script template does not restrict RAM usage more tightly than you intended.
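One way to check this is to read the memory request out of the submission script that was actually generated for the job. The sketch below is an assumption-laden illustration, not part of CryoSPARC: the function name `sbatch_mem_mb` and the sample script text are hypothetical, and it relies only on the documented Slurm convention that `--mem` is given in megabytes unless a K, M, G or T suffix is present.

```python
import re

# Conversion factors to megabytes for Slurm's --mem suffixes (default unit is MB).
UNIT_TO_MB = {"": 1, "K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}

MEM_LINE = re.compile(r"^#SBATCH\s+--mem[=\s]\s*(\d+)([KMGT]?)", re.IGNORECASE)


def sbatch_mem_mb(script_text):
    """Return the first '#SBATCH --mem' request in MB, or None if there is none."""
    for line in script_text.splitlines():
        m = MEM_LINE.match(line.strip())
        if m:
            return int(m.group(1)) * UNIT_TO_MB[m.group(2).upper()]
    return None


# Hypothetical rendered script, for illustration only.
sample_script = """#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=24G
"""

requested_mb = sbatch_mem_mb(sample_script)
node_ram_mb = 256 * 1024  # the 256 GB node mentioned above
print(f"requested {requested_mb} MB of {node_ram_mb} MB on the node")
```

If the requested value is far below the node's physical RAM, the scheduler's limit rather than the hardware may be what the job is running into.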