3D Flex refinement "Job process terminated abnormally"

Hi,

I was running 3D Flex refinement on a smaller dataset (box size 400, ~9K particles), and it ran through. I would like to try an unbinned dataset with more particles (box size 506, ~16K particles). At first we were not able to run it because of a memory issue; after we made more GPU memory available it could run, but it got terminated in the middle and gave us the following message:


RUNNING THE L-BFGS-B CODE

       * * *

Machine precision = 2.220D-16
N = 129554216 M = 10
This problem is unconstrained.

At X0 0 variables are exactly at the bounds

At iterate 0 f= 2.05504D+09 |proj g|= 2.99145D+03

python[53182]: segfault at 7f5a39578df0 ip 00007f7470618ff8 sp 00007ffd8f2724d0 error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f7470609000+17000]
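
A side note on the numbers in this log: N = 129554216 is exactly 506³, i.e. one optimization variable per voxel of the unbinned box; that is an inference from the printed values, not something stated in the CryoSPARC documentation. Below is a minimal sketch of the double-precision workspace that the L-BFGS-B 3.0 reference code (which SciPy wraps) requests for a problem of this size:

    n = 506 ** 3        # 129554216, matching "N = 129554216" in the log above
    m = 10              # matching "M = 10" (number of stored correction pairs)

    # workspace size per the L-BFGS-B 3.0 documentation: 2mn + 5n + 11m^2 + 8m doubles
    wa = 2 * m * n + 5 * n + 11 * m * m + 8 * m
    print(f"workspace: {wa:,} doubles = ~{wa * 8 / 1e9:.1f} GB")
    print(f"exceeds 2**31 - 1 elements: {wa > 2**31 - 1}")

If the Fortran side indexes that workspace with 32-bit integers, as older builds do, crossing the 2³¹ element mark would be one plausible, though unconfirmed, way to get a segfault only at large box sizes.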

Looking forward to any suggestions!

Thanks!
Kai

Hello @ksheng. Please can you email us the job report for this job.

Using the cluster JOBID (see Event Log), can you find out if the job was killed by the cluster resource manager and, if so, why?
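
For example, on a Slurm cluster with accounting enabled (a generic sketch; the sacct field names may vary by site), something like this shows whether the scheduler recorded the job as OUT_OF_MEMORY, CANCELLED, etc.:

    import subprocess

    jobid = "123456"  # placeholder: substitute the cluster JOBID from the Event Log
    result = subprocess.run(
        ["sacct", "-j", jobid,
         "--format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)
    # A State such as OUT_OF_MEMORY or CANCELLED would point at the resource
    # manager rather than at the job itself.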

I don’t think it was killed by the manager. We need to allocate the job to specific nodes; that is why it was held at first, before we freed up the space.

This may be worth confirming against the cluster's historical job records/logs.
I wonder if increasing the #SBATCH --mem= option from the current configuration (how? 1, 2) might help?

We tried on a node with 256 GB RAM and an RTX A6000 with 46 GB of GPU memory. It failed with a Python error (probably because it was running on AMD EPYC?).

Here is the error message:
/var/log/messages-20230903:Aug 30 11:05:15 nodeb319 kernel: python[28860]: segfault at 7f3ff9ca0df0 ip 00007f5a2b5e8ff8 sp 00007ffcfff3a410 error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f5a2b5d9000+17000]
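
If the CPU model is a suspect, it may help to record the environment on each node where the job was tried (both failing and working) and compare. A generic sketch, not a CryoSPARC-specific command, run inside the Python environment that executes the job:

    import platform
    import numpy
    import scipy

    print("Platform:", platform.platform())
    print("Python  :", platform.python_version())
    print("NumPy   :", numpy.__version__)
    print("SciPy   :", scipy.__version__)
    # CPU model string (Linux-specific)
    with open("/proc/cpuinfo") as fh:
        model = next(line.split(":", 1)[1].strip()
                     for line in fh if line.startswith("model name"))
    print("CPU     :", model)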

Your cluster manager may kill a job if that job exceeds the (possibly much smaller) RAM size specified by the #SBATCH --mem= option inside the cluster script. This may not be the root cause, but I would like to see this possibility eliminated first.

The name of the library in the kernel message suggests a connection to the job. I will check with our team about this lead. However, the timestamp of this kernel message does not match the timestamp of the abnormal job termination as closely as I would expect, even allowing for possibly different time zone conventions of the timestamps.

Sorry for the confusion; we ran this job and similar jobs a couple of times, and they have all led to the same situation.

A brief update: The Structura team has seen segmentation faults in 3DFlex like the one you described before. Unfortunately, we have not yet established a pattern that would help us identify the root cause, which may be related to the platform on which a job is running. Do you have access to other compute nodes with, perhaps, a different x86-64 CPU model or a different OS (or OS version) where you could try the job?
Also ensure that your cluster script template does not restrict RAM usage more tightly than you intended.
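
One way to narrow this down on any candidate node (a standalone sketch, not a Structura-provided test): run SciPy's L-BFGS-B on a trivial unconstrained problem with the same number of variables as the failing job, inside the same Python environment. If this also dies inside _lbfgsb*.so, the problem lies in the SciPy/Fortran layer rather than in 3D Flex itself. At n = 506³ it needs roughly 30 GB of free system RAM:

    import numpy as np
    from scipy.optimize import minimize

    n = 506 ** 3  # same N as in the failing job's log

    def f(x):
        # cheap convex quadratic so value and gradient cost almost nothing
        return 0.5 * float(np.dot(x, x))

    def grad(x):
        return x

    x0 = np.ones(n, dtype=np.float64)
    res = minimize(f, x0, jac=grad, method="L-BFGS-B",
                   options={"maxcor": 10, "maxiter": 3})
    print(res.status, res.message)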

We are also experiencing this issue on a standalone machine with 500 GB RAM and four NVIDIA GeForce RTX 3090 GPUs with 24 GB of memory each.

OS: CentOS Linux 7
Kernel: Linux 3.10.0-1160.71.1.el7.x86_64

We are running CryoSPARC v4.5.1

At a box size of 520 we see the "Job process terminated abnormally" message.
The particle stack is quite large, as we symmetry expanded (900k particles). However, this does not seem to be related to particle number, as running Flex Reconstruct on a 10k subset also fails in the same manner.

From a similar thread it seems the box size can be the issue. Using the binned blob from Flex Train (128 px), Flex Reconstruct runs fine. Very encouragingly, the GSFSC curve suggests we have a lot of resolution to gain once we unbin.

So we have attempted 2x Bin (260 box) and this works!

But ultimately it would be great to run this without binning, because from what I can tell this particle is a perfect use case for Flex Refine and the initial results are very promising! We’re testing some tighter cropping of the particle and hopefully this will help…
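
Assuming, as the first log in this thread suggests, that the optimizer carries one variable per voxel (n = box³), a quick comparison of the box sizes mentioned here shows how steeply the L-BFGS-B workspace grows and why binning or tighter cropping helps. Same caveat as above: this is an inference, not a confirmed description of the 3D Flex internals.

    m = 10  # correction pairs kept by L-BFGS-B, as in the job log
    for box in (128, 260, 400, 506, 520):
        n = box ** 3
        wa = 2 * m * n + 5 * n + 11 * m * m + 8 * m   # doubles
        flag = "  <-- exceeds 2**31 - 1 elements" if wa > 2**31 - 1 else ""
        print(f"box {box:4d}: n = {n:>12,}  workspace = {wa * 8 / 1e9:6.1f} GB{flag}")

On these numbers, the box sizes reported to work (128, 260, 400) stay below the 2³¹ element mark while the failing ones (506, 520) exceed it, which is at least consistent with a 32-bit indexing limit in the Fortran workspace; the developers would need to confirm whether that is really what is happening.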

Any further suggestions on this topic? Or would any more information on our end help?

Thanks!

@maxm Thank you for reporting your observations. Please can you post

  • error messages from the Event Log
  • error messages from the job log (Metadata|Log)
  • the output of the command
    sudo journalctl | grep -i oom