Flex Reconstruction Failing

hansenbry · September 5, 2024, 1:54pm

Hi, I’ve been trying to run a 3D Flex Reconstruction on a dataset, but it keeps failing with a fairly nondescript message:

====== Job process terminated abnormally.

The box size is 512, so I tried downsampling the data to a box size of 256, with the same result. I also tried limiting the number of particles to 300k. I’m a bit stuck on where to go from here and welcome any suggestions.

wtempel · September 5, 2024, 5:56pm

Could you please post the outputs of these commands?

On the CryoSPARC master host:
```
cryosparcm joblog P99 J199 | tail -n 30
cryosparcm eventlog P99 J199 | tail -n 30
```
(substitute the failing job’s actual project and job IDs for P99 and J199)

On the CryoSPARC worker where the job ran:
```
free -h
sudo journalctl | grep -i oom
```
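
Job logs tend to be dominated by periodic heartbeat lines, so the last 30 lines may contain little else. A minimal sketch for filtering them out before taking the tail, assuming each heartbeat line contains the text "sending heartbeat"; `strip_heartbeats` is a hypothetical helper defined here, not part of CryoSPARC:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CryoSPARC command): read a job log on stdin and
# drop the heartbeat lines so the remaining messages are easier to spot.
# Assumption: heartbeat lines contain the text "sending heartbeat".
strip_heartbeats() {
  # grep -v exits non-zero when nothing is left to print; "|| true" keeps
  # that case from being treated as a failure in scripts using "set -e".
  grep -v 'sending heartbeat' || true
}

# Example usage with the master-host command above
# (substitute the real project and job IDs):
#   cryosparcm joblog P99 J199 | strip_heartbeats | tail -n 30
```

The filter only changes what is displayed; the log itself is untouched.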

hansenbry · September 10, 2024, 2:00pm

Sorry it took me so long to get this to you, but here are the outputs:

[cryo1@ai-rmlcryoprd1 ~]$ cryosparcm joblog P46 J646 | tail -n 30
========= sending heartbeat at 2024-09-04 19:17:19.297183
========= sending heartbeat at 2024-09-04 19:17:29.312589
========= sending heartbeat at 2024-09-04 19:17:39.328624
========= sending heartbeat at 2024-09-04 19:17:49.344805
========= sending heartbeat at 2024-09-04 19:17:59.360472
========= sending heartbeat at 2024-09-04 19:18:09.376554
========= sending heartbeat at 2024-09-04 19:18:19.392352
========= sending heartbeat at 2024-09-04 19:18:29.408222
========= sending heartbeat at 2024-09-04 19:18:39.423971
========= sending heartbeat at 2024-09-04 19:18:49.439714
========= sending heartbeat at 2024-09-04 19:18:59.455397
========= sending heartbeat at 2024-09-04 19:19:09.472774
========= sending heartbeat at 2024-09-04 19:19:19.490515
========= sending heartbeat at 2024-09-04 19:19:29.506251
========= sending heartbeat at 2024-09-04 19:19:39.521974
RUNNING THE L-BFGS-B CODE

       * * *

Machine precision = 2.220D-16
N = 134217728 M = 10
This problem is unconstrained.

At X0 0 variables are exactly at the bounds

At iterate 0 f= 3.93447D+10 |proj g|= 2.56484D+04
========= sending heartbeat at 2024-09-04 19:19:49.537588
========= sending heartbeat at 2024-09-04 19:19:59.554370
========= main process now complete at 2024-09-04 19:20:02.221192.
========= monitor process now complete at 2024-09-04 19:20:02.276084.

[cryo1@ai-rmlcryoprd1 ~]$ cryosparcm eventlog P46 J646 | tail -n 30
[CPU RAM used: 180 MB] GPU : [0]
[CPU RAM used: 180 MB] RAM : [0, 1, 2, 3, 4, 5, 6, 7]
[CPU RAM used: 180 MB] SSD : False
[CPU RAM used: 180 MB] --------------------------------------------------------------
[CPU RAM used: 180 MB] Importing job module for job type flex_highres…
[CPU RAM used: 446 MB] Job ready to run
[CPU RAM used: 446 MB] ***************************************************************
[CPU RAM used: 519 MB] ====== 3D Flex Load Checkpoint =======
[CPU RAM used: 519 MB] Loading checkpoint from J645/J645_train_checkpoint_017600.tar …
[CPU RAM used: 956 MB] Initializing torch…
[CPU RAM used: 956 MB] Initializing model from checkpoint…
Input tetramesh
[CPU RAM used: 1081 MB] Upscaling deformation model to match input volume size…
Upsampled mask
Upsampled tetramesh
[CPU RAM used: 4111 MB] ====== Load particle data =======
[CPU RAM used: 4214 MB] Reading in all particle data on the fly from files…
[CPU RAM used: 4214 MB] Loading a ParticleStack with 300000 items…
[CPU RAM used: 4359 MB] Done.
[CPU RAM used: 4359 MB] Preparing all particle CTF data…
[CPU RAM used: 4360 MB] Parameter “Force re-do GS split” was off. Using input split…
[CPU RAM used: 4360 MB] Split A contains 150000 particles
[CPU RAM used: 4360 MB] Split B contains 150000 particles
[CPU RAM used: 4360 MB] Setting up particle poses…
[CPU RAM used: 4360 MB] ====== High resolution flexible refinement =======
[CPU RAM used: 4360 MB] Max num L-BFGS iterations was set to 20
[CPU RAM used: 4360 MB] Starting L-BFGS.
[CPU RAM used: 4360 MB] Reconstructing half-map A
[CPU RAM used: 4360 MB] Iteration 0 : 149000 / 150000 particles
[CPU RAM used: 190 MB] ====== Job process terminated abnormally.

[******@ai-rmlcpu22 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           1.0T         68G        935G        472M        3.1G        936G
Swap:           15G         30M         15G

[*****@ai-rmlcpu22 ~]$ sudo journalctl | grep -i oom
[sudo] password for ****:
[*****@ai-rmlcpu22 ~]$

hbridges1 · September 25, 2024, 3:15pm

Hi @hansenbry,

Thanks for your question and for sending the outputs of those commands. The error you are seeing may be related to the box size of the reconstruction, as users have reported seeing this error when the box size is larger than 440. We suggest cropping the particles in real space and/or Fourier space so that the box size is at most 440 before they are input to 3D Flex Data Prep, and then re-running the downstream jobs:

1. Run Downsample Particles, specifying a Fourier crop to box size of 440 or less. Consider a Crop / pad to box size of less than 512 if the current 512-pixel box has a “generous margin” around the particles.
2. Confirm with a Homogeneous Reconstruction Only job on the downsampled particles that the real- and Fourier-space crop parameters from the previous step were appropriate.
3. Input the downsampled particles to 3D Flex Data Prep.
4. Use a Volume Tools job to create a soft-padded mask of the consensus volume output by 3D Flex Data Prep.
5. Use the outputs of the preceding steps for 3D Flex Mesh Prep, Training and Reconstruction.

You mentioned that you already tried downsampling the particles to a box size of 256. At what stage were the particles downsampled? If your workflow differed from the one shown above, could you try this one and see if it resolves the issue?
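
As a rough cross-check, the L-BFGS-B header in the joblog output above reports `N = 134217728` variables, which equals 512 × 512 × 512. Assuming N corresponds to the number of voxels being optimized (an interpretation, not something the log states), that run was still operating on a full 512 box. A small sketch comparing voxel counts for the box sizes discussed in this thread; `voxels` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Number of voxels in a cubic volume with the given box size (in pixels).
voxels() {
  local box=$1
  echo $(( box * box * box ))
}

# Compare the original box, the suggested maximum, and the earlier attempt.
for box in 512 440 256; do
  echo "box ${box}: $(voxels "$box") voxels"
done
```

If a re-run after downsampling still shows the 512-box value of N in the job log, the reconstruction is probably not using the downsampled particles.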

hansenbry · September 25, 2024, 7:06pm

When I reduced them to a box size of 256, I only did step 1, so I’ll try your full workflow and let you know how it goes.