3DFLEX reconstruction fails

Hi, the 3D Flex Reconstruction job fails consistently on several nodes with different OSes, numbers of CPUs, GPUs, and amounts of RAM. From the job log file (attached) it looks like the problem is in the L-BFGS-B library. I am running the job on CryoSPARC version 4.7.1. The error in the job window just says 'Job process terminated abnormally.' The job crashes right away at iteration 0. Here is the last text in the job window before it crashes:
“Starting L-BFGS.
[2025-07-12 2:40:24.56]
[CPU: 3.00 GB Avail: 58.34 GB]
Reconstructing half-map A
[2025-07-12 2:40:24.57]
[CPU: 3.00 GB Avail: 58.34 GB]
Iteration 0 : 11000 / 11486 particles”
This type of job ran successfully before; I am not sure what is different now.
Thank you, Michael

================= CRYOSPARCW =======  2025-07-12 02:36:57.453937  =========
Project PYYY Job J299
Master cryosparc.host.XXXX Port 39002
===========================================================================
MAIN PROCESS PID 2390072
========= now starting main process at 2025-07-12 02:36:57.454404
flex_refine.run_highres cryosparc_compute.jobs.jobregister
MONITOR PROCESS PID 2390074
========= monitor process now waiting for main process
========= sending heartbeat at 2025-07-12 02:36:59.822120
========= sending heartbeat at 2025-07-12 02:37:09.837006
<string>:1: DeprecationWarning: Please import `map_coordinates` from the `scipy.ndimage` namespace; the `scipy.ndimage.interpolation` namespace is deprecated and will be removed in SciPy 2.0.0.
========= sending heartbeat at 2025-07-12 02:37:19.851340
========= sending heartbeat at 2025-07-12 02:37:29.866546
***************************************************************
Transparent hugepages setting: [always] madvise never

Running job  J299  of type  flex_highres
Running job on hostname %s vds1-2.ZZZ.edu
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'vds1-2.ZZZ.edu', 'lane': 'vds12', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}, 'target': {'cache_path': '/mnt/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25651445760, 'name': 'Quadro M6000 24GB'}], 'hostname': 'vds1-2.ZZZ.edu', 'lane': 'vds12', 'monitor_port': None, 'name': 'vds1-2.ZZZ.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}, 'ssh_str': 'cryosparc@vds1-2.ZZZ.edu', 'title': 'Worker node vds1-2.ZZZ.edu', 'type': 'node', 'worker_bin_path': '/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}}
2025-07-12 02:37:30,741 run_with_executor    INFO     | Resolving 5 source path(s) for caching
2025-07-12 02:37:30,744 run_with_executor    INFO     | Resolved 5 sources in 0.00 seconds
2025-07-12 02:37:30,774 allocate             INFO     | Cache allocation start. Active run IDs: P324-J3128-1752288435, P356-J99-1752188587, P324-J3142-1752200292, P291-J299-1752315930, P369-J16-1752283829, P369-J17-1752283831, P369-J18-1752283965, P367-J112-1752313630
2025-07-12 02:37:30,921 refresh              INFO     | Refreshed cache drive in 0.15 seconds
2025-07-12 02:37:30,924 allocate             INFO     | Deleted 0 cached files, encountered 0 errors
2025-07-12 02:37:30,925 allocate             INFO     | Allocated 5 stub cache files; creating links
2025-07-12 02:37:30,925 allocate             INFO     | Cache allocation complete
2025-07-12 02:37:30,925 run_with_executor    INFO     | Cache allocation ran in 0.16 seconds
2025-07-12 02:37:30,925 run_with_executor    INFO     | Found 0 SSD hit(s)
2025-07-12 02:37:30,925 run_with_executor    INFO     | Transferring 5 file(s)...
========= sending heartbeat at 2025-07-12 02:37:39.881532
[... heartbeat lines repeated every 10 s through 02:40:20 ...]
2025-07-12 02:40:24,024 run_with_executor    INFO     | Transferred /mnt/gimli/data1/CS-ryadel-thawed/J293/J293_particles_fullres_batch_00001.mrc to SSD key 5cd2792b66f7ab256ae52c477447385e66bb67ad
2025-07-12 02:40:24,033 run_with_executor    INFO     | Transferred /mnt/gimli/data1/CS-ryadel-thawed/J293/J293_particles_fullres_batch_00002.mrc to SSD key 024f0e3472f304fc4fe61419fa6b389472db4c4d
2025-07-12 02:40:24,034 run_with_executor    INFO     | Transferred /mnt/gimli/data1/CS-ryadel-thawed/J293/J293_particles_fullres_batch_00000.mrc to SSD key f421d75b434df829ea1f3924bedbdc2b6e724434
2025-07-12 02:40:24,036 run_with_executor    INFO     | Transferred /mnt/gimli/data1/CS-ryadel-thawed/J293/J293_particles_fullres_batch_00003.mrc to SSD key 9ff70eea41fce6f7703e500410233fa50206c9e6
2025-07-12 02:40:24,078 run_with_executor    INFO     | Transferred /mnt/gimli/data1/CS-ryadel-thawed/J293/J293_particles_fullres_batch_00004.mrc to SSD key 8dc9e1eb6ad0da477f4b34332622f99e7b49f429
2025-07-12 02:40:24,080 run_with_executor    INFO     | Unlocked 5 file(s)
2025-07-12 02:40:24,080 run_with_executor    INFO     | Requested files successfully cached to SSD
2025-07-12 02:40:24,086 run_with_executor    INFO     | SSD cache complete
<string>:1: DeprecationWarning: Please import `fmin_l_bfgs_b` from the `scipy.optimize` namespace; the `scipy.optimize.lbfgsb` namespace is deprecated and will be removed in SciPy 2.0.0.
========= sending heartbeat at 2025-07-12 02:40:30.126824
========= sending heartbeat at 2025-07-12 02:40:40.141551
WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade
========= sending heartbeat at 2025-07-12 02:40:50.157184
[... heartbeat lines repeated every 10 s through 02:47:10 ...]
Received SIGSEGV (addr=00007f322db180b0)
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f3da1838a03]
/lib64/libpthread.so.0(+0x12990)[0x7f3db8037990]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0x975b)[0x7f3da144e75b]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0xf822)[0x7f3da1454822]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0x10a7f)[0x7f3da1455a7f]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0x4794)[0x7f3da1449794]
python(_PyObject_MakeTpCall+0x26b)[0x560adba10a6b]
python(_PyEval_EvalFrameDefault+0x54a6)[0x560adba0c9d6]
python(_PyFunction_Vectorcall+0x6c)[0x560adba17a2c]
python(PyObject_Call+0xbc)[0x560adba23f1c]
python(_PyEval_EvalFrameDefault+0x2d83)[0x560adba0a2b3]
python(_PyFunction_Vectorcall+0x6c)[0x560adba17a2c]
python(PyVectorcall_Call+0xc5)[0x560adba24295]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/flex_refine/flexmod.cpython-310-x86_64-linux-gnu.so(+0x94e30)[0x7f3d87d70e30]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xd224)[0x7f3db86a6224]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/flex_refine/run_highres.cpython-310-x86_64-linux-gnu.so(+0xc2fe)[0x7f3d9f5b72fe]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/flex_refine/run_highres.cpython-310-x86_64-linux-gnu.so(+0x2f717)[0x7f3d9f5da717]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x2399a)[0x7f3db86bc99a]
/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x15581)[0x7f3db86ae581]
python(_PyEval_EvalFrameDefault+0x4c12)[0x560adba0c142]
python(+0x1d7c60)[0x560adbaaac60]
python(PyEval_EvalCode+0x87)[0x560adbaaaba7]
python(+0x20812a)[0x560adbadb12a]
python(+0x203523)[0x560adbad6523]
python(PyRun_StringFlags+0x7d)[0x560adbace91d]
python(PyRun_SimpleStringFlags+0x3c)[0x560adbace75c]
python(Py_RunMain+0x26b)[0x560adbacd66b]
python(Py_BytesMain+0x37)[0x560adba9e1f7]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f3db74ff7e5]
python(+0x1cb0f1)[0x560adba9e0f1]
rax 0000000000000001  rbx 00007f397fe2dfb0  rcx 00000000056e7508  rdx 0000000000000000  
rsi 00000000056e7508  rdi 00000000056e7507  rbp 00007f35ae387010  rsp 00007fff2e930dd0  
r8  00000000056e7508  r9  00007f322db180b0  r10 00000000056e7508  r11 0000000000000001  
r12 00007f35ef65f010  r13 00007f37a1ea8290  r14 fffffffffa918af7  r15 00007f322db180b8  
0f af d6 66 0f 28 d1 4c 01 f2 0f 1f 00 48 63 7c 8d 00 f2 0f 10 04 cb 48 ff c1 48 01 d7
f2 41 0f 10 74 fd 00 f2 0f 59 f0 f2 41 0f 59 04 fc f2 0f 58 d6 f2 0f 58 c8 4c 39 c1 75
d2 f2 0f 59 cb 99
-->   f2 41 0f 11 11 f7 3c 24 f2 43 0f 11 0c d9 49 83 c1 08 8d 42 01 4d 39 f9 75 93 8b
b4 24 a8 00 00 00 4c 8b 74 24 70 44 8b 14 24 8d 04 36 4c 8b 4c 24 08 4c 89 f1 48 8d 94
24 b0 00 00 00 4c 8d bc

========= main process now complete at 2025-07-12 02:47:20.760007.
========= monitor process now complete at 2025-07-12 02:47:20.789438.

I forgot to mention that CryoSPARC was installed on both Rocky 8 and Rocky 9 systems. The job failed on both.
Michael

Are both systems the same spec?

Despite optimisations in recent releases, 3D Flex eats memory, and 64 GB may well not be enough. Check dmesg to see whether you have any memory or OOM warnings/errors.

Thanks for the reply @rbs_sci. The log file is from my smallest system, with 64 GB of RAM. The job failed on systems with 512 GB, 648 GB, and even 1 TB of RAM. I checked dmesg on the 64 GB system; there were no OOM or memory warnings there. The 3D Flex job fails right at the start of the L-BFGS-B algorithm (_lbfgsb in the error messages in the log). And as you saw, the number of images was small.
I have another job type failing now, the RBMC one, again even on the 1 TB system with 8x A6000 (48 GB of VRAM each). I will probably open another topic for that one. The annoying thing about the latter is that it fails after running for several iterations, taking 16-20 hours before failure.
Michael
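
Since the traceback lands inside SciPy's `_lbfgsb` extension module, one way to narrow things down is to check whether the worker environment's L-BFGS-B build crashes even on a trivial problem. Below is a generic sanity-check sketch, not CryoSPARC's actual call into the optimizer; it could be run with the worker's Python (e.g. via `cryosparcw call python ...` or by activating the worker conda environment). If this segfaults too, the problem is in the SciPy/Fortran runtime of the worker env rather than in 3D Flex itself.

```python
# Minimal sanity check of scipy.optimize's L-BFGS-B routine, the same
# compiled module (_lbfgsb) that appears in the SIGSEGV backtrace.
# This is a generic test problem, unrelated to CryoSPARC's own usage.
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def f(x):
    # Simple convex quadratic with a known minimum at (1, 2).
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

def grad(x):
    # Analytic gradient of f, so L-BFGS-B runs its compiled core directly.
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] - 2.0)])

x_opt, f_opt, info = fmin_l_bfgs_b(f, np.zeros(2), fprime=grad)
print("minimum found at:", x_opt, "warnflag:", info["warnflag"])
```

A clean run should converge to (1, 2) with `warnflag` 0; a crash here would point at the worker environment's SciPy installation (or its BLAS/Fortran runtime) rather than at the 3D Flex job.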