Hi,
I had a 3D classification job that had been running for 90 h and had completed 420/620 O-EM iterations. Up to that point everything had looked fine: difference maps and volume slices were as expected.
Then, in the second-to-last iteration, something happened and the volumes suddenly looked very different - more contrasty, for want of a better description.
In the final iteration the job crashed, with the following error in the log:
========= sending heartbeat at 2025-02-14 01:01:19.287513
========= sending heartbeat at 2025-02-14 01:01:29.304226
========= sending heartbeat at 2025-02-14 01:01:39.314524
Received SIGSEGV (addr=0000000000000000)
/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f620e7d69f3]
/lib64/libpthread.so.0(+0xf630)[0x7f621e4dc630]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x281ac8)[0x7f61fec0cac8]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x302f5c)[0x7f61fec8df5c]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x181ee5)[0x7f61feb0cee5]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x1349d1)[0x7f61feabf9d1]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x1325df)[0x7f61feabd5df]
/lib64/libpthread.so.0(+0x7ea5)[0x7f621e4d4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f621daf4b0d]
rax 0000000000000000 rbx 00007f582c6e5020 rcx 0000000000000008 rdx 00007f582c6e50f0
rsi 00007f61fee89478 rdi 00007f582c6e5028 rbp 0000000000000010 rsp 00007f5f20918d88
r8 fffffffffffffff8 r9 0000564eabc27380 r10 0000000000000001 r11 0000000000000000
r12 0000000000000000 r13 000000000000000d r14 fffffffffffffff0 r15 0000564ea4387da0
ea 0f 8d 29 05 00 00 48 8d 35 e2 c9 27 00 49 89 e8 48 8d 0c ad 00 00 00 00 48 c1 e5 03
49 f7 d8 49 89 ee 44 0f 28 25 8d e2 26 00 4c 8b 26 49 c1 e0 02 49 f7 de 49 c1 e4 03 0f
1f 80 00 00 00 00
--> 0f 28 00 4a 8d 1c 02 4c 8d 1c 0f 41 0f 28 cc 0f 12 1a 0f 16 1b 41 0f 57 dc f3 0f
12 d0 49 8b 71 10 49 83 c2 02 0f 28 e3 f3 0f 16 c0 44 0f 28 f3 0f 28 58 10 0f 12 3f 41
0f 16 3b 0f 5c e7 48 c1
========= main process now complete at 2025-02-14 01:01:49.332212.
========= monitor process now complete at 2025-02-14 01:01:49.499201.
Thoughts? This was a 40-class run: 6.2M particles, 20 O-EM epochs, batch size 5000, filter resolution 2.5 Å, force hard classification off. In the iteration where things went pear-shaped, I didn't notice the class distribution changing significantly. Some kind of numerical instability?
Cheers
Oli