3D classification crash

Hi,

I had a 3D classification job that had been running for 90h and had completed 420/620 O-EM iterations. Up until the last couple of iterations everything looked fine - diff maps and volume slices were all as expected.

In the second to last iteration something happened, and all of a sudden the volumes looked very different - more contrasty, for want of a better description.

Then in the last iteration, the job crashed, with the following error in the log:

========= sending heartbeat at 2025-02-14 01:01:19.287513
========= sending heartbeat at 2025-02-14 01:01:29.304226
========= sending heartbeat at 2025-02-14 01:01:39.314524
Received SIGSEGV (addr=0000000000000000)
/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f620e7d69f3]
/lib64/libpthread.so.0(+0xf630)[0x7f621e4dc630]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x281ac8)[0x7f61fec0cac8]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x302f5c)[0x7f61fec8df5c]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x181ee5)[0x7f61feb0cee5]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x1349d1)[0x7f61feabf9d1]
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/pyfftw.cpython-310-x86_64-linux-gnu.so(+0x1325df)[0x7f61feabd5df]
/lib64/libpthread.so.0(+0x7ea5)[0x7f621e4d4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f621daf4b0d]
rax 0000000000000000  rbx 00007f582c6e5020  rcx 0000000000000008  rdx 00007f582c6e50f0  
rsi 00007f61fee89478  rdi 00007f582c6e5028  rbp 0000000000000010  rsp 00007f5f20918d88  
r8  fffffffffffffff8  r9  0000564eabc27380  r10 0000000000000001  r11 0000000000000000  
r12 0000000000000000  r13 000000000000000d  r14 fffffffffffffff0  r15 0000564ea4387da0  
ea 0f 8d 29 05 00 00 48 8d 35 e2 c9 27 00 49 89 e8 48 8d 0c ad 00 00 00 00 48 c1 e5 03
49 f7 d8 49 89 ee 44 0f 28 25 8d e2 26 00 4c 8b 26 49 c1 e0 02 49 f7 de 49 c1 e4 03 0f
1f 80 00 00 00 00
-->   0f 28 00 4a 8d 1c 02 4c 8d 1c 0f 41 0f 28 cc 0f 12 1a 0f 16 1b 41 0f 57 dc f3 0f
12 d0 49 8b 71 10 49 83 c2 02 0f 28 e3 f3 0f 16 c0 44 0f 28 f3 0f 28 58 10 0f 12 3f 41
0f 16 3b 0f 5c e7 48 c1

========= main process now complete at 2025-02-14 01:01:49.332212.
========= monitor process now complete at 2025-02-14 01:01:49.499201.

Thoughts? This was a 40-class run, 6.2M particles, 20 O-EM epochs, batch size 5000, filter res 2.5, force hard classification off. In the iteration where things went pear-shaped I didn’t notice the class distribution changing significantly. Some kind of numerical instability?
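
(For what it's worth, the 620-iteration total is consistent with the 5000 batch size being per class - a quick back-of-the-envelope check below, assuming that interpretation; the variable names are just placeholders, not CryoSPARC parameters.)

import math

# Numbers from this job; assumes the O-EM batch size of 5000 is per class
n_particles = 6_200_000
n_classes = 40
batch_per_class = 5_000
epochs = 20

particles_per_iter = n_classes * batch_per_class               # 200,000 particles per O-EM iteration
iters_per_epoch = math.ceil(n_particles / particles_per_iter)  # 31
total_iters = epochs * iters_per_epoch                         # 620, matching the 420/620 above
print(total_iters)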

Cheers
Oli

@olibclarke Please can you post the output of this command, replacing P99, J199 with the relevant identifiers:

cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"

Sure:

{'_id': 'scrubbed',
 'errors_run': [{'message': 'Job process terminated abnormally.', 'warning': False}],
 'instance_information': {'CUDA_version': '11.8',
                          'available_memory': '146.10GB',
                          'cpu_model': 'AMD Ryzen Threadripper 3970X 32-Core Processor',
                          'driver_version': '12.2',
                          'gpu_info': [{'id': 0, 'mem': 25429999616, 'name': 'NVIDIA GeForce RTX 3090', 'pcie': '0000:01:00'},
                                       {'id': 1, 'mem': 25438126080, 'name': 'NVIDIA GeForce RTX 3090', 'pcie': '0000:4c:00'}],
                          'ofd_hard_limit': 4096,
                          'ofd_soft_limit': 1024,
                          'physical_cores': 32,
                          'platform_architecture': 'x86_64',
                          'platform_node': 'c112384',
                          'platform_release': '3.10.0-1160.92.1.el7.x86_64',
                          'platform_version': '#1 SMP Tue Jun 20 11:48:01 UTC 2023',
                          'total_memory': '251.67GB',
                          'used_memory': '104.14GB'},
 'job_type': 'class_3D',
 'params_spec': {'class3D_N_K': {'value': 40},
                 'class3D_oem_batch_size': {'value': 5000},
                 'class3D_oem_epochs': {'value': 20},
                 'class3D_online_em_lr_hl': {'value': 0},
                 'class3D_online_em_lr_init': {'value': 1},
                 'class3D_target_res': {'value': 2.5},
                 'class3D_use_scales': {'value': 'input'},
                 'random_seed': {'value': 309211458}},
 'project_uid': 'P35',
 'status': 'failed',
 'uid': 'J1845',
 'version': 'v4.6.2'}

(I have scrubbed the license ID)

Hey @olibclarke – could you DM/email us a copy of the job report for this job please?

Thanks!
Valentin
