Hi all,
we are facing an issue during 2D classification on our worker node.
In short:
We start the job, it is queued on the worker and begins running. After the particles have been loaded onto the SSD cache, the worker emits multiple warnings (they hit all CPUs, always with the same python PID):
Message from syslogd@cryo801w at Mar 1 12:18:18 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#44 stuck for 22s! [python:5754]
and eventually the job fails.
I checked the log file of the failed job; 422 of its 466 lines are just "sending heartbeat".
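(For reference, that count came from something like the following, run inside the job directory on the worker; the path to that directory is of course specific to our setup:)
grep -c "sending heartbeat" job.log
wc -l < job.log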
These are the last lines of the job.log:
========= sending heartbeat
**custom thread exception hook caught something
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.
I do not see any messages in the core log regarding this issue.
We do not see a similar issue on any of our other machines (multiple standalone installations); the major difference with the new worker is that it has an AMD CPU rather than an Intel one.
System:
CPU: AMD EPYC 7002
GPU: 4x NVIDIA RTX 3090
Cryosparc: Latest version and latest patch installed on master and worker.
The project folder is on the worker; the master accesses it via SSHFS. Import, motion correction, etc., and also some 2D classification jobs have already finished successfully.
I tried to understand the issue a bit better, so I monitored /var/log/messages on the worker while the job was running (the command is sketched after the excerpt) and captured the following:
Message from syslogd@cryo801w at Mar 1 12:13:46 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#33 stuck for 24s! [python:5754]
Mar 1 12:13:46 cryo801w kernel: NMI watchdog: BUG: soft lockup - CPU#33 stuck for 24s! [python:5754]
Mar 1 12:13:46 cryo801w kernel: Modules linked in: nvidia_uvm(OE) cmac arc4 md4 nls_utf8 cifs ccm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad xfs vfat fat rdma_cm ib_cm iw_cm dm_mirror dm_region_hash dm_log dm_mod nvidia_drm(POE) nvidia_modeset(POE) amd64_edac_mod edac_mce_amd kvm_amd kvm nvidia(POE) irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw
Mar 1 12:13:46 cryo801w kernel: gf128mul glue_helper ablk_helper cryptd pcspkr raid456 async_raid6_recov async_memcpy async_pq snd_hda_codec_hdmi raid6_pq libcrc32c async_xor xor async_tx snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq joydev snd_seq_device snd_pcm raid0 snd_timer snd soundcore sg bnxt_re ib_core i2c_piix4 k10temp ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform pinctrl_amd i2c_designware_core acpi_cpufreq ip_tables rndis_host cdc_ether usbnet mii ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci bnxt_en libata nvme devlink nvme_core drm_panel_orientation_quirks nfit libnvdimm fuse
Mar 1 12:13:46 cryo801w kernel: CPU: 33 PID: 5754 Comm: python Kdump: loaded Tainted: P OEL ------------ 3.10.0-1160.59.1.el7.x86_64 #1
Mar 1 12:13:46 cryo801w kernel: Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Mar 1 12:13:46 cryo801w kernel: task: ffff9be147523180 ti: ffff9b77a9034000 task.ti: ffff9b77a9034000
Mar 1 12:13:46 cryo801w kernel: RIP: 0010:[<ffffffffa51ffa90>] [<ffffffffa51ffa90>] iommu_unmap_page+0x0/0x110
Mar 1 12:13:46 cryo801w kernel: RSP: 0018:ffff9b77a9037920 EFLAGS: 00000206
Mar 1 12:13:46 cryo801w kernel: RAX: 0000008000000000 RBX: 0000000000000002 RCX: 0000000000000027
Mar 1 12:13:46 cryo801w kernel: RDX: 0000000000001000 RSI: 000084f25e11f000 RDI: ffff9be18b12a000
Mar 1 12:13:46 cryo801w kernel: RBP: ffff9b77a9037958 R08: 0000000000000002 R09: 0000000000000000
Mar 1 12:13:46 cryo801w kernel: R10: 000000000000001b R11: 000ffffffffff000 R12: 000000000000001b
Mar 1 12:13:46 cryo801w kernel: R13: 000ffffffffff000 R14: 0000000000000002 R15: ffff9b77a9037918
Mar 1 12:13:46 cryo801w kernel: FS: 00007fba44ffd700(0000) GS:ffff9be18e040000(0000) knlGS:0000000000000000
Mar 1 12:13:46 cryo801w kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 1 12:13:46 cryo801w kernel: CR2: 00007fec3967a871 CR3: 0000007dde5de000 CR4: 0000000000340fe0
Mar 1 12:13:46 cryo801w kernel: Call Trace:
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa5201f13>] ? __unmap_single.isra.22+0x63/0x200
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa520341f>] unmap_sg+0x5f/0x70
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09c1dfe>] nv_unmap_dma_map_scatterlist+0x8e/0xb0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09c2ac5>] nv_dma_unmap_pages+0x115/0x120 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09c2e0d>] nv_dma_unmap_alloc+0x3d/0x60 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc12f05a1>] _nv031630rm+0xc1/0x1b0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc0a0df8b>] ? _nv026794rm+0x9b/0xc0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc0a60f58>] ? _nv029964rm+0x118/0x120 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc1301ffc>] ? rm_gpu_ops_free_duped_handle+0x1c/0x60 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09d077a>] ? nvUvmInterfaceFreeDupedHandle+0x2a/0x40 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33c21d2>] ? uvm_ext_gpu_map_free_internal+0x62/0x90 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33c2e7e>] ? uvm_ext_gpu_map_free+0xe/0x20 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc3385334>] ? uvm_deferred_free_object_list+0x64/0x120 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33856ff>] ? uvm_va_space_destroy+0x30f/0x440 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc3372fe0>] ? uvm_release.isra.5+0x80/0xa0 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33730c4>] ? uvm_release_entry+0x54/0xb0 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4e5088c>] ? __fput+0xec/0x230
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4e50abe>] ? ____fput+0xe/0x10
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4cc299b>] ? task_work_run+0xbb/0xe0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4ca1954>] ? do_exit+0x2d4/0xa30
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09cadfa>] ? os_release_spinlock+0x1a/0x20 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09dc96c>] ? _nv037891rm+0xac/0x1a0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc12f28a3>] ? rm_ioctl+0x63/0xb0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4ca212f>] ? do_group_exit+0x3f/0xa0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4cb328e>] ? get_signal_to_deliver+0x1ce/0x5e0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4c2c527>] ? do_signal+0x57/0x6f0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4df0042>] ? __tlb_remove_page+0x92/0xa0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4c2cc32>] ? do_notify_resume+0x72/0xc0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa539a2ef>] ? int_signal+0x12/0x17
Mar 1 12:13:46 cryo801w kernel: Code: 48 29 c1 48 89 f0 48 d3 e0 48 c1 e0 03 48 f7 d8 48 21 c7 48 89 f8 c3 0f 1f 40 00 31 ff 48 89 f8 5d c3 66 0f 1f 84 00 00 00 00 00 <66> 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 d3
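(As mentioned above, I was simply following the syslog while the job ran, roughly like this; on our CentOS 7 worker the kernel messages end up in /var/log/messages:)
tail -f /var/log/messages | grep -i "soft lockup"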
The system is also very unresponsive, but I guess this is a symptom of the locked CPUs. A restart helps, but rerunning the job results in the same erroneous behavior.
Any suggestions on how to fix this issue?
If you need additional information, log files, etc., please ask; I will supply them as far as possible.
Thanks for your help.
Christian
edit:
Even after killing the job in the CryoSPARC web app, the CPUs are still locked.
I tried to kill the python process with
kill -9 5754
but the CPUs were not freed. When checking the processes, there is still a defunct python process on the system:
# ps -ef | grep defunct
cryospa+ 4295 1 32 10:29 ? 00:40:01 [python] <defunct>
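(kill -9 cannot remove a defunct process, since it has already exited and is only waiting to be reaped by its parent, or is stuck inside the kernel. About all I could do was check who its parent is and, as root, where the task is stuck; something along these lines:)
ps -o ppid= -p 4295
cat /proc/4295/stack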
edit2:
After going to lunch and checking the system again about 20 minutes later, the soft lockups had cleared and the system was running fine again.
edit3:
We initially tried 250 2D classes. After reducing the number to 128, the jobs seem to run fine. We have 2.2 M particles with a box size of 350 px. The sample is very heterogeneous, so we would like a good classification. Is there a recommended upper limit on the number of 2D classes?
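(For scale, a rough estimate of the raw particle stack size, assuming uncompressed 32-bit floats and ignoring MRC headers, so the SSD cache alone holds on the order of a terabyte of particle data:)
echo $((2200000 * 350 * 350 * 4))   # about 1.08e12 bytes of float32 particle data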