Hi all,
we are facing an issue during 2D classification on our worker node.
In short:
We start the job, it is queued on the worker and begins running. After the particles have been loaded onto the SSD cache, the worker emits multiple warnings (they hit all CPUs, always with the same python PID):
Message from syslogd@cryo801w at Mar 1 12:18:18 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#44 stuck for 22s! [python:5754]
and eventually the job fails.
I checked the log file of the failed job; 422 of its 466 lines are just "sending heartbeat".
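(For reference, that count came from something like the following, run inside the job directory on the worker; the path to that directory is of course specific to our setup:)
grep -c "sending heartbeat" job.log
wc -l < job.log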
These are the last lines of the job.log:
========= sending heartbeat
**custom thread exception hook caught something
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.
I do not see any messages in the core log regarding this issue.
We do not see a similar issue on any of our other machines (multiple standalone installations); the major difference with the new worker is that it has an AMD CPU rather than an Intel one.
System:
CPU: AMD EPYC 7002
GPU: 4x NVIDIA RTX 3090
Cryosparc: Latest version and latest patch installed on master and worker.
The project folder is on the worker; the master accesses it via SSHFS. Import, motion correction, etc., and also some 2D classification jobs have already finished successfully.
I tried to understand the issue a bit better, so I monitored /var/log/messages on the worker while the job was running (the command is sketched after the excerpt) and captured the following:
Message from syslogd@cryo801w at Mar 1 12:13:46 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#33 stuck for 24s! [python:5754]
Mar 1 12:13:46 cryo801w kernel: NMI watchdog: BUG: soft lockup - CPU#33 stuck for 24s! [python:5754]
Mar 1 12:13:46 cryo801w kernel: Modules linked in: nvidia_uvm(OE) cmac arc4 md4 nls_utf8 cifs ccm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad xfs vfat fat rdma_cm ib_cm iw_cm dm_mirror dm_region_hash dm_log dm_mod nvidia_drm(POE) nvidia_modeset(POE) amd64_edac_mod edac_mce_amd kvm_amd kvm nvidia(POE) irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw
Mar 1 12:13:46 cryo801w kernel: gf128mul glue_helper ablk_helper cryptd pcspkr raid456 async_raid6_recov async_memcpy async_pq snd_hda_codec_hdmi raid6_pq libcrc32c async_xor xor async_tx snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq joydev snd_seq_device snd_pcm raid0 snd_timer snd soundcore sg bnxt_re ib_core i2c_piix4 k10temp ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform pinctrl_amd i2c_designware_core acpi_cpufreq ip_tables rndis_host cdc_ether usbnet mii ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci bnxt_en libata nvme devlink nvme_core drm_panel_orientation_quirks nfit libnvdimm fuse
Mar 1 12:13:46 cryo801w kernel: CPU: 33 PID: 5754 Comm: python Kdump: loaded Tainted: P OEL ------------ 3.10.0-1160.59.1.el7.x86_64 #1
Mar 1 12:13:46 cryo801w kernel: Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Mar 1 12:13:46 cryo801w kernel: task: ffff9be147523180 ti: ffff9b77a9034000 task.ti: ffff9b77a9034000
Mar 1 12:13:46 cryo801w kernel: RIP: 0010:[<ffffffffa51ffa90>] [<ffffffffa51ffa90>] iommu_unmap_page+0x0/0x110
Mar 1 12:13:46 cryo801w kernel: RSP: 0018:ffff9b77a9037920 EFLAGS: 00000206
Mar 1 12:13:46 cryo801w kernel: RAX: 0000008000000000 RBX: 0000000000000002 RCX: 0000000000000027
Mar 1 12:13:46 cryo801w kernel: RDX: 0000000000001000 RSI: 000084f25e11f000 RDI: ffff9be18b12a000
Mar 1 12:13:46 cryo801w kernel: RBP: ffff9b77a9037958 R08: 0000000000000002 R09: 0000000000000000
Mar 1 12:13:46 cryo801w kernel: R10: 000000000000001b R11: 000ffffffffff000 R12: 000000000000001b
Mar 1 12:13:46 cryo801w kernel: R13: 000ffffffffff000 R14: 0000000000000002 R15: ffff9b77a9037918
Mar 1 12:13:46 cryo801w kernel: FS: 00007fba44ffd700(0000) GS:ffff9be18e040000(0000) knlGS:0000000000000000
Mar 1 12:13:46 cryo801w kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 1 12:13:46 cryo801w kernel: CR2: 00007fec3967a871 CR3: 0000007dde5de000 CR4: 0000000000340fe0
Mar 1 12:13:46 cryo801w kernel: Call Trace:
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa5201f13>] ? __unmap_single.isra.22+0x63/0x200
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa520341f>] unmap_sg+0x5f/0x70
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09c1dfe>] nv_unmap_dma_map_scatterlist+0x8e/0xb0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09c2ac5>] nv_dma_unmap_pages+0x115/0x120 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09c2e0d>] nv_dma_unmap_alloc+0x3d/0x60 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc12f05a1>] _nv031630rm+0xc1/0x1b0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc0a0df8b>] ? _nv026794rm+0x9b/0xc0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc0a60f58>] ? _nv029964rm+0x118/0x120 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc1301ffc>] ? rm_gpu_ops_free_duped_handle+0x1c/0x60 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09d077a>] ? nvUvmInterfaceFreeDupedHandle+0x2a/0x40 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33c21d2>] ? uvm_ext_gpu_map_free_internal+0x62/0x90 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33c2e7e>] ? uvm_ext_gpu_map_free+0xe/0x20 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc3385334>] ? uvm_deferred_free_object_list+0x64/0x120 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33856ff>] ? uvm_va_space_destroy+0x30f/0x440 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc3372fe0>] ? uvm_release.isra.5+0x80/0xa0 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc33730c4>] ? uvm_release_entry+0x54/0xb0 [nvidia_uvm]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4e5088c>] ? __fput+0xec/0x230
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4e50abe>] ? ____fput+0xe/0x10
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4cc299b>] ? task_work_run+0xbb/0xe0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4ca1954>] ? do_exit+0x2d4/0xa30
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09cadfa>] ? os_release_spinlock+0x1a/0x20 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc09dc96c>] ? _nv037891rm+0xac/0x1a0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffc12f28a3>] ? rm_ioctl+0x63/0xb0 [nvidia]
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4ca212f>] ? do_group_exit+0x3f/0xa0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4cb328e>] ? get_signal_to_deliver+0x1ce/0x5e0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4c2c527>] ? do_signal+0x57/0x6f0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4df0042>] ? __tlb_remove_page+0x92/0xa0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa4c2cc32>] ? do_notify_resume+0x72/0xc0
Mar 1 12:13:46 cryo801w kernel: [<ffffffffa539a2ef>] ? int_signal+0x12/0x17
Mar 1 12:13:46 cryo801w kernel: Code: 48 29 c1 48 89 f0 48 d3 e0 48 c1 e0 03 48 f7 d8 48 21 c7 48 89 f8 c3 0f 1f 40 00 31 ff 48 89 f8 5d c3 66 0f 1f 84 00 00 00 00 00 <66> 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 d3
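(As mentioned above, I was simply following the syslog while the job ran, roughly like this; on our CentOS 7 worker the kernel messages end up in /var/log/messages:)
tail -f /var/log/messages | grep -i "soft lockup"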
The system is also very unresponsive, but I guess this is a symptom of the locked CPUs. A restart helps, but rerunning the job results in the same erroneous behavior.
Any suggestions on how to fix this issue?
If you need additional information, log files, etc., please ask; I will supply them as far as possible.
Thanks for your help.
Christian
edit:
Even after killing the job in the CryoSPARC web app, the CPUs are still locked.
I tried to kill the python process with
kill -9 5754
but the CPUs were not freed. When checking the processes, there is still a defunct python process on the system:
# ps -ef | grep defunct
cryospa+ 4295 1 32 10:29 ? 00:40:01 [python] <defunct>
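(kill -9 cannot remove a defunct process, since it has already exited and is only waiting to be reaped by its parent, or is stuck inside the kernel. About all I could do was check who its parent is and, as root, where the task is stuck; something along these lines:)
ps -o ppid= -p 4295
cat /proc/4295/stack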
edit2:
After going to lunch and checking the system again about 20 minutes later, the soft lockups had cleared and the system was running fine again.
edit3:
We initially tried 250 2D classes. After reducing the number to 128, the jobs seem to run fine. We have 2.2 M particles with a box size of 350 px. The sample is very heterogeneous, so we would like a good classification. Is there a recommended upper limit on the number of 2D classes?
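(For scale, a rough estimate of the raw particle stack size, assuming uncompressed 32-bit floats and ignoring MRC headers, so the SSD cache alone holds on the order of a terabyte of particle data:)
echo $((2200000 * 350 * 350 * 4))   # about 1.08e12 bytes of float32 particle data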