My 2D classification jobs always get stuck or fail after some time when using large box sizes, e.g. 840. I tried splitting the particle stack, and the behavior is the same for both stacks (300,000 particles each). The nodes themselves spam this line on the terminal:
kernel:[ 5852.541115] watchdog: BUG: soft lockup - CPU#21 stuck for 22s! [python:32366]
with all the CPU numbers cycling through, and a new line appears roughly every 30 seconds.
Then either the job continues for a while and actually progresses through iterations while still spamming the line above, or after some time the job fails with:
Job is unresponsive - no heartbeat received in 30 seconds.
After this error, the following traceback appears multiple times:
[CPU: 6.58 GB] Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1027, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 87, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/particles.py", line 113, in get_original_real_data
    return self.blob.view().copy()
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 126, in view
    return self.get()
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 121, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_worker/cryosparc_compute/blobio/prefetch.py", line 64, in cryosparc_compute.blobio.prefetch.synchronous_native_read
RuntimeError: fopen: No such file or directory
The job uses the SSD cache and runs on 2 GPUs.
I am running cryoSPARC version 3.2.0 with NVIDIA driver version 470.57.02 and CUDA version 11.4 on Ubuntu 20.04 LTS (kernel 5.4.0-81-generic). The hardware is an AMD Ryzen Threadripper 3960X CPU and 2 RTX 3090 GPUs.
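For scale, here is a rough estimate of the data volume per stack at this box size. It is only a sketch: it assumes the particles are stored as 32-bit floats (the MRC data mode is not stated above), and the function and constant names are made up for illustration.

```python
# Rough size estimate for one particle stack.
# Assumption: 4 bytes per pixel (float32); 16-bit data would halve the result.
BOX_SIZE = 840          # pixels per side, as used in the failing jobs
N_PARTICLES = 300_000   # particles in each of the two split stacks
BYTES_PER_PIXEL = 4     # assumed, not confirmed from the MRC headers


def stack_size_bytes(box_size: int, n_particles: int,
                     bytes_per_pixel: int = BYTES_PER_PIXEL) -> int:
    """Return the raw pixel data size in bytes, ignoring file headers."""
    return box_size * box_size * bytes_per_pixel * n_particles


per_particle = stack_size_bytes(BOX_SIZE, 1)
total = stack_size_bytes(BOX_SIZE, N_PARTICLES)

print(f"{per_particle / 1e6:.2f} MB per particle")
print(f"{total / 1e9:.1f} GB per stack")
```

Under that assumption each particle is about 2.82 MB and each stack about 847 GB, so the cache SSD would presumably need at least that much free space per stack.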
Has anyone had this problem and know how to solve it?
Cheers,
Kilian