2D Classification gets stuck/fails with large box sizes (840)

KiSchnelle · August 24, 2021, 11:00am

My 2D classifications always get stuck or fail after some time when using large box sizes, e.g. 840. I tried splitting the particle stack, and the result is the same for both stacks (300,000 particles each). The nodes themselves spam this line on the terminal:

kernel:[ 5852.541115] watchdog: BUG: soft lockup - CPU#21 stuck for 22s! [python:32366]

with all the CPU numbers, pretty much a new line every 30 seconds.
Then either the job continues for a bit and actually progresses through iterations while still spamming the above line, or after some time the job fails with:

Job is unresponsive - no heartbeat received in 30 seconds.

After this error, the following traceback appears multiple times:

[CPU: 6.58 GB]   Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1027, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 87, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/particles.py", line 113, in get_original_real_data
    return self.blob.view().copy()
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 126, in view
    return self.get()
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 121, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_worker/cryosparc_compute/blobio/prefetch.py", line 64, in cryosparc_compute.blobio.prefetch.synchronous_native_read
RuntimeError: fopen: No such file or directory

The job is using the SSD cache and runs on 2 GPUs.

Running cryoSPARC version 3.2.0, NVIDIA driver version 470.57.02, CUDA version 11.4, Ubuntu 20.04 LTS (5.4.0-81-generic). The hardware is an AMD Ryzen Threadripper 3960X CPU and 2 RTX 3090 GPUs.

Has anyone had this problem before, and does anyone know how to solve it?

cheers
Kilian

vamsee · August 24, 2021, 3:44pm

You do not need to classify at a large box size. I’d suggest downsampling your particles (to ~84 px or 128 px) during extraction, 2D classifying them, and then, once you are confident about your final particle set, going back and re-extracting at full size. That saves you space and computation power overall and is a lot faster too.
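
As a rough sketch of what this downsampling does to the attainable resolution: cropping a box in Fourier space from N to M pixels scales the pixel size by N/M, and the Nyquist limit is twice the new pixel size. The 0.23 Å/px value is the one quoted later in this thread for the 840 px particles; the function name is made up for illustration.

```python
def downsampled_nyquist(box_px, new_box_px, pixel_size_A):
    """Return (new pixel size, Nyquist limit) in Angstrom after Fourier cropping."""
    # The physical extent of the box is unchanged, so the pixel size grows
    # by the ratio of old to new box size.
    new_pixel_size = pixel_size_A * box_px / new_box_px
    # Nyquist limit: the finest resolvable detail spans two pixels.
    return new_pixel_size, 2.0 * new_pixel_size


# Assumption: 840 px box at 0.23 A/px, as mentioned later in the thread.
for new_box in (84, 128):
    apix, nyquist = downsampled_nyquist(840, new_box, 0.23)
    print(f"{new_box} px: {apix:.2f} A/px, Nyquist {nyquist:.2f} A")
```

Under these assumptions the 84 px box keeps information to about 4.6 Å and the 128 px box to about 3.0 Å, which is usually plenty for sorting particles in 2D.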

KiSchnelle · August 25, 2021, 7:19am

Yeah, I know it's not so smart, to say the least :D But since I had both, I just wanted to try it and compare the particles I end up with for both ways, just out of curiosity. And I mean, it should work anyway, shouldn't it?

Edit:
I now get the same stuck-CPU error in Homogeneous Refinement for 134k particles with box size 840. Ab-initio worked totally normally. The job terminates with:

[CPU: 27.13 GB]  ====== Starting Refinement Iterations ======
[CPU: 27.13 GB]  ----------------------------- Start Iteration 0
[CPU: 27.13 GB]    Using Max Alignment Radius 6.370 (30.000A)
[CPU: 27.13 GB]    Auto batchsize: 177 in each split
[CPU: 36.25 GB]  -- THR 0 BATCH 488 NUM 177 TOTAL 1.0933723 ELAPSED 157.44071 --
[CPU: 42.89 GB]    Processed 354.000 images in 204.630s.
[CPU: 47.59 GB]    Computing FSCs... 
[CPU: 16.6 MB]   ====== Job process terminated abnormally.

vamsee · August 25, 2021, 5:14pm

Ab-initio would work because its final resolution is limited to 12 Å, so nothing in the high-resolution range is used. As soon as you put the high-res info back in, it’ll fail. All of this comes back to your box size still being too big. Unless your particle is actually ~420 px in size, you really don’t need 840 px. The highest I’ve tried is around 600 px, and that worked reasonably well, albeit slowly.
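
To put the box sizes in perspective, here is a back-of-the-envelope sketch of the raw data sizes involved. It assumes 32-bit floats (4 bytes per pixel or voxel) and counts only the raw arrays; how much memory cryoSPARC actually allocates internally is not something this thread shows, and the helper names are made up.

```python
BYTES_PER_VALUE = 4  # assumption: data held as 32-bit floats


def stack_bytes(n_particles, box_px):
    """Raw size of a stack of square particle images."""
    return n_particles * box_px * box_px * BYTES_PER_VALUE


def volume_bytes(box_px):
    """Raw size of one cubic 3D volume at the same box size."""
    return box_px ** 3 * BYTES_PER_VALUE


# 134,000 particles as in the refinement job above.
for box in (128, 600, 840):
    stack_gb = stack_bytes(134_000, box) / 1e9
    vol_gb = volume_bytes(box) / 1e9
    print(f"box {box}: stack {stack_gb:.1f} GB, one volume {vol_gb:.2f} GB")
```

At 840 px this comes to roughly 378 GB for the 134k particle stack and about 2.4 GB for every single volume-sized array. Because the volume grows with the cube of the box size, going from 600 px to 840 px nearly triples it.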

KiSchnelle · August 27, 2021, 10:16am

Well, I wanted to compare the results from all the different EER scaling factors, and when using the 16k images you have around 0.23 Å/px. This was just for testing and comparison; I don't think it is very useful to even use an upsampling factor of 4. But how would I make use of the 16k images if I can't refine unbinned particles?
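
For reference, the numbers behind this comparison can be sketched as follows. The sketch assumes the 16k images correspond to an upsampling factor of 4 on a detector with 4096 physical pixels per side, and it scales the 840 px box so that it covers the same physical area at each factor; the function name is hypothetical.

```python
PHYSICAL_PX = 4096  # assumption: physical detector pixels per side
APIX_AT_4X = 0.23   # A/px at upsampling factor 4 (the 16k images)
BOX_AT_4X = 840     # box size used at upsampling factor 4


def eer_grid(upsampling):
    """Return (image size px, pixel size A, equivalent box px, Nyquist A)."""
    image_px = PHYSICAL_PX * upsampling
    apix = APIX_AT_4X * 4 / upsampling
    # Keep the physical box extent (840 * 0.23 A) constant.
    box_px = round(BOX_AT_4X * upsampling / 4)
    return image_px, apix, box_px, 2.0 * apix


for factor in (1, 2, 4):
    image_px, apix, box_px, nyquist = eer_grid(factor)
    print(f"x{factor}: {image_px} px image, {apix:.2f} A/px, "
          f"box {box_px} px, Nyquist {nyquist:.2f} A")
```

Under these assumptions the same particle fits in a 420 px box at factor 2 (0.46 Å/px) and a 210 px box at factor 1 (0.92 Å/px).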