(4.5.1) RBMC crash (low particle count removal failed?)

Hi CryoSPARC team,

Had RBMC crash in a new way:

Resolution cutoffs: alignment 3.252 A, cross-validation 2.299 A
Removed 875 movies with fewer than 2 particles.
Recentering particles based on their aligned 3D poses...
Removing 96 particles too close to micrograph edges

--------------------------------------------------------------
        STARTING: OPTIMIZE HYPERPARAMETERS
--------------------------------------------------------------


Working with 1137 movies containing 25012 particles


Computing intended data cache configuration


SEARCH RANGES:
zs: 
	4.6052
	6.2146
	8.0064
	9.7981
thetas: 
	-1.9373
	-2.0420
	-2.1468
	-2.2515
	-2.3562
	-2.4609
	-2.5656
	-2.6704
	-2.7751
r start:
	0.1000
r end:
	10.0000
r step:
	0.4950


==================== BEGINNING ITERATION 1 ====================


Iteration overview (parameters to be tried):
---r---  -theta-  ---z---  |  -spatial-  -dist.-  --accel--
  0.100   -1.937    4.605  |   9.65e-01      100   9.11e-01
  0.100   -2.042    4.605  |   9.56e-01      100   9.15e-01
  0.100   -2.147    4.605  |   9.47e-01      100   9.20e-01
  0.100   -2.251    4.605  |   9.39e-01      100   9.25e-01
  0.100   -2.356    4.605  |   9.32e-01      100   9.32e-01
  0.100   -2.461    4.605  |   9.25e-01      100   9.39e-01
  0.100   -2.566    4.605  |   9.20e-01      100   9.47e-01
  0.100   -2.670    4.605  |   9.15e-01      100   9.56e-01
  0.100   -2.775    4.605  |   9.11e-01      100   9.65e-01
  0.100   -1.937    6.215  |   9.65e-01      500   9.11e-01
  0.100   -2.042    6.215  |   9.56e-01      500   9.15e-01
  0.100   -2.147    6.215  |   9.47e-01      500   9.20e-01
  0.100   -2.251    6.215  |   9.39e-01      500   9.25e-01
  0.100   -2.356    6.215  |   9.32e-01      500   9.32e-01
  0.100   -2.461    6.215  |   9.25e-01      500   9.39e-01
  0.100   -2.566    6.215  |   9.20e-01      500   9.47e-01
  0.100   -2.670    6.215  |   9.15e-01      500   9.56e-01
  0.100   -2.775    6.215  |   9.11e-01      500   9.65e-01
  0.100   -1.937    8.006  |   9.65e-01     3000   9.11e-01
  0.100   -2.042    8.006  |   9.56e-01     3000   9.15e-01
  0.100   -2.147    8.006  |   9.47e-01     3000   9.20e-01
  0.100   -2.251    8.006  |   9.39e-01     3000   9.25e-01
  0.100   -2.356    8.006  |   9.32e-01     3000   9.32e-01
  0.100   -2.461    8.006  |   9.25e-01     3000   9.39e-01
  0.100   -2.566    8.006  |   9.20e-01     3000   9.47e-01
  0.100   -2.670    8.006  |   9.15e-01     3000   9.56e-01
  0.100   -2.775    8.006  |   9.11e-01     3000   9.65e-01
  0.100   -1.937    9.798  |   9.65e-01    18000   9.11e-01
  0.100   -2.042    9.798  |   9.56e-01    18000   9.15e-01
  0.100   -2.147    9.798  |   9.47e-01    18000   9.20e-01
  0.100   -2.251    9.798  |   9.39e-01    18000   9.25e-01
  0.100   -2.356    9.798  |   9.32e-01    18000   9.32e-01
  0.100   -2.461    9.798  |   9.25e-01    18000   9.39e-01
  0.100   -2.566    9.798  |   9.20e-01    18000   9.47e-01
  0.100   -2.670    9.798  |   9.15e-01    18000   9.56e-01
  0.100   -2.775    9.798  |   9.11e-01    18000   9.65e-01



Cross-validation scores computed:
[▇▇▇▇▇---------------------------------------------------------------------------] 2728/40932 (7%)


DIE: [refmotion worker 3 (NVIDIA RTX A4000)] fatal error: Specified micrograph has less than two particles.
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer


====== Job process terminated abnormally.
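
A note on reading the iteration table in the log above: the printed numbers are consistent with the search being carried out in polar coordinates in log space, i.e. spatial = exp(r·cos θ), accel = exp(r·sin θ) and dist = exp(z). This relation is inferred purely from the values printed in this log, not from CryoSPARC documentation, so treat it as an assumption. A minimal sketch that reproduces the iteration 1 table and the total number of cross-validation scores (36 parameter combinations × 1137 movies):

```python
import math

# Values copied from the SEARCH RANGES block of the log above.
zs = [4.6052, 6.2146, 8.0064, 9.7981]
thetas = [-1.9373, -2.0420, -2.1468, -2.2515, -2.3562,
          -2.4609, -2.5656, -2.6704, -2.7751]
r = 0.1  # "r start"; every row of iteration 1 uses this radius


def to_hyperparameters(r, theta, z):
    """Map (r, theta, z) to (spatial, dist, accel).

    ASSUMPTION: this mapping is inferred from the printed table,
    it is not taken from CryoSPARC source code or documentation.
    """
    spatial = math.exp(r * math.cos(theta))
    accel = math.exp(r * math.sin(theta))
    dist = math.exp(z)
    return spatial, dist, accel


# Same ordering as the log: z varies slowest, theta fastest.
rows = [(r, t, z, *to_hyperparameters(r, t, z)) for z in zs for t in thetas]

for r_, t, z, spatial, dist, accel in rows:
    print(f"{r_:7.3f}  {t:7.3f}  {z:7.3f}  |  {spatial:9.2e}  {dist:7.0f}  {accel:9.2e}")

n_movies = 1137                       # "Working with 1137 movies"
total_scores = len(rows) * n_movies   # matches the 40932 in the progress bar
print(total_scores)
```

Under this reading, θ = -2.3562 (that is, -3π/4) is the diagonal on which the spatial and acceleration priors are equal, which matches the 9.32e-01 / 9.32e-01 rows.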

Rerunning now. It has picked different movies this time (at least it’s reporting a different movie count and particle count), so I hope it won’t happen again.

Will send more info privately if desired.

Thanks,
R

It happened again on the re-run, but with a different error (the job just died at random on the 5th iteration).

dmesg has the following:

[1690135.707482] python[964071]: segfault at 48 ip 00007fb1c6b87c6d sp 00007fb0f97fcbc0 error 4 in libcuda.so.550.54.15[7fb1c68db000+498000] likely on CPU 46 (core 14, socket 0)
[1690135.707499] Code: e9 fb 78 01 48 89 85 28 fe ff ff 48 85 c9 0f 85 4f fc ff ff 4d 85 f6 74 30 49 8b 86 88 00 00 00 4c 89 e6 48 89 95 20 fe ff ff <48> 8b 78 48 48 81 c7 08 01 00 00 e8 43 d4 15 00 48 8b 95 20 fe ff
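
For anyone trying to read the dmesg line above: on x86-64 Linux the `error` field is the page-fault error code, where bit 0 means the page was present, bit 1 means the access was a write, and bit 2 means the fault happened in user mode. The sketch below parses the line and decodes those bits; the function and field names are my own. An `error 4` read at address 0x48 looks like a near-null pointer dereference inside `libcuda.so`, though that alone doesn't say what triggered it.

```python
import re

# Kernel log line copied from the dmesg output above.
line = ("[1690135.707482] python[964071]: segfault at 48 ip 00007fb1c6b87c6d "
        "sp 00007fb0f97fcbc0 error 4 in libcuda.so.550.54.15[7fb1c68db000+498000] "
        "likely on CPU 46 (core 14, socket 0)")

# All numeric fields in this message are printed in hexadecimal.
pattern = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Parse a kernel segfault line and decode the page-fault error bits."""
    m = pattern.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"], 16)
    return {
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_mapping": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 0x1),       # bit 0: 0 = page not present
        "write_access": bool(err & 0x2),       # bit 1: 0 = read, 1 = write
        "user_mode": bool(err & 0x4),          # bit 2: 1 = fault in user mode
        "instruction_fetch": bool(err & 0x10), # bit 4: 1 = instruction fetch
    }


info = decode_segfault(line)
for key, value in info.items():
    print(f"{key}: {hex(value) if isinstance(value, int) and not isinstance(value, bool) else value}")
```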

Hi @rbs_sci, thanks for reporting this. This definitely isn’t supposed to happen; the job pre-screens the movies for micrographs with less than 2 particles, and the fatal error that you’re hitting is just a sanity check to make sure that assumption isn’t being violated. If you’re hitting it, there’s definitely a bug. I’ll look into this and let you know if I need additional information.

Harris

Thanks, @hsnyder!

I played with the parameters a little, and the job has now completed trajectory and hyperparameter calculation. It’s about a third of the way through particle processing.

Hi @hsnyder, any solutions? I have encountered a similar issue on CS v4.5.3.

:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792544910> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792547d60> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792545390> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
DIE: [refmotion worker 6 (NVIDIA GeForce RTX 2080 Ti)] fatal error: Specified micrograph has less than two particles.
movie 12221426732231969034: J237/imported/012221426732231969034_FoilHole_20139791_Data_20155404_34_20240422_162116_Fractions.tiff
/net/flash/flash/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 60756 Illegal instruction (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"

Hi @rbs_sci, have you managed to solve it? Could you share the parameters you used?

Hi @qchen,

I think I just got lucky with the randomised subset of particles for parameterisation, to be honest. However, here are some details of the successful run…

I went back to defaults for everything except Hyperparameter Search Thoroughness, which I set to “Extensive”. I always use that setting, as I don’t find it significantly slower than “Fast”, and in the two dataset test runs I did when RBMC was made public (comparing all three modes) it gave a small improvement in resolution over “Fast” or “Balanced”.

This gave me some parameters which I was initially a little sceptical of (Spatial prior strength: 4.8880e-03, Spatial correlation distance: 3000, Acceleration prior strength: 4.8880e-03) since the spatial prior and acceleration prior were the same, but the FCC fit and dose weighting look OK and particle motion tracks look believable.
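
On the two priors being identical: this is what you would expect if the search is parameterised in polar coordinates in log space, with spatial = exp(r·cos θ) and accel = exp(r·sin θ), and the optimum landed on the diagonal θ = -3π/4. That parameterisation is an assumption inferred from the iteration table in the first log of this thread, not something taken from CryoSPARC documentation. A minimal sketch inverting it for the reported values:

```python
import math


def to_polar(spatial, accel):
    """Invert the ASSUMED mapping spatial = exp(r cos t), accel = exp(r sin t)."""
    log_s, log_a = math.log(spatial), math.log(accel)
    r = math.hypot(log_s, log_a)
    theta = math.atan2(log_a, log_s)
    return r, theta


# Values reported for the successful run.
r, theta = to_polar(4.8880e-03, 4.8880e-03)
z = math.log(3000)  # spatial correlation distance, assuming dist = exp(z)

# Search grid from the first log: r start 0.1, r step 0.495.
grid_index = round((r - 0.1) / 0.495)

print(f"r = {r:.4f}, theta = {theta:.4f}, z = {z:.4f}, r grid index = {grid_index}")
```

If the successful run used the same r grid as the first log (r start 0.1, r step 0.495), these values fall on a grid point at r ≈ 7.525 with θ = -3π/4, so equal priors would be an ordinary grid point rather than a sign that something went wrong.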

I did, however, still have a warning flash up when it was working on the particle step:

WARNING: [refmotion worker 2 (NVIDIA RTX A4000)] error (movie will be skipped): Specified micrograph has less than two particles
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer

So like I said, I think I just got lucky that it didn’t pull that micrograph for hyperparameter optimisation.

@hsnyder,

I just checked the mic that caused the first RBMC run to fail… it’s the same micrograph.

DIE: [refmotion worker 3 (NVIDIA RTX A4000)] fatal error: Specified micrograph has less than two particles.
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer

So somehow this mic has snuck through the 2-particle cutoff, or has something else wrong with it.
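
One way to check whether a micrograph really falls below the cutoff is to count particles per micrograph UID directly. The sketch below uses only the standard library and made-up UIDs; the function name is hypothetical. With real data, the per-particle UIDs would have to come from the particle dataset (I am assuming a field along the lines of `location/micrograph_uid` in the particle `.cs` file, which should be verified against your own files).

```python
from collections import Counter


def micrographs_below_cutoff(particle_mic_uids, all_mic_uids, min_particles=2):
    """Return {uid: particle_count} for micrographs under the cutoff.

    particle_mic_uids: one micrograph UID per particle.
    all_mic_uids: every micrograph UID passed to the job, including
                  micrographs that ended up with no particles at all.
    """
    counts = Counter(particle_mic_uids)
    return {
        uid: counts.get(uid, 0)
        for uid in all_mic_uids
        if counts.get(uid, 0) < min_particles
    }


# Synthetic example: UIDs 101..104 are placeholders, not real movies.
particle_mic_uids = [101, 101, 101, 102, 103, 103]
all_mic_uids = [101, 102, 103, 104]

too_few = micrographs_below_cutoff(particle_mic_uids, all_mic_uids)
print(too_few)
```

Running the same count on the particles after any recentering and edge rejection, rather than on the input particles, is what would show whether a micrograph dropped below two particles later in the job.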

However, the second failed RBMC run did not fail on that micrograph. It just died during hyperparameter optimisation iteration 6 with no particular error other than:

====== Job process terminated abnormally.

And the dmesg output I reported previously.

I’ll ask my collaborators if it’s OK to share this micrograph with you (along with a good mic for comparison?) if you’d like it for testing.

Hi @rbs_sci, thanks. I have the same issue and am still struggling with it. @hsnyder

Hi @rbs_sci, sorry for the delay getting back to you on this. I have a theory regarding a possible cause… In 4.5 we introduced particle recentering and it’s also on by default. Are you using it? If so, one possible cause is particle recentering followed by rejection of particles too close to a micrograph edge. Those steps are done after the initial screening for micrographs with less than 2 particles. That ordering is definitely a bug, but I don’t know if it’s your bug. Are you using particle recentering? If so, can you isolate the problematic movie and see if running with recentering off fixes the issue?

@qchen tagging you in case the same workaround works for you as well.
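
To illustrate the ordering issue described above, here is a toy sketch. It is not CryoSPARC code: the function names, micrograph size, edge margin and particle records are all hypothetical. It only shows how screening before recentering and edge rejection can leave a micrograph with a single particle, while screening last cannot.

```python
from collections import Counter

MIC_SHAPE = (4096, 4096)  # hypothetical micrograph size in pixels (ny, nx)
MARGIN = 128              # hypothetical minimum distance from any edge


def screen(particles, min_particles=2):
    """Drop every particle on a micrograph with fewer than min_particles."""
    counts = Counter(p["mic"] for p in particles)
    return [p for p in particles if counts[p["mic"]] >= min_particles]


def recenter(particles):
    """Shift each particle by its (dx, dy) recentering offset."""
    return [{**p, "x": p["x"] + p["shift"][0], "y": p["y"] + p["shift"][1]}
            for p in particles]


def reject_near_edges(particles):
    """Drop particles closer than MARGIN pixels to a micrograph edge."""
    ny, nx = MIC_SHAPE
    return [p for p in particles
            if MARGIN <= p["x"] <= nx - MARGIN and MARGIN <= p["y"] <= ny - MARGIN]


def buggy_order(particles):
    # Screening happens first, so later rejections are never re-checked.
    return reject_near_edges(recenter(screen(particles)))


def fixed_order(particles):
    # Screening happens last, after all per-particle rejections.
    return screen(reject_near_edges(recenter(particles)))


particles = [
    {"mic": "A", "x": 2000, "y": 2000, "shift": (0, 0)},
    {"mic": "A", "x": 140, "y": 2000, "shift": (-20, 0)},  # recentered to x=120
    {"mic": "B", "x": 1000, "y": 1000, "shift": (5, 5)},
    {"mic": "B", "x": 3000, "y": 3000, "shift": (0, 0)},
]

buggy_counts = Counter(p["mic"] for p in buggy_order(particles))
fixed_counts = Counter(p["mic"] for p in fixed_order(particles))
print("screen first:", dict(buggy_counts))
print("screen last: ", dict(fixed_counts))
```

In the toy data, micrograph A passes the initial screen with two particles, loses one to edge rejection after recentering, and reaches the workers with one particle, which is the situation the sanity check complains about.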

Ah, that makes sense; I’ll check and update as appropriate!