(4.5.1) RBMC crash (low particle count removal failed?)

rbs_sci · June 8, 2024, 3:54am

Hi CryoSPARC team,

Had RBMC crash in a new way:

Resolution cutoffs: alignment 3.252 A, cross-validation 2.299 A
Removed 875 movies with fewer than 2 particles.
Recentering particles based on their aligned 3D poses...
Removing 96 particles too close to micrograph edges

--------------------------------------------------------------
        STARTING: OPTIMIZE HYPERPARAMETERS
--------------------------------------------------------------


Working with 1137 movies containing 25012 particles


Computing intended data cache configuration


SEARCH RANGES:
zs: 
	4.6052
	6.2146
	8.0064
	9.7981
thetas: 
	-1.9373
	-2.0420
	-2.1468
	-2.2515
	-2.3562
	-2.4609
	-2.5656
	-2.6704
	-2.7751
r start:
	0.1000
r end:
	10.0000
r step:
	0.4950


==================== BEGINNING ITERATION 1 ====================


Iteration overview (parameters to be tried):
---r---  -theta-  ---z---  |  -spatial-  -dist.-  --accel--
  0.100   -1.937    4.605  |   9.65e-01      100   9.11e-01
  0.100   -2.042    4.605  |   9.56e-01      100   9.15e-01
  0.100   -2.147    4.605  |   9.47e-01      100   9.20e-01
  0.100   -2.251    4.605  |   9.39e-01      100   9.25e-01
  0.100   -2.356    4.605  |   9.32e-01      100   9.32e-01
  0.100   -2.461    4.605  |   9.25e-01      100   9.39e-01
  0.100   -2.566    4.605  |   9.20e-01      100   9.47e-01
  0.100   -2.670    4.605  |   9.15e-01      100   9.56e-01
  0.100   -2.775    4.605  |   9.11e-01      100   9.65e-01
  0.100   -1.937    6.215  |   9.65e-01      500   9.11e-01
  0.100   -2.042    6.215  |   9.56e-01      500   9.15e-01
  0.100   -2.147    6.215  |   9.47e-01      500   9.20e-01
  0.100   -2.251    6.215  |   9.39e-01      500   9.25e-01
  0.100   -2.356    6.215  |   9.32e-01      500   9.32e-01
  0.100   -2.461    6.215  |   9.25e-01      500   9.39e-01
  0.100   -2.566    6.215  |   9.20e-01      500   9.47e-01
  0.100   -2.670    6.215  |   9.15e-01      500   9.56e-01
  0.100   -2.775    6.215  |   9.11e-01      500   9.65e-01
  0.100   -1.937    8.006  |   9.65e-01     3000   9.11e-01
  0.100   -2.042    8.006  |   9.56e-01     3000   9.15e-01
  0.100   -2.147    8.006  |   9.47e-01     3000   9.20e-01
  0.100   -2.251    8.006  |   9.39e-01     3000   9.25e-01
  0.100   -2.356    8.006  |   9.32e-01     3000   9.32e-01
  0.100   -2.461    8.006  |   9.25e-01     3000   9.39e-01
  0.100   -2.566    8.006  |   9.20e-01     3000   9.47e-01
  0.100   -2.670    8.006  |   9.15e-01     3000   9.56e-01
  0.100   -2.775    8.006  |   9.11e-01     3000   9.65e-01
  0.100   -1.937    9.798  |   9.65e-01    18000   9.11e-01
  0.100   -2.042    9.798  |   9.56e-01    18000   9.15e-01
  0.100   -2.147    9.798  |   9.47e-01    18000   9.20e-01
  0.100   -2.251    9.798  |   9.39e-01    18000   9.25e-01
  0.100   -2.356    9.798  |   9.32e-01    18000   9.32e-01
  0.100   -2.461    9.798  |   9.25e-01    18000   9.39e-01
  0.100   -2.566    9.798  |   9.20e-01    18000   9.47e-01
  0.100   -2.670    9.798  |   9.15e-01    18000   9.56e-01
  0.100   -2.775    9.798  |   9.11e-01    18000   9.65e-01



Cross-validation scores computed:
[▇▇▇▇▇---------------------------------------------------------------------------] 2728/40932 (7%)


DIE: [refmotion worker 3 (NVIDIA RTX A4000)] fatal error: Specified micrograph has less than two particles.
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer


====== Job process terminated abnormally.
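
For anyone trying to read the iteration table above: the printed numbers are consistent with a polar parameterisation in which spatial = exp(r·cos θ), accel = exp(r·sin θ) and dist. = exp(z). That mapping is inferred purely from the values in this log and is an assumption, not documented CryoSPARC behaviour. A minimal Python sketch that rebuilds the first rows of the table and the cross-validation total under that assumption (the function name `candidate` is made up for illustration):

```python
import math

# Values copied from the "SEARCH RANGES" section of the log above.
zs = [4.6052, 6.2146, 8.0064, 9.7981]
thetas = [-1.9373, -2.0420, -2.1468, -2.2515, -2.3562,
          -2.4609, -2.5656, -2.6704, -2.7751]
r = 0.1  # "r start"; every row of iteration 1 uses this radius


def candidate(r, theta, z):
    """ASSUMED mapping, inferred from the printed numbers, not from documentation."""
    spatial = math.exp(r * math.cos(theta))
    accel = math.exp(r * math.sin(theta))
    dist = math.exp(z)
    return spatial, dist, accel


# Same ordering as the log: z is the outer loop, theta the inner loop.
rows = [(r, t, z, *candidate(r, t, z)) for z in zs for t in thetas]

for r_, t, z, spatial, dist, accel in rows[:3]:
    print(f"{r_:7.3f}  {t:7.3f}  {z:7.3f}  |  {spatial:9.2e}  {dist:7.0f}  {accel:9.2e}")

n_movies = 1137  # "Working with 1137 movies" in the log
total_scores = len(rows) * n_movies
print(len(rows), "parameter sets x", n_movies, "movies =", total_scores)
```

The 36 parameter sets times 1137 movies gives 40932, which matches the denominator of the progress bar, so each cross-validation score appears to be one (movie, parameter set) pair.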

Rerunning now, it’s picked different movies this time (at least it’s reporting a different movie count and particle count, so I hope it won’t happen again).

Will send more info privately if desired.

Thanks,
R

rbs_sci · June 10, 2024, 12:25am

Happened again on re-run, but different error (just died randomly on 5th iteration)

dmesg has the following:

[1690135.707482] python[964071]: segfault at 48 ip 00007fb1c6b87c6d sp 00007fb0f97fcbc0 error 4 in libcuda.so.550.54.15[7fb1c68db000+498000] likely on CPU 46 (core 14, socket 0)
[1690135.707499] Code: e9 fb 78 01 48 89 85 28 fe ff ff 48 85 c9 0f 85 4f fc ff ff 4d 85 f6 74 30 49 8b 86 88 00 00 00 4c 89 e6 48 89 95 20 fe ff ff <48> 8b 78 48 48 81 c7 08 01 00 00 e8 43 d4 15 00 48 8b 95 20 fe ff
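
A small Python sketch for pulling the useful fields out of a kernel segfault line like the one above. The regular expression targets the usual x86-64 Linux message layout, and the bit meanings are the standard x86 page-fault error-code bits. The variable names are arbitrary, and this only locates the fault; it does not explain it.

```python
import re

# The first dmesg line from the post, copied verbatim.
line = ("[1690135.707482] python[964071]: segfault at 48 ip 00007fb1c6b87c6d "
        "sp 00007fb0f97fcbc0 error 4 in libcuda.so.550.54.15[7fb1c68db000+498000] "
        "likely on CPU 46 (core 14, socket 0)")

pattern = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)
m = pattern.search(line)

fault_addr = int(m["addr"], 16)   # address the process tried to access
ip = int(m["ip"], 16)             # instruction pointer at the time of the fault
base = int(m["base"], 16)         # start of the mapping that contains ip
size = int(m["size"], 16)         # length of that mapping
err = int(m["err"], 16)           # x86 page-fault error code

offset = ip - base                # position of the faulting instruction in the mapping

# Standard x86 page-fault error-code bits.
page_present = bool(err & 1)      # 0 means the page was not mapped
was_write = bool(err & 2)         # 0 means the access was a read
user_mode = bool(err & 4)         # 1 means the fault happened in user space

print(f"library: {m['lib']}")
print(f"offset into mapping: {offset:#x} (mapping size {size:#x})")
print(f"fault address: {fault_addr:#x}, present={page_present}, "
      f"write={was_write}, user={user_mode}")
```

Here the fault is a user-mode read of an unmapped page at address 0x48, at offset 0x2acc6d into the libcuda mapping. A fault address that small usually means a field was read through a null pointer, which points at the driver library rather than at a particular micrograph.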

hsnyder · June 10, 2024, 2:42pm

Hi @rbs_sci, thanks for reporting this. This definitely isn’t supposed to happen; the job pre-screens the movies for micrographs with less than 2 particles, and the fatal error that you’re hitting is just a sanity check to make sure that assumption isn’t being violated. If you’re hitting it, there’s definitely a bug. I’ll look into this and let you know if I need additional information.

Harris

rbs_sci · June 10, 2024, 10:41pm

Thanks, @hsnyder!

I played with the parameters a little and it’s completed trajectory and hyperparameter calculation; it’s about a third of the way through particle processing now.

qchen · June 11, 2024, 6:07pm

Hi, @hsnyder Any solutions? I have encountered a similar issue on CS v4.5.3.

:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792544910> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792547d60> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792545390> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
DIE: [refmotion worker 6 (NVIDIA GeForce RTX 2080 Ti)] fatal error: Specified micrograph has less than two particles.
movie 12221426732231969034: J237/imported/012221426732231969034_FoilHole_20139791_Data_20155404_34_20240422_162116_Fractions.tiff
/net/flash/flash/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 60756 Illegal instruction (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"

qchen · June 12, 2024, 8:11am

Hi @rbs_sci Have you managed to solve it? Could you share the parameters you used?

rbs_sci · June 12, 2024, 9:06am

Hi @qchen,

I think I just got lucky with the randomised subset of particles for parameterisation, to be honest. However, here are some details of the successful run…

I went back to defaults for everything except Hyperparameter Search Thoroughness, which I set to “Extensive”. I always use that setting because I don’t find it significantly slower than “Fast”, and in the two dataset test runs I did when RBMC was made public (comparing all three modes) it gave a small improvement in resolution over “Fast” or “Balanced”.

This gave me some parameters which I was initially a little sceptical of (Spatial prior strength: 4.8880e-03, Spatial correlation distance: 3000, Acceleration prior strength: 4.8880e-03) since the spatial prior and acceleration prior were the same, but the FCC fit and dose weighting look OK and particle motion tracks look believable.
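
One way to sanity-check identical spatial and acceleration priors is to map them back onto the search grid printed in the first post. The sketch below assumes the mapping spatial = exp(r·cos θ), accel = exp(r·sin θ), which is inferred from the numbers in that log and is not documented CryoSPARC behaviour, then inverts it for the values quoted above.

```python
import math

spatial = 4.8880e-03  # "Spatial prior strength" quoted above
accel = 4.8880e-03    # "Acceleration prior strength" quoted above

# ASSUMED mapping (inferred from the log, not from documentation):
#   spatial = exp(r * cos(theta)),  accel = exp(r * sin(theta))
x, y = math.log(spatial), math.log(accel)
r = math.hypot(x, y)
theta = math.atan2(y, x)

# Search grid from the log: r starts at 0.1 and advances in steps of 0.495.
steps = (r - 0.1) / 0.495

print(f"r = {r:.3f}, theta = {theta:.4f}, steps from r start = {steps:.2f}")
```

Under that assumption the result lands on θ = −3π/4 (the middle entry, −2.3562, of the thetas list) and r ≈ 7.525, which is 15 steps of 0.495 above r start. In other words, equal priors are what any point on the central angle of the grid looks like, so the equality alone does not indicate a failed search.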

I did, however, still have a warning flash up when it was working on the particle step:

WARNING: [refmotion worker 2 (NVIDIA RTX A4000)] error (movie will be skipped): Specified micrograph has less than two particles
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer

So like I said, I think I just got lucky that it didn’t pull that micrograph for hyperparameter optimisation.

…

@hsnyder,

I just checked the mic that caused the first RBMC run to fail… it’s the same micrograph.

DIE: [refmotion worker 3 (NVIDIA RTX A4000)] fatal error: Specified micrograph has less than two particles.
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer

So somehow this mic has snuck through the 2 particle cutoff, or has something else wrong with it.

However, the second failed RBMC run did not fail on that micrograph, it just died during hyperparameter optimisation iteration 6 with no particular error other than:

====== Job process terminated abnormally.

And the dmesg output I reported previously.

I’ll ask my collaborators if it’s OK to share this micrograph with you if you’d like it (and a good mic?) for testing.

qchen · June 12, 2024, 10:19am

Hi @rbs_sci, thanks. I have the same issue and am still struggling with it. @hsnyder

hsnyder · June 17, 2024, 3:27pm

Hi @rbs_sci, sorry for the delay getting back to you on this. I have a theory regarding a possible cause… In 4.5 we introduced particle recentering and it’s also on by default. Are you using it? If so, one possible cause is particle recentering followed by rejection of particles too close to a micrograph edge. Those steps are done after the initial screening for micrographs with less than 2 particles. That ordering is definitely a bug, but I don’t know if it’s your bug. Are you using particle recentering? If so, can you isolate the problematic movie and see if running with recentering off fixes the issue?

@qchen tagging you in case the same workaround works for you as well.
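
The ordering problem described above can be illustrated with a toy Python example. This is not CryoSPARC code: the particle dictionaries, function names, micrograph size and margin are all made up for illustration. It only shows why screening for sparse movies has to happen after edge rejection.

```python
from collections import Counter

MIN_PARTICLES = 2  # hypothetical threshold matching the "fewer than 2 particles" screen


def drop_sparse_movies(particles):
    """Keep only particles whose movie has at least MIN_PARTICLES particles."""
    counts = Counter(p["movie"] for p in particles)
    return [p for p in particles if counts[p["movie"]] >= MIN_PARTICLES]


def drop_edge_particles(particles, width, height, margin):
    """Reject particles whose (recentered) position lies within `margin` px of an edge."""
    return [
        p for p in particles
        if margin <= p["x"] <= width - margin and margin <= p["y"] <= height - margin
    ]


particles = [
    {"movie": "A", "x": 500, "y": 500},
    {"movie": "A", "x": 5, "y": 500},    # too close to the left edge
    {"movie": "B", "x": 300, "y": 300},
    {"movie": "B", "x": 700, "y": 700},
]

# Order described as the bug: screen first, then reject edge particles.
buggy = drop_edge_particles(drop_sparse_movies(particles), 1000, 1000, 50)

# Safer order: reject edge particles first, then screen.
fixed = drop_sparse_movies(drop_edge_particles(particles, 1000, 1000, 50))

buggy_counts = Counter(p["movie"] for p in buggy)
fixed_counts = Counter(p["movie"] for p in fixed)
print("screen then reject:", dict(buggy_counts))
print("reject then screen:", dict(fixed_counts))
```

With the first ordering, movie A passes the screen with two particles and then loses one to edge rejection, so it reaches the later stages with a single particle. With the second ordering it is removed before that can happen.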

rbs_sci · June 17, 2024, 9:53pm

Ah, that makes sense; I’ll check and update as appropriate!

sjcalise · July 15, 2024, 11:03pm

@rbs_sci did turning off particle recentering fix it? I’m getting this same bug (we are on v.4.5.3 now)

rbs_sci · July 15, 2024, 11:12pm

It didn’t make any difference.

I’ve got a couple of other test runs for a related issue running on different datasets at the minute, although I will want to finish the full pipeline as a sanity check before I post further (and it’s a big dataset so isn’t going very quickly…)

roms2332 · July 16, 2024, 11:55pm

Hi everyone, I just want to say that I encountered the same issue, and disabled the recentering option, and it worked (at least so far). So in some cases, this apparently is enough to fix it.

rbs_sci · July 16, 2024, 11:59pm

Hm. I got some very odd dose weighting plots.

Definitely not trusting that. Will try something else.

I don’t think the hyperparameter search converged correctly either; I used “Fast” because “Extensive” was taking hours to progress just 1% of a single iteration. The poor dose weighting is probably related to that.

sjcalise · July 17, 2024, 5:54pm

Definitely odd-looking plots. I always run with Fast thoroughness and have never seen anything like that. Btw, what kind of resolution improvement are you typically getting by switching from Fast to Extensive? I suppose it depends on the dataset, but I’ve gotten nice 0.2 Å resolution improvements from RBMC on multiple datasets using just Fast.

rbs_sci · July 18, 2024, 12:47am

This dataset has been difficult in general because of what it is, but I wasn’t expecting miracles out of that test - although I wasn’t expecting that result either…

Improvement depends on the dataset. I’ve seen effectively zero improvement (second decimal place territory) and as much as 0.5 Å on some data (where whole-frame motion correction was used).

Normally there isn’t much difference between estimation modes; the hyperparameters end up more or less the same. Sometimes the convergence plots are a little jumpy with fast, and much smoother with extensive.

More tests to do.

hsnyder · July 26, 2024, 6:02pm

Hi @rbs_sci I’m just chiming in to say that I am following this thread. I have no idea what happened with the FCC plot, but as you surmised, something is very wrong. It sounds like you’re on it, but if you can post a couple example particle plots, a description of the processing pipeline, and/or the hyperparam estimation plots, I may be able to advise.

hsnyder · August 8, 2024, 9:58pm

The bug I mentioned earlier, where particle recentering is done after we screen for micrographs with less than 2 particles, has been fixed in the patch just released today. Happy processing.

JinsungKim24 · August 28, 2024, 5:42pm

Using v.4.5.3, the same error still occurred:
‘fatal error: Specified micrograph has less than two particles.’
The message appeared at approximately the 74% mark during the cross-validation score computation.

hsnyder · August 28, 2024, 5:46pm

@JinsungKim24 can you post your full CryoSPARC version? It appears on the CryoSPARC home page next to the logo.