(4.5.1) RBMC crash (low particle count removal failed?)

Hi CryoSPARC team,

I had a Reference Based Motion Correction (RBMC) job crash in a new way:

Resolution cutoffs: alignment 3.252 A, cross-validation 2.299 A
Removed 875 movies with fewer than 2 particles.
Recentering particles based on their aligned 3D poses...
Removing 96 particles too close to micrograph edges

--------------------------------------------------------------
        STARTING: OPTIMIZE HYPERPARAMETERS
--------------------------------------------------------------


Working with 1137 movies containing 25012 particles


Computing intended data cache configuration


SEARCH RANGES:
zs: 
	4.6052
	6.2146
	8.0064
	9.7981
thetas: 
	-1.9373
	-2.0420
	-2.1468
	-2.2515
	-2.3562
	-2.4609
	-2.5656
	-2.6704
	-2.7751
r start:
	0.1000
r end:
	10.0000
r step:
	0.4950


==================== BEGINNING ITERATION 1 ====================


Iteration overview (parameters to be tried):
---r---  -theta-  ---z---  |  -spatial-  -dist.-  --accel--
  0.100   -1.937    4.605  |   9.65e-01      100   9.11e-01
  0.100   -2.042    4.605  |   9.56e-01      100   9.15e-01
  0.100   -2.147    4.605  |   9.47e-01      100   9.20e-01
  0.100   -2.251    4.605  |   9.39e-01      100   9.25e-01
  0.100   -2.356    4.605  |   9.32e-01      100   9.32e-01
  0.100   -2.461    4.605  |   9.25e-01      100   9.39e-01
  0.100   -2.566    4.605  |   9.20e-01      100   9.47e-01
  0.100   -2.670    4.605  |   9.15e-01      100   9.56e-01
  0.100   -2.775    4.605  |   9.11e-01      100   9.65e-01
  0.100   -1.937    6.215  |   9.65e-01      500   9.11e-01
  0.100   -2.042    6.215  |   9.56e-01      500   9.15e-01
  0.100   -2.147    6.215  |   9.47e-01      500   9.20e-01
  0.100   -2.251    6.215  |   9.39e-01      500   9.25e-01
  0.100   -2.356    6.215  |   9.32e-01      500   9.32e-01
  0.100   -2.461    6.215  |   9.25e-01      500   9.39e-01
  0.100   -2.566    6.215  |   9.20e-01      500   9.47e-01
  0.100   -2.670    6.215  |   9.15e-01      500   9.56e-01
  0.100   -2.775    6.215  |   9.11e-01      500   9.65e-01
  0.100   -1.937    8.006  |   9.65e-01     3000   9.11e-01
  0.100   -2.042    8.006  |   9.56e-01     3000   9.15e-01
  0.100   -2.147    8.006  |   9.47e-01     3000   9.20e-01
  0.100   -2.251    8.006  |   9.39e-01     3000   9.25e-01
  0.100   -2.356    8.006  |   9.32e-01     3000   9.32e-01
  0.100   -2.461    8.006  |   9.25e-01     3000   9.39e-01
  0.100   -2.566    8.006  |   9.20e-01     3000   9.47e-01
  0.100   -2.670    8.006  |   9.15e-01     3000   9.56e-01
  0.100   -2.775    8.006  |   9.11e-01     3000   9.65e-01
  0.100   -1.937    9.798  |   9.65e-01    18000   9.11e-01
  0.100   -2.042    9.798  |   9.56e-01    18000   9.15e-01
  0.100   -2.147    9.798  |   9.47e-01    18000   9.20e-01
  0.100   -2.251    9.798  |   9.39e-01    18000   9.25e-01
  0.100   -2.356    9.798  |   9.32e-01    18000   9.32e-01
  0.100   -2.461    9.798  |   9.25e-01    18000   9.39e-01
  0.100   -2.566    9.798  |   9.20e-01    18000   9.47e-01
  0.100   -2.670    9.798  |   9.15e-01    18000   9.56e-01
  0.100   -2.775    9.798  |   9.11e-01    18000   9.65e-01



Cross-validation scores computed:
[▇▇▇▇▇---------------------------------------------------------------------------] 2728/40932 (7%)


DIE: [refmotion worker 3 (NVIDIA RTX A4000)] fatal error: Specified micrograph has less than two particles.
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer


====== Job process terminated abnormally.

Rerunning now, it's picked different movies this time (at least it's reporting a different movie count and particle count, so I hope it won't happen again).

Will send more info privately if desired.

Thanks,
R

It happened again on the re-run, but with a different error (it just died randomly on the 5th iteration).

dmesg has the following:

[1690135.707482] python[964071]: segfault at 48 ip 00007fb1c6b87c6d sp 00007fb0f97fcbc0 error 4 in libcuda.so.550.54.15[7fb1c68db000+498000] likely on CPU 46 (core 14, socket 0)
[1690135.707499] Code: e9 fb 78 01 48 89 85 28 fe ff ff 48 85 c9 0f 85 4f fc ff ff 4d 85 f6 74 30 49 8b 86 88 00 00 00 4c 89 e6 48 89 95 20 fe ff ff <48> 8b 78 48 48 81 c7 08 01 00 00 e8 43 d4 15 00 48 8b 95 20 fe ff

Hi @rbs_sci, thanks for reporting this. This definitely isn't supposed to happen; the job pre-screens the movies for micrographs with less than 2 particles, and the fatal error that you're hitting is just a sanity check to make sure that assumption isn't being violated. If you're hitting it, there's definitely a bug. I'll look into this and let you know if I need additional information.

Harris

Thanks, @hsnyder! :slight_smile:

I played with the parameters a little and it's completed the trajectory and hyperparameter calculations; it's about a third of the way through particle processing now.

Hi @hsnyder, any solutions? I have encountered a similar issue on CS v4.5.3.

:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792544910> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792547d60> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
:1: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0792545390> (size 1). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
DIE: [refmotion worker 6 (NVIDIA GeForce RTX 2080 Ti)] fatal error: Specified micrograph has less than two particles.
movie 12221426732231969034: J237/imported/012221426732231969034_FoilHole_20139791_Data_20155404_34_20240422_162116_Fractions.tiff
/net/flash/flash/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 60756 Illegal instruction (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"

Hi @rbs_sci, have you managed to solve it? Could you share the parameters you used?

Hi @qchen,

I think I just got lucky with the randomised subset of particles for parameterisation, to be honest. However, here are some details of the successful run…

I went back to defaults for everything except setting Hyperparameter Search Thoroughness to "Extensive", which I always use as I don't find it significantly slower than "Fast", and in the two dataset test runs I did when RBMC was made public (comparing all three modes) it gave a small improvement in resolution over "Fast" or "Balanced".

This gave me some parameters which I was initially a little sceptical of (Spatial prior strength: 4.8880e-03, Spatial correlation distance: 3000, Acceleration prior strength: 4.8880e-03) since the spatial prior and acceleration prior were the same, but the FCC fit and dose weighting look OK and particle motion tracks look believable.
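
A possible reason the two priors match, offered only as an inference from the log earlier in this thread and not as documented CryoSPARC behaviour: the iteration table is numerically consistent with a polar search grid in which the spatial prior is exp(r·cos θ) and the acceleration prior is exp(r·sin θ). The function name below is hypothetical. Under that assumption, the middle θ value (-2.3562, i.e. -3π/4) makes cos θ and sin θ equal, so the two priors coincide for any r.

```python
import math

def priors_from_polar(r, theta):
    """Hypothetical mapping inferred from the logged table (not a CryoSPARC API).

    Returns (spatial prior strength, acceleration prior strength).
    """
    spatial = math.exp(r * math.cos(theta))
    accel = math.exp(r * math.sin(theta))
    return spatial, accel

# First row of the logged iteration table: r = 0.100, theta = -1.937
print(priors_from_polar(0.1, -1.9373))

# r = 0.1 + 15 * 0.495 = 7.525 sits on the logged r grid;
# theta = -3*pi/4 is the middle of the logged theta range.
print(priors_from_polar(7.525, -3 * math.pi / 4))
```

The first call reproduces the logged 9.65e-01 and 9.11e-01 to three significant figures, and the second gives roughly 4.888e-03 for both priors. If the inferred mapping is right, identical priors simply mean the search settled on the middle θ and are not suspicious in themselves.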

I did, however, still have a warning flash up when it was working on the particle step:

WARNING: [refmotion worker 2 (NVIDIA RTX A4000)] error (movie will be skipped): Specified micrograph has less than two particles
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer

So like I said, I think I just got lucky that it didn't pull that micrograph for hyperparameter optimisation.

…

@hsnyder,

I just checked the mic that caused the first RBMC run to fail… it's the same micrograph.

DIE: [refmotion worker 3 (NVIDIA RTX A4000)] fatal error: Specified micrograph has less than two particles.
movie 1095300623923001321: J12/imported/001095300623923001321_FoilHole_28436071_Data_28423640_1_20240529_003121_EER.eer

So somehow this mic has snuck through the 2 particle cutoff, or has something else wrong with it.
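
One way to check whether a micrograph really slipped under the cutoff is to count particles per micrograph directly. The sketch below assumes the particle dataset can be read into a NumPy structured array and that the parent micrograph UID is stored in a field called `location/micrograph_uid`; both the field name and the helper names are assumptions, so adjust them to whatever your export contains.

```python
import numpy as np

# Assumed field name for the parent micrograph UID in a particle dataset.
UID_FIELD = "location/micrograph_uid"

def particles_per_micrograph(particles, uid_field=UID_FIELD):
    """Return {micrograph_uid: particle_count} for a structured array."""
    uids, counts = np.unique(particles[uid_field], return_counts=True)
    return dict(zip(uids.tolist(), counts.tolist()))

def micrographs_below_cutoff(particles, min_particles=2, uid_field=UID_FIELD):
    """List the UIDs of micrographs holding fewer than min_particles particles."""
    counts = particles_per_micrograph(particles, uid_field)
    return sorted(uid for uid, n in counts.items() if n < min_particles)

# Synthetic stand-in for a particle dataset; a real one might be loaded with
# np.load("exported_particles.cs") (hypothetical filename).
demo = np.zeros(5, dtype=[(UID_FIELD, "<u8")])
demo[UID_FIELD] = [11, 11, 22, 33, 33]
print(micrographs_below_cutoff(demo))
```

This only counts particles as they are stored in the input. It does not reproduce any filtering the job applies afterwards, so a micrograph that passes this check could still end up with fewer than two particles inside the job.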

However, the second failed RBMC run did not fail on that micrograph, it just died during hyperparameter optimisation iteration 6 with no particular error other than:

====== Job process terminated abnormally.

And the dmesg output I reported previously.

I'll ask my collaborators if it's OK to share this micrograph with you if you'd like it (and a good mic?) for testing.

Hi @rbs_sci, thanks. I have the same issue and am still struggling with it. @hsnyder

Hi @rbs_sci, sorry for the delay getting back to you on this. I have a theory regarding a possible cause… In 4.5 we introduced particle recentering and it's also on by default. Are you using it? If so, one possible cause is particle recentering followed by rejection of particles too close to a micrograph edge. Those steps are done after the initial screening for micrographs with less than 2 particles. That ordering is definitely a bug, but I don't know if it's your bug. Are you using particle recentering? If so, can you isolate the problematic movie and see if running with recentering off fixes the issue?

@qchen tagging you in case the same workaround works for you as well.
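
The ordering problem described in this post can be illustrated with a toy example. This is a minimal sketch, not CryoSPARC's implementation: particles are plain `(micrograph, x_frac, y_frac)` tuples, and the helper names and the 5% edge margin are made up for illustration.

```python
from collections import Counter

def drop_sparse_micrographs(particles, min_particles=2):
    """Keep only particles whose micrograph has at least min_particles particles."""
    counts = Counter(mic for mic, _, _ in particles)
    return [p for p in particles if counts[p[0]] >= min_particles]

def drop_edge_particles(particles, margin=0.05):
    """Drop particles whose fractional position is within margin of an edge."""
    return [
        (mic, x, y)
        for mic, x, y in particles
        if margin <= x <= 1 - margin and margin <= y <= 1 - margin
    ]

particles = [
    ("mic_a", 0.50, 0.50),
    ("mic_a", 0.40, 0.60),
    ("mic_b", 0.50, 0.50),
    ("mic_b", 0.01, 0.50),  # too close to the left edge
]

# Screening first, edge rejection second: mic_b keeps a single particle.
buggy = drop_edge_particles(drop_sparse_micrographs(particles))
# Edge rejection first, screening second: mic_b is removed entirely.
fixed = drop_sparse_micrographs(drop_edge_particles(particles))

print(Counter(mic for mic, _, _ in buggy))
print(Counter(mic for mic, _, _ in fixed))
```

With the screen applied first, `mic_b` passes the two-particle check and then loses a particle to the edge filter, which is the kind of state the "less than two particles" sanity check would catch. Applying the screen after edge rejection removes `mic_b` before any worker sees it.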

Ah, that makes sense; I'll check and update as appropriate!

@rbs_sci did turning off particle recentering fix it? I'm getting this same bug (we are on v.4.5.3 now)

It didn't make any difference. :frowning:

I've got a couple of other test runs for a related issue running on different datasets at the minute, although I will want to finish the full pipeline as a sanity check before I post further (and it's a big dataset, so it isn't going very quickly…)

Hi everyone, I just want to say that I encountered the same issue, and disabled the recentering option, and it worked (at least so far). So in some cases, this apparently is enough to fix it.

Hm. I got some very odd dose weighting plots.



Definitely not trusting that. Will try something else.

I don't think the hyperparameter search converged correctly either; I used "Fast" because "Extensive" was taking hours to progress just 1% of a single iteration. The poor dose weighting is probably related to that.

Definitely odd-looking plots. I always run with Fast thoroughness and have never seen anything like that. Btw, what kind of resolution improvement are you typically getting by switching from Fast to Extensive? I suppose it depends on the dataset, but I've gotten nice 0.2 Å resolution improvements from RBMC on multiple datasets using just Fast.

This dataset has been difficult in general because of what it is, but I wasn't expecting miracles out of that test - although I wasn't expecting that result either…

Improvement depends on the dataset. I've seen effectively zero improvement (second decimal place territory) and as much as 0.5 Å on some data (where whole frame motion correction was used).

Normally there isn't much difference between estimation modes; the hyperparameters end up more or less the same. Sometimes the convergence plots are a little jumpy with fast, and much smoother with extensive.

More tests to do.

Hi @rbs_sci, I'm just chiming in to say that I am following this thread. I have no idea what happened with the FCC plot, but as you surmised, something is very wrong. It sounds like you're on it, but if you can post a couple of example particle plots, a description of the processing pipeline, and/or the hyperparam estimation plots, I may be able to advise.

The bug I mentioned earlier, where particle recentering is done after we screen for micrographs with less than 2 particles, has been fixed in the patch just released today. Happy processing.

Using v.4.5.3, the same error still occurred:
'fatal error: Specified micrograph has less than two particles.'
The message appeared at approximately the 74% mark during the cross-validation score computation.

@JinsungKim24 can you post your full CryoSPARC version? It appears on the CryoSPARC home page next to the logo.