Reference based motion correction fails with "h->magic == TALLOC_HEADER_MAGIC"

mcherrier · December 14, 2023, 8:38am

Hello everyone

I recently tested the new Reference based motion correction job with data collected on a Glacios microscope and had very good results (the resolution went from 3.4 Å to 2.97 Å!). Now I have new data of the same protein collected on a Krios microscope and I want to test the Reference based motion correction procedure again. Unfortunately, this time the job fails: it runs normally for the first 14 iterations, but when it starts to compute the empirical dose weights it stops with an error: “assertion failed: h->magic == TALLOC_HEADER_MAGIC”

[2023-12-09 10:40:16.48] [CPU: 53.81 GB] [Avail: 182.46 GB] ==================== BEGINNING ITERATION 14 ====================
[2023-12-09 10:40:16.48] [CPU: 53.81 GB] [Avail: 182.46 GB] Iteration overview (parameters to be tried):
--r-- -theta- --z-- | -spatial- -dist.- --accel--
10.000 -2.042 6.215 | 1.07e-02 500 1.35e-04
10.000 -2.147 6.215 | 4.31e-03 500 2.28e-04
10.000 -2.251 6.215 | 1.85e-03 500 4.22e-04
10.000 -2.356 6.215 | 8.49e-04 500 8.49e-04
[2023-12-09 10:40:16.48] [CPU: 53.81 GB] [Avail: 182.46 GB] Cross-validation scores computed:
776/776 (100%)
[2023-12-09 10:46:51.01] [CPU: 54.19 GB] [Avail: 181.99 GB] Terminating ray theta=-2.0420, z=6.2146 (reached maximum r value)
[2023-12-09 10:46:51.01] [CPU: 54.19 GB] [Avail: 181.99 GB] Terminating ray theta=-2.1468, z=6.2146 (reached maximum r value)
[2023-12-09 10:46:51.01] [CPU: 54.19 GB] [Avail: 181.99 GB] Terminating ray theta=-2.2515, z=6.2146 (reached maximum r value)
[2023-12-09 10:46:51.02] [CPU: 54.19 GB] [Avail: 181.99 GB] Terminating ray theta=-2.3562, z=6.2146 (reached maximum r value)
[2023-12-09 10:46:51.79] [CPU: 54.19 GB] [Avail: 181.99 GB] Exiting early, all search paths have reached the ~zero trajectory regime
[2023-12-09 10:46:59.12] [CPU: 2.17 GB] [Avail: 234.17 GB] Best hyperparameters:
Spatial prior strength: 1.0976e-02
Spatial correlation distance: 500
Acceleration prior strength: 5.3389e-02
[2023-12-09 10:46:59.26] [CPU: 2.17 GB] [Avail: 234.17 GB] --------------------------------------------------------------
STARTING: COMPUTE EMPIRICAL DOSE WEIGHTS

[2023-12-09 10:46:59.26] [CPU: 2.17 GB] [Avail: 234.17 GB] Using hyperparameters:
Spatial prior strength: 1.0976e-02
Spatial correlation distance: 500
Acceleration prior strength: 5.3389e-02
[2023-12-09 10:46:59.26] [CPU: 2.17 GB] [Avail: 234.17 GB] Using all FCs for doseweighting
[2023-12-09 10:46:59.32] [CPU: 2.18 GB] [Avail: 234.16 GB] Working with 317 movies containing 20035 particles
[2023-12-09 10:46:59.38] [CPU: 2.18 GB] [Avail: 234.14 GB] Movies processed:
[--------------------------------------------------------------------------------] 0/317 (0%)
[2023-12-09 10:47:07.77] [CPU: 207.6 MB] [Avail: 235.51 GB] ====== Job process terminated abnormally.
[2023-12-09 10:47:11.00] [CPU: 218.2 MB] [Avail: 235.53 GB] assertion failed: h->magic == TALLOC_HEADER_MAGIC

As inputs, I’m linking the particles, volume and mask from a NU-Refine job.
Does anyone know why the job failed and how I can fix it?

Thank you!

hsnyder · December 18, 2023, 7:04pm

Hi @mcherrier,

Does this error occur reliably on subsequent runs? (You can break the reference motion job into multiple jobs that run the different stages separately, so you don’t need to re-run the entire hyperparameter stage just to do a few tests of the empirical dose weight stage.) Another user who saw this issue reported that it was fixed with a system reboot.

However, it’s possible that you’ve found a bug. Technically, what that message indicates is that some block of memory didn’t contain what the job expected it to. Possible causes include hardware memory issues, application bugs (i.e. a problem in reference based motion correction, RBMC) or kernel bugs.
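
To illustrate what a "magic" check of this kind generally does, here is a minimal Python sketch. It is not CryoSPARC's code: the magic value, the header layout and the function names are all made up for the example. The idea is that each memory block starts with a known constant, and the program verifies that constant before trusting the block. If anything has overwritten the header, the check fails.

```python
import struct

# Hypothetical magic value and header layout, chosen only for illustration.
HEADER_MAGIC = 0xE8150C70
HEADER_FORMAT = "<IQ"  # 4-byte magic followed by an 8-byte payload size
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def allocate(size: int) -> bytearray:
    """Return a zeroed block whose header stores the magic value and size."""
    block = bytearray(HEADER_SIZE + size)
    struct.pack_into(HEADER_FORMAT, block, 0, HEADER_MAGIC, size)
    return block


def check_header(block: bytearray) -> int:
    """Validate the header magic and return the recorded payload size."""
    magic, size = struct.unpack_from(HEADER_FORMAT, block, 0)
    if magic != HEADER_MAGIC:
        # The header was overwritten, or this was never a valid block.
        raise RuntimeError("assertion failed: header magic mismatch")
    return size


if __name__ == "__main__":
    block = allocate(16)
    print("payload size:", check_header(block))

    block[0] ^= 0xFF  # simulate a stray write into the header
    try:
        check_header(block)
    except RuntimeError as err:
        print("corruption detected:", err)
```

A failed check like this only says that the header is wrong. It does not say whether faulty hardware, the application or the kernel caused it, which is why reproducibility matters for narrowing it down.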

If you can reliably reproduce this issue, please let me know, as that would strongly imply that it’s a reference motion bug.

– Harris

mcherrier · December 19, 2023, 9:04am

Hi @hsnyder

Thanks for your answer.
I have already broken the reference motion correction job into separate steps as you suggested, and I always get the same error. I have tested several things: increasing the “gpu_oversub_gb” value; increasing the “mem_cache_sz” value; reducing the number of particles/micrographs; reducing the box size; using 1 or 2 GPUs; changing which GPU the job runs on. Nothing works: I always get the same error at the “Compute empirical dose weights” step.
I also tried restarting the computer, and I still get the same error.

Do you want me to test something else? Do you need more information about my system/data?

Mickael

mcherrier · January 22, 2024, 12:46pm

Hi all,
I recently installed the latest patch for CryoSPARC (v4.4.1+240110) and re-tested the reference based motion correction job that failed previously. Unfortunately I still get the same error.
Have you made any progress on this bug? Do you have any idea how to fix it?

Thanks a lot

hsnyder · January 29, 2024, 4:55pm

Hi @mcherrier I’ve sent you a direct message about this issue.
– Harris

hsnyder · February 13, 2024, 4:22pm

In case anybody else encounters this, it has been determined that this is a bug in reference motion (v4.4). It occurs when the rigid motion trajectory (from the upstream motion correction job) is not defined for the first frame(s), i.e. the first frame (or first few frames) were discarded somewhere in an earlier motion-correction run. Until a fix is available, you can work around the issue by redoing the initial motion correction with all frames. I’d recommend patch motion correction and patch CTF, followed by a job to reassign the particles you have already picked to the new micrographs. You’ll have to re-extract the particles and re-do the initial refinement, since redoing patch motion will break the particle shifts.
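
As a rough sketch of the condition described above, the following Python snippet flags movies whose trajectory is missing for one or more leading frames. It is only an illustration: the input layout (one list of booleans per movie, `True` where a rigid trajectory point exists for that frame) and the function names are assumptions made for the example, not a CryoSPARC data format or API.

```python
from typing import Dict, List, Sequence


def leading_undefined_frames(defined: Sequence[bool]) -> int:
    """Count the frames at the start of a movie that have no trajectory point."""
    count = 0
    for is_defined in defined:
        if is_defined:
            break
        count += 1
    return count


def movies_needing_reprocessing(
    trajectories: Dict[str, Sequence[bool]],
) -> List[str]:
    """Return the movies whose first frame(s) lack a trajectory point."""
    return [
        name
        for name, defined in trajectories.items()
        if leading_undefined_frames(defined) > 0
    ]


if __name__ == "__main__":
    # Hypothetical example data: movie_b had its first two frames discarded.
    example = {
        "movie_a": [True, True, True, True],
        "movie_b": [False, False, True, True],
    }
    print(movies_needing_reprocessing(example))
```

Any movie reported by a check like this would be one where the upstream motion correction skipped the first frame(s), which is the case the workaround addresses.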

hsnyder · May 7, 2024, 4:14pm

Hi @mcherrier and anyone else who encounters this problem,

The issue has been fixed in CryoSPARC v4.5 which was released today.

– Harris

mcherrier · May 7, 2024, 4:15pm

Hi

Thank you!

Mickael