Reference based motion correction job is terminating abnormally during cross validation

Hello,

Could anyone help me figure out why my reference based motion correction is giving me this error? I’ve attached a screenshot that shows the error and when it occurs. Interestingly, the point at which it terminates seems random during cross validation - perhaps some images are tripping it up that didn’t cause problems during motion correction or CTF in CryoSPARC Live?

All of my data were processed within CryoSPARC; I used CryoSPARC Live for motion correction and CTF before exporting. I’ve picked particles, gone through multiple rounds of 3D classification/refinement, and obtained nice high-resolution structures (2.4 Å at best).

I have attempted two datasets. In the first, I extracted particles, sorted them by classification, and refined structures in C1; it has a large box size of 720 (1.048 Å/pix).

The second, more complicated case is one where my particles were refined in C2, symmetry expanded, then re-extracted centred on one end of the molecule and reprocessed through classification, particle subtraction and refinement.

I’m not sure how to diagnose this, as the only error message I get is that the job was terminated abnormally. I have updated drivers and tested 3D classification, refinement and Orientation Diagnostics, all of which work in v4.4.

To add a little to this: I have also tested the test dataset and everything runs smoothly. I used the job builder to build every job instead of CryoSPARC Live, and reference based motion correction works in that case.

So it’s likely some difference between the data sets.

Sincerely,
Connor Arkinson.

If you go to the job metadata:

and then click “Log”:

[screenshot]

and scroll down, you can usually see a more detailed error. In my case:

[screenshot]

What I’m finding, anecdotally, is that K3 frames work fine, but super resolution frames are basically impossible to get running so far. It does 5-6 and then OOMs. With 2 GPUs, it’s routinely hitting 230+ GB.

Yep, looks like mine is similar! Mine are super resolution with K3 also.

The furthest I’ve made it is by setting the GPU Memory Oversubscription to more memory than the GPU has, so that it only works on 1 movie per GPU…that job is still running at 10 minutes, whereas the others failed ~7 minutes in.

The longest for me is just over an hour, but it’s never made it through the cross validation. I do seem to get the heartbeat message throughout the job until it eventually terminates.

How much memory is in the computer? We have 256 GB, and basically I can run 2 GPUs, each motion correcting one SR movie at a time, and it eats the entire 256 GB, is what it’s looking like.
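If you want to watch this while a job runs, a couple of standard commands on the worker node (run in separate terminals) will show host RAM and GPU memory updating every few seconds, e.g.:

watch -n 5 free -h
watch -n 5 nvidia-smi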

Comp specs:
2x Intel Xeon Scalable Gold 5320
512 GB ECC Registered DDR4 DRAM
4x NVIDIA RTX 3090
960 GB SSD for boot
3x 3.84 TB NVMe SSD for scratch
8x 18 TB Enterprise HDD

I’ve tried 4 GPUs with two movies per GPU; it doesn’t appear to go over the max memory or give that specific message, which is why I thought it might have been a particular set of TIFFs.

Yeah, without better error messages, that’s tricky to troubleshoot. I’d probably try backing down to 1 per GPU just because it’s an easy thing to try, but it could be a host of other problems.

Hi @Connor, this does sound like a memory issue, possibly due to box size, movie frame count, or both. Some Linux commands which may help confirm that:

sudo journalctl | grep -i OOM
sudo dmesg | grep -i OOM

OOM refers to “out of memory” - Linux will kill processes that request more memory than is available. If you see OOM log entries from around the time your job failed, you can be pretty sure that’s what happened.
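If the journal is noisy, it can help to restrict the search to kernel messages from around the time of the failure; as a rough sketch (the time window here is just an example):

sudo journalctl -k --since "2 hours ago" | grep -i -E "out of memory|oom-kill|killed process"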

But I also noticed that the CryoSPARC log is reporting less than 250 GB RAM available when the job starts, and you’re saying the computer has 512 GB in total - suggesting that something else is also running on your server at the same time. I’d recommend:

  • check that no other processes are running
  • set the oversubscription threshold high (as suggested by @ccgauvin94)
  • reduce the size of the RAM cache to 20 GB or so
  • use fewer GPUs

and see what happens.

Hello,

Thanks for the reply.
No one else is using the computer, and I’ve ensured no other jobs are running.

I retried with the GPU oversubscription set really high so that it processes one movie at a time, used 1 GPU, and lowered the in-memory cache size to 20 GB.

The job still failed but this time I got extra error messages on termination.

This time it reports an issue with a random TIFF image. However, it’s odd, because if these images were corrupt they haven’t stopped anything else. In addition, when I inspect the micrographs they look fine; they are motion corrected and CTF fit. Perhaps there’s something else I should look at, but they appear to be valid TIFF files for all intents and purposes.

When I repeat the job, a different TIFF image is reported as the fault.

Connor

Hi Connor,

What happens if you go to the command line and run tiffinfo on that file? (e.g. tiffinfo /data/.../...Y-1-2.tif)
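If you want to rule out an unreadable file anywhere in the set, a quick (if slow) sketch is to loop tiffinfo over every movie and log any that fail outright - the glob below is just a placeholder for wherever your raw movies live:

for f in /data/movies/*.tif; do
    tiffinfo "$f" > /dev/null 2>&1 || echo "tiffinfo failed: $f"
done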

When I run the command I get the attached output, which continues for around 50 sections (one per frame).

Hi, I’m not sure what to try next. I’ve tried all sorts of compute settings.

I can’t seem to get RBMC to work with the datasets I’ve collected. None of the frames seem to bother RELION (including Bayesian polishing), CryoSPARC or tiffinfo, except for RBMC. I tried reprocessing the data, but still no luck; it went smoothly through patch motion, CTF, refinements etc.

RBMC does work with the T20S test dataset, so the installation of v4.4 should be fine.

Is there anything else I can try, or perhaps I can provide more details? Maybe even share some data for a test? It’s the same error each time, but always with a random TIFF.

Data are collected in super-resolution CDS mode, then I Fourier crop (0.5) during patch motion correction before picking and classification. Could it be related to binning the micrographs during motion correction?

Cheers,
Connor

Hi @Connor,

That is quite strange. Could you email us the job report? There may be more information in there that might shed some light on what’s going wrong.

Harris

I’m having a similar issue with an EER dataset, except it never gets anywhere at all (it starts cross validation, then dies before making any progress). The job log is full of “Unknown field tag 65002” warnings (standard for EER data; it would be nice if you could make a way to suppress that specifically, as the logs end up being tens of gigabytes of line after line of it with EER data!). Box size is 440 pixels, 4K sampling (EER upsampling 1), 1344 total EER frames, which all read successfully, but then the job heartbeats twice, reports complete for both main and monitor processes, and fails.
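As a stopgap when inspecting such a log, grepping the repeated warning out first keeps it manageable; the path below is just a placeholder for wherever the log text has been saved:

grep -v "Unknown field tag 65002" /path/to/job_log.txt | less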

dmesg shows Python has segfaulted in bin_motion.so with an error 6. Memory usage for the process never exceeds 32 GB on a server with 1 TB, even when I tell it it can use 500 GB+.

I’ve previously had 450 pixel boxes work fine with both 4K and 8K TIFF, this is the first time I’m trying EER data.

edit: The only thing other than EER that I can think of being the issue is perhaps too few particles per micrograph? I’m actually playing with a contaminant as a test run (~0.5% of the total dataset), so for 13,000 micrographs there are fewer than 40,000 particles. Still hits 2.4 Å, though, which both amuses and depresses me. :sweat_smile:

edit 2: Data are on a network drive (symlinked to a local directory), network utilisation appears healthy (although not fast - it’s reading one frame at a time?)

edit 3: The Python segfault is reproducible by one of my collaborators on a completely different dataset (also EER, however).

Hi @rbs_sci could you open a separate forum thread for that issue? Unless it also thinks your data is “Not a TIFF or MDI file, bad magic number 0”, etc. Thanks!

Hi @Connor, I had a look at the file that you sent and just as you’re finding, I don’t see anything wrong with it. Have you found that the reference motion jobs fail consistently on the same movie?

If you wouldn’t mind, please try a reference motion job with the number of GPUs set to 1 and the oversubscription memory threshold set to 100 GB. This will ensure only one movie is processed at a time. To make it faster, you can use exposure curation, CryoSPARC Tools, or a method of your choice to isolate the movie that’s failing and only feed that single movie into the reference motion job. Then clone that job and run it several times. Does it always fail?

Hi Harris, Thanks!
It doesn’t always show the error message; sometimes it just says “this job was terminated abnormally” without the “not a TIFF…”.
But when it does show a TIFF file error, it is seemingly random.

However! I did what you said, I took the single exposure that failed and re-ran RBMC with just that one movie and it worked!

I’m not sure how to get it to work with all of them - does this help with understanding the issue?

Hi @hsnyder, my apologies - I thought it might be relevant/related, as EER is basically just TIFF and, as @Connor says, the “bad magic number 0” error doesn’t always occur. Another topic created. Thanks. :slight_smile:


@Connor hmmmm very interesting… It’s possible your dataset somehow exposes a subtle bug. Have you tried the job on other datasets or just this one? Also, when you’re doing your runs with multiple movies, could you try the same steps I mentioned to get it to only process one at a time? (1 GPU, very high value for GPU oversubscription threshold).