Reference based motion correction job is terminating abnormally during cross validation

ccgauvin94 · November 10, 2023, 9:40pm

The furthest I’ve made it is by setting the GPU Memory Oversubscription to more memory than the GPU has, so that it only works on 1 movie per GPU…that job is still running at 10 minutes, whereas the others failed ~7 minutes in.

Connor · November 10, 2023, 9:45pm

The longest for me is just over an hour, but it’s never made it through the cross validation. I do seem to get the heartbeat message throughout the job until it eventually terminates.

ccgauvin94 · November 10, 2023, 9:53pm

How much memory is in the computer? We have 256 GB, and basically I can run 2 GPUs, each motion correcting one SR movie at a time, and it eats the entire 256 GB, is what it’s looking like.

Connor · November 10, 2023, 10:00pm

Comp specs:
2x Intel Xeon Scalable Gold 5320
512 GB ECC Registered DDR4 DRAM
4x NVIDIA RTX 3090
960GB SSD for Boot
3x 3.84 TB NVme SSD for scratch
8x 18 TB Enterprise HDD

I’ve tried 4 GPUs with two movies per GPU. It doesn’t appear to go over the max memory or give that specific message, which is why I thought it might have been a particular set of TIFFs.

ccgauvin94 · November 10, 2023, 10:02pm

Yeah, without better error messages, that’s tricky to troubleshoot. I’d probably try backing down to 1 per GPU just because it’s an easy thing to try, but it could be a host of other problems.

hsnyder · November 10, 2023, 11:36pm

Hi @Connor, this does sound like a memory issue, possibly due to box size, movie frame count, or both. Some Linux commands that may help confirm that:

sudo journalctl | grep -i OOM
sudo dmesg | grep -i OOM

OOM refers to “out of memory” - Linux will kill processes that request more memory than is available. If you see OOM log entries from around the time your job failed, you can be pretty sure that’s what happened.
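
A minimal sketch of that check as a reusable filter, assuming a bash shell. The function name `find_oom_events` is hypothetical, and the patterns are common kernel OOM-killer phrases rather than an exhaustive list. The kernel's "Out of memory: Killed process ..." line does not itself contain the string "oom", so the sketch matches that phrase too:

```shell
#!/usr/bin/env bash
# Print log lines that look like Linux OOM-killer activity.
# Reads log text on stdin, so it works with dmesg, journalctl, or a saved log.
# find_oom_events is a hypothetical helper name, not a CryoSPARC or system tool.
find_oom_events() {
    # -i: case-insensitive, -E: extended regex with alternation.
    # '|| true' keeps the exit status at 0 when nothing matches.
    grep -iE 'out of memory|oom-kill|oom_reaper' || true
}

# Example usage (commented out so that sourcing this file has no side effects):
#   sudo dmesg -T | find_oom_events
#   sudo journalctl -k | find_oom_events
```

If this prints nothing for the time window in which the job died, the kernel OOM killer is a less likely culprit, although GPU out-of-memory errors would not show up here.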

But I also noticed that the CryoSPARC log is reporting less than 250 GB RAM available when the job starts, and you’re saying the computer has 512 GB in total, which suggests that something else is also running on your server at the same time. I’d recommend:

check that no other processes are running
set the oversubscription threshold high (as suggested by @ccgauvin94)
reduce the size of the RAM cache to 20 GB or so
use fewer GPUs

and see what happens.
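
One way to confirm how much RAM is actually free before launching the job is to read `MemAvailable` from `/proc/meminfo`. This is a sketch assuming a Linux host; the function name `mem_available_gib` is hypothetical:

```shell
#!/usr/bin/env bash
# Print MemAvailable in GiB (rounded down) from /proc/meminfo-style text on stdin.
# /proc/meminfo reports values in kB, where 1 kB here means 1024 bytes.
# mem_available_gib is a hypothetical helper name.
mem_available_gib() {
    awk '$1 == "MemAvailable:" { print int($2 / 1024 / 1024) }'
}

# Example usage (commented out so that sourcing this file has no side effects):
#   mem_available_gib < /proc/meminfo
#   ps -eo pid,comm,rss --sort=-rss | head    # largest resident processes
```

If the reported figure is far below the installed RAM while no jobs are running, the `ps` listing in the comment may show which processes are holding memory.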

Connor · November 11, 2023, 1:12am

Hello,

Thanks for the reply.
No one else is using the computer, and I’ve also ensured no other jobs are running.

I retried with the GPU oversubscription threshold set really high to process one movie at a time, used 1 GPU, and lowered the in-memory cache size to 20 GB.

The job still failed but this time I got extra error messages on termination.

This time it has an issue with a random TIFF image. However, it’s odd, because if these images are corrupt, they haven’t stopped anything else. In addition, when inspecting the micrographs, they look fine: they are motion corrected and CTF fit. Perhaps there’s something else I should look at, but they appear to be TIFF files for all intents and purposes.

When repeating the job I get a different TIFF image as the fault.

Connor

hsnyder · November 13, 2023, 6:20pm

Hi Connor,

What happens if you go to the command line and run tiffinfo on that file? (e.g. tiffinfo /data/.../...Y-1-2.tif)
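
As a complementary check, the first four bytes of the file can be inspected directly, since that header is what TIFF readers validate first. A classic TIFF starts with `II*\0` (little-endian) or `MM\0*` (big-endian), and BigTIFF uses version 43 instead of 42. The sketch below assumes bash with the standard `head`, `od`, and `tr` utilities; `tiff_magic` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Classify a file by its first four bytes (the TIFF header magic).
# tiff_magic is a hypothetical helper name, not part of libtiff or CryoSPARC.
tiff_magic() {
    local hex
    # od prints the bytes as space-separated hex; tr strips spaces and newlines.
    hex=$(head -c 4 -- "$1" | od -An -tx1 | tr -d ' \n')
    case "$hex" in
        49492a00) echo "TIFF little-endian" ;;
        4d4d002a) echo "TIFF big-endian" ;;
        49492b00) echo "BigTIFF little-endian" ;;
        4d4d002b) echo "BigTIFF big-endian" ;;
        *)        echo "not TIFF (header bytes: ${hex:-none})" ;;
    esac
}

# Example usage (commented out; substitute the real path of the movie):
#   tiff_magic path/to/movie.tif
```

Running this repeatedly on the same file, or over the whole movie directory, might help distinguish a file that is damaged on disk (which should fail every time) from an intermittent read problem (which would fail only occasionally).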

Connor · November 13, 2023, 9:23pm

When I run the command, I get output that continues for around 50 sections (one per frame).

Connor · November 16, 2023, 11:39am

Hi, I’m not sure what to try next. I’ve tried all sorts of compute settings.

I can’t seem to get RBMC to work with the datasets I’ve collected. None of the frames seem to bother RELION (including Bayesian polishing), CryoSPARC or tiffinfo; only RBMC has a problem. I tried reprocessing the data but still no luck. It went smoothly through patch motion, CTF, refinements, etc.

RBMC does work with the T20S test dataset, so the installation of v4.4 should be fine.

Is there anything else I can try, or perhaps I can provide more details? Maybe even share some data for a test? It’s the same error each time, but always with a random TIFF.

Data is collected in super resolution-CDS, then I Fourier crop (0.5) during patch motion correction before picking and classification. Could it be to do with binning the micrographs during motion correction?

Cheers,
Connor

hsnyder · November 16, 2023, 5:00pm

Hi @Connor,

That is quite strange. Could you email us the job report? There may be more information in it that sheds some light on what’s going wrong.

Harris

rbs_sci · November 17, 2023, 8:58am

I’m having a similar issue with an EER dataset, except it never gets anywhere at all: it starts cross validation, then dies before any progress. The job log is full of “Unknown field tag 65002” warnings. That is standard for EER data, but it would be nice to have a way to suppress that warning specifically, as the logs end up being tens of gigabytes of line after line of it with EER data! Box size is 440 pixels, 4K sampling (EER upsampling 1), 1344 total EER frames, which all read successfully. The job then heartbeats twice, reports complete for both the main and monitor processes, and fails.

dmesg shows Python has segfaulted in bin_motion.so with an error 6. Memory usage for the process never exceeds 32GB on a server with 1TB, even when I tell it it can use 500GB+.

I’ve previously had 450 pixel boxes work fine with both 4K and 8K TIFF, this is the first time I’m trying EER data.

edit: The only thing I can think of other than EER being the issue is that perhaps there are too few particles per micrograph? I’m actually playing with a contaminant as a test run (~0.5% of the total dataset), so for 13,000 micrographs there are fewer than 40,000 particles. Still hits 2.4Å, though, which both amuses and depresses me.

edit 2: Data are on a network drive (symlinked to a local directory), network utilisation appears healthy (although not fast - it’s reading one frame at a time?)

edit 3: The Python segfault is reproducible by one of my collaborators on a completely different dataset (also EER, however).

hsnyder · November 17, 2023, 7:35pm

Hi @rbs_sci could you open a separate forum thread for that issue? Unless it also thinks your data is “Not a TIFF or MDI file, bad magic number 0”, etc. Thanks!

hsnyder · November 17, 2023, 7:39pm

Hi @Connor, I had a look at the file that you sent and just as you’re finding, I don’t see anything wrong with it. Have you found that the reference motion jobs fail consistently on the same movie?

If you wouldn’t mind, please try a reference motion job with the number of GPUs set to 1 and the oversubscription memory threshold set to 100GB. This will ensure only one movie is processed at a time. To make it faster, you can use exposure curation, cryosparc tools, or a method of your choice to isolate the movie that’s failing and only feed that single movie into the reference motion job. Then clone that job and run it several times. Does it always fail?

Connor · November 17, 2023, 8:18pm

Hi Harris, Thanks!
It doesn’t always show the error message; sometimes it just says “this job was terminated abnormally” without the “not a TIFF…” part.
But when it does show a TIFF file error, it is seemingly random.

However! I did what you said, I took the single exposure that failed and re-ran RBMC with just that one movie and it worked!

I’m not sure how to get it to work with all, does this help understand the issue?

rbs_sci · November 17, 2023, 9:38pm

Hi @hsnyder, my apologies, thought it might be relevant/related as EER is basically just TIFF and as @Connor says, the “bad magic number 0” error doesn’t always occur. Another topic created. Thanks.

hsnyder · November 17, 2023, 9:58pm

@Connor hmmmm very interesting… It’s possible your dataset somehow exposes a subtle bug. Have you tried the job on other datasets or just this one? Also, when you’re doing your runs with multiple movies, could you try the same steps I mentioned to get it to only process one at a time? (1 GPU, very high value for GPU oversubscription threshold).

Connor · November 17, 2023, 10:20pm

OK, with all movies for a certain particle stack of around 12K particles, I used 1 GPU, an oversubscription threshold of 100 GB, and an in-memory cache size of 80 GB. It failed within 5 minutes during cross validation with the same error on a different TIFF. I isolated that movie using Curate Exposures and it worked again.

I guess an issue is that my complex has a lot of conformations, so it ends up diluting the particles: on a micrograph of 150 particles, only a few might belong to one class, so it probably ends up taking around 1000 micrographs to get enough particles for a high-res reconstruction.
So for example, “Working with 2399 movies containing 5673 particles”: maybe it’s just too much for it to handle?
The only other dataset that worked was the T20S test dataset, but that has very few micrographs. Perhaps that’s suggestive of the problem?

I tried another dataset but the conditions are the same in terms of collection, with the same sort of error.

larsonmattr · November 21, 2023, 6:18pm

We were having a similar issue with EER images when processing on NVIDIA RTX 6000 GPUs with 24 GB of memory. The resulting error in job.log.err was “DIE: [refmotion worker 6 (Quadro RTX 6000)] ERROR: cuMemAlloc(size=17392893120): CUDA ERROR: (CUDA_ERROR_OUT_OF_MEMORY) out of memory”.

Raising the oversubscription threshold to 100 GB appears to have helped the reference based motion correction job pass through the initial cross validation steps successfully.

The default oversubscription threshold of 20 GB (< 24 GB) may have caused the job to try to hold multiple micrographs in GPU memory at the same time, causing the failure. Raising the value avoided this.
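
The arithmetic behind that diagnosis can be sketched as follows. The only measured number is the failed `cuMemAlloc` size from the error message. Treating it as a per-movie requirement, and treating the card as having a full 24 GiB usable, are simplifying assumptions, and both helper names are hypothetical:

```shell
#!/usr/bin/env bash
# Convert a byte count to GiB with two decimals.
# bytes_to_gib and fits_on_gpu are hypothetical helper names.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# Succeed (exit 0) if N allocations of BYTES each fit within GPU_GIB.
# usage: fits_on_gpu BYTES N GPU_GIB
fits_on_gpu() {
    awk -v b="$1" -v n="$2" -v g="$3" \
        'BEGIN { exit !(b * n <= g * 1024 * 1024 * 1024) }'
}

# Example usage (commented out so that sourcing this file has no side effects):
#   bytes_to_gib 17392893120                     # size of the failed allocation
#   fits_on_gpu 17392893120 1 24 && echo "one movie fits"
#   fits_on_gpu 17392893120 2 24 || echo "two movies do not fit"
```

Under those assumptions, a single allocation of about 16.2 GiB fits on a 24 GiB card but two do not, which is consistent with the job succeeding once only one movie is processed per GPU.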

hsnyder · November 21, 2023, 8:17pm

As an update on this, Connor is currently diagnosing a possible hardware problem.

@larsonmattr it sounds like you’ve already found a workaround for your GPU memory exhaustion issue, but for what it’s worth I agree with your diagnosis and would have suggested exactly that workaround.

–Harris