Particle Consolidation very slow

ZTBioPhysics · April 12, 2023, 1:44pm

Hello,

I have two questions:

1. Why is particle consolidation necessary? Why would the memory requirements (say, for caching to SSD) depend on the initial number of particle picks rather than on the actual size of the particle stack being processed (for instance, after inspect picks, extraction, 2D classification, and subset selection)? This seems odd to me.
2. As it stands, I read that I can use the downsample job to “consolidate” my particle stack to overcome this issue. I am noticing that this job is quite slow, or at least slower than I would expect from what I understand it to be doing, that is, rewriting metadata files. Could this be a cryoSPARC configuration issue, or is this job just slow? My initial picks were ~1.5M, reduced to ~700k after inspect picks, extraction, 2D classification, and subset selection.

Thank you

ccgauvin94 · April 12, 2023, 3:18pm

Are you talking about re-stacking particles?

Imagine you have 10 micrographs, each with 1000 particles. cryoSPARC extracts those particles to 10 particle stacks, each with 1000 particles. Now imagine that you do a 2D classification, and select half those particles. 500 from each stack. Now you have 5000 particles you want to use. You can continue loading each stack of 1000 particles every time you want your 500, or you can make new stacks of half the size. Everything downstream will be faster if you do the consolidation, since you’ll only be loading the particles you are using, not all the particles that were originally extracted from that micrograph. At least, that’s my understanding based on the published documentation.
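To put rough numbers on the example above, here is a minimal sketch of the arithmetic. The box size and pixel type are assumptions chosen for illustration; they are not taken from this thread or from cryoSPARC itself.

```python
# Toy numbers from the example above: 10 micrographs, 1000 particles each,
# half of them kept after 2D classification.
BOX = 256            # hypothetical box size in pixels (assumption)
BYTES_PER_PIXEL = 4  # assuming 32-bit float pixels (assumption)


def stack_bytes(n_particles, box=BOX, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size of the image data in one particle stack, ignoring the file header."""
    return n_particles * box * box * bytes_per_pixel


n_micrographs = 10
extracted_per_stack = 1000
selected_per_stack = 500

# Data that must be loaded when the original stacks are still referenced.
original = n_micrographs * stack_bytes(extracted_per_stack)
# Data that must be loaded after writing new, smaller stacks.
consolidated = n_micrographs * stack_bytes(selected_per_stack)

print(f"original stacks:     {original / 1e9:.2f} GB")
print(f"consolidated stacks: {consolidated / 1e9:.2f} GB")
```

With these assumed values the consolidated stacks are exactly half the size, which is the saving described above.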

Downsampling isn’t just rewriting metadata files. To downsample, the job has to load each particle stack (again, containing all of the particles originally extracted from that micrograph), then resample each particle you are using, whether by combining neighboring pixels, applying a Fourier filter, or however you are downsampling, and then write the result out to a new file.
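As a rough illustration of why this costs more than a metadata edit, here is a minimal sketch that uses simple 2x2 real-space binning as a stand-in. This is an assumption for illustration only; it is not a claim about how cryoSPARC's downsample job resamples particles, and the function names are hypothetical.

```python
def bin2x2(image):
    """Downsample a 2D image (a list of rows) by averaging each 2x2 pixel block.

    An odd trailing row or column is dropped.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def downsample_selected(stack, selected):
    """Bin only the selected slices.

    The whole stack is passed in, mirroring the point that the full original
    file has to be read even when only some particles are kept.
    """
    return [bin2x2(stack[i]) for i in selected]


# A tiny stack of three 2x2 "particles".
stack = [
    [[1, 2], [3, 4]],  # particle 0
    [[0, 0], [0, 8]],  # particle 1 (not selected below)
    [[4, 4], [4, 4]],  # particle 2
]
print(downsample_selected(stack, [0, 2]))
```

Every selected pixel is touched and a new image is written, so the cost scales with the image data rather than with the size of the metadata.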

If you go look at the output of an extract particles job, you’ll see each micrograph gets a corresponding .mrc file, and that .mrc file is a 3D stack, each slice containing one of the particles extracted from that micrograph. That is the crux of the problem here, as I understand it. “Consolidating” particles basically means making a new file containing just the particles you are using from each stack, not all the particles that were extracted.
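A minimal sketch of that idea, using plain Python lists in place of real .mrc stacks. The `restack` function, the `(path, slice_index)` metadata layout, and the `_restacked` file naming are all hypothetical and only meant to show the bookkeeping involved.

```python
def restack(stacks, selection):
    """Copy only the selected slices into new, smaller stacks.

    stacks:    dict mapping stack path -> list of particle images (one per slice)
    selection: list of (stack_path, slice_index) pairs for the particles to keep
    Returns (new_stacks, new_selection), where new_selection points into the
    new stacks instead of the original ones.
    """
    new_stacks = {}
    new_selection = []
    for path, idx in selection:
        new_path = path.replace(".mrc", "_restacked.mrc")  # hypothetical naming
        out = new_stacks.setdefault(new_path, [])
        out.append(stacks[path][idx])  # requires reading the original stack
        new_selection.append((new_path, len(out) - 1))
    return new_stacks, new_selection


# Two toy stacks; strings stand in for particle images.
stacks = {"m1.mrc": ["a", "b", "c", "d"], "m2.mrc": ["e", "f"]}
selection = [("m1.mrc", 1), ("m1.mrc", 3), ("m2.mrc", 0)]

new_stacks, new_selection = restack(stacks, selection)
print(new_stacks)
print(new_selection)
```

Note that both the image data and the metadata change: the new metadata has to point at the new files and the new slice positions.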

ZTBioPhysics · April 12, 2023, 3:50pm

Thanks for the response. For more context about “consolidating”, see under “tips and tricks”:

I am not actually downsampling my particles. They suggest using the downsample job, just without actually doing any downsampling. I suspect you are correct, though, regarding your response to my first question. I see why this would be slow if the original stacks have to be loaded. I thought perhaps this could be accomplished simply by changing the metadata files.

To clarify the issue I was having, I noticed that the amount of cache my jobs were requesting was much larger than would be expected for the number of particles, and that the memory requested for cache was the amount needed for the entire original set of particles, even after many rounds of subset selection. I was incorrectly assuming that subset selection took care of this (i.e., created new stacks). In the link above, they even say this happens after performing pick inspection, which is odd to me as the particles have not even been extracted yet.

ccgauvin94 · April 12, 2023, 5:10pm

Hmm, yes, that could be it. There is a “ReStack Particles” job - perhaps try that? Either way, I think it will need to open each full particle stack to copy out the particles it needs, so there’s probably no quick way to do it.

“To clarify the issue I was having, I noticed that the amount of cache my jobs were requesting was much larger than would be expected for the number of particles and that the memory requested for cache was the amount needed for the entire original set of particles, even after many rounds of subset selection.”

This makes sense then, if it’s loading all the extracted particles, and then just using the ones that you have selected via subset selection.
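A toy model of that behavior, assuming the cache copies every stack file that the selected particles reference, in full. This is an assumption consistent with what is described in this thread, not cryoSPARC's actual caching code, and the names are hypothetical.

```python
def cache_bytes_needed(selection, file_sizes):
    """Sum the sizes of every distinct stack file the selection references.

    selection:  list of (stack_path, slice_index) pairs
    file_sizes: dict mapping stack_path -> file size in bytes
    """
    files = {path for path, _ in selection}
    return sum(file_sizes[path] for path in files)


# Three original stacks of equal size (sizes are made up).
file_sizes = {"m1.mrc": 1000, "m2.mrc": 1000, "m3.mrc": 1000}

# Keeping a single particle from each stack still references all three files.
sparse_selection = [("m1.mrc", 0), ("m2.mrc", 5), ("m3.mrc", 9)]
print(cache_bytes_needed(sparse_selection, file_sizes))

# Keeping every particle from one stack references only that file.
dense_selection = [("m1.mrc", i) for i in range(10)]
print(cache_bytes_needed(dense_selection, file_sizes))
```

Under this model the cache request depends on which files are referenced, not on how many particles are kept, so subset selection alone does not shrink it.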

“In the link above, they even say this happens after performing pick inspection, which is odd to me as the particles have not even been extracted yet.”

The wording they use is a bit confusing, but:

You might run into this situation if you ran an “Inspect Picks” job after an “Extract From Micrographs” job, and you modified the picking thresholds of your particles to include a smaller subset than the original stack.

They are talking about running inspect picks after extracting, which is probably a bit of an unusual workflow and seems counterintuitive, at least to me. I could see how you get mixed up there.

ZTBioPhysics · April 12, 2023, 6:59pm

I didn’t see the re-stack job type; I’ll look into that, thanks. And yes, I see now that they say after extraction, which makes sense. I’ve never run inspect picks after extraction, but now that I know I can, I can see how this could be useful for tracking down which micrographs the good particles came from and potentially doing a second round of selection based on this.

Cheers.

CryoEM2 · April 13, 2023, 1:16pm

This exact issue is the purpose of the (newly released) restack particles job, as suggested above: to “forget” the parent stack and regenerate a metadata file for only a sub-stack, which improves caching. You may need to update.