Questions about heterogeneous refinement strategy to clean up particle stack

cbeck · January 15, 2024, 9:19pm

Hi! I really appreciate everybody’s contribution to this forum, and it’s been really useful to me as someone who’s just starting to get into data processing. This is my first post, so please let me know if I’m breaking any guidelines or if there’s anything I can do to communicate my questions more effectively.

I’m trying to develop a strategy for removing junk particles from my dataset. Previous threads have mentioned that 2D classification should only be used to remove obvious high-contrast junk, and that iterative rounds of Heterogeneous Refinement are the best way to sort out the rest of the junk. Based on these threads, I’ve developed a workflow in which I run 4 identical Heterogeneous Refinement jobs in parallel. Each job has initial references that come from good ab-initio classes, as well as junk references that come from an early iteration of a refinement job. Once the first 4 jobs finish, I pool all of the particles that went into at least one of the “good” classes, and use the best volumes to start another round of 4 identical Heterogeneous Refinement jobs, again with junk references.
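To make the pooling step concrete, here is a minimal sketch in plain Python of what I mean by "went into at least one good class". It is only an illustration: the function name `pool_good_particles`, the dictionary layout, and the particle IDs are all made up for this example, and it assumes each job's class assignments have already been exported as sets of particle IDs, with the good references in the same class slots in every replicate job.

```python
from typing import Dict, List, Set


def pool_good_particles(
    jobs: List[Dict[int, Set[int]]],
    good_classes: Set[int],
) -> Set[int]:
    """Return the union of particle IDs assigned to any good class in any job.

    jobs: one dict per replicate job, mapping class index -> particle IDs
          (hypothetical layout, not a real export format).
    good_classes: class indices seeded with good references (assumed to be
          the same indices in every replicate job).
    """
    pooled: Set[int] = set()
    for class_to_ids in jobs:
        for class_idx, particle_ids in class_to_ids.items():
            if class_idx in good_classes:
                pooled |= particle_ids  # keep a particle if ANY job called it good
    return pooled


# Toy example: two replicate jobs, classes 0 and 1 are "good", class 2 is junk.
replicate_jobs = [
    {0: {1, 2, 3}, 1: {4, 5}, 2: {6}},
    {0: {1, 2}, 1: {3, 4}, 2: {5, 6}},
]
pooled = pool_good_particles(replicate_jobs, good_classes={0, 1})
print(sorted(pooled))  # [1, 2, 3, 4, 5]
```

Particle 5 is kept even though the second job sent it to the junk class, and particle 6 is dropped because no job placed it in a good class. Taking the union is the permissive choice; an intersection would be stricter but risks discarding good particles.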

Does this strategy sound reasonable? If anybody uses a similar workflow, I’d appreciate it if you could share some details on your strategy. Specifically:

How many good and junk classes do you use?
Do you use identical references for multiple classes?
For the junk classes, do you only use references with random noise, or do you also use junk references from a previous round of classification?
How do I determine when to stop? I’ve been running a 2D classification after every round of Heterogeneous Refinement to check how clean my particle stack is and to verify that I’m not accidentally throwing out good particles along with the junk.

On a deeper level, I’m still trying to develop a classification strategy for 1) sorting out junk particles (e.g. ice contamination, foil edges, aggregates), 2) identifying compositional heterogeneity (classifying subcomplexes), and 3) identifying conformational heterogeneity.

Is there an order I should follow? (e.g. sort out junk first, then identify compositional heterogeneity, and finally look for conformational heterogeneity)
If I expect compositional and conformational heterogeneity, should I have a starting reference for each composition/conformation, or would identical copies of the same consensus reference be sufficient for heterogeneous refinement to identify different compositions/conformations?

If anybody uses a completely different workflow, I’d be happy to hear about it as well.

Edit: I just came across user olibclarke’s excellent exploratory data processing guide. It describes a strategy called “decoy classification” as well as methods for generating decoy volumes for heterogeneous refinement, both of which are relevant to some of my questions.