General "protocol" for cleaning junk out of 2D classification and searching for rare poses

Hi everyone,

I am wondering if anyone has suggestions for cleaning up 2D classifications or searching for sparsely populated poses.

I figured the most general approach would be broad classification: search for a large number of classes with high uncertainty, pick the good particles, and redo. This wouldn't necessarily search for new poses, but it would effectively remove junk each iteration.

Any suggestions are appreciated!

Therein lies the art… Knowing what to include and what to exclude.

One idea is that you can bootstrap yourself somewhat by making 2D projections of a 3D map, and making sure you do not exclude different views that maybe you didn’t recognize.

To clarify, you often know the general shape of your molecule ahead of time, like you’re imaging some variant of a known structure. Alternatively, once you’ve gotten part way through a project you have a general idea of the shape, but want to go back and improve your coverage of different views.

Given a 3D map (whether from an intermediate refinement, or from a published PDB), you can make 2D projections using EMAN2 (a useful tool for your toolkit). The exact command is as follows:

e2project3d.py inputVolume.mrc --outfile=yourProjection.mrcs --orientgen=eman:delta=5 --sym=c1 --projector=standard --verbose=2 --parallel=thread:#cores

Replace #cores with the number of CPU threads you want to use.

Note the .mrcs suffix on the output (to make a stack of 2D images, rather than a 3D volume).

This makes an approximately even series of projections, with neighboring views separated by an angular step of delta (in degrees).

Replace delta=N with n=N to specify the approximate number of orientations to generate, rather than the spacing between them. Twenty is usually sufficient.
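If you'd rather script this, here is a minimal sketch that assembles the same command with either delta or n. The helper name build_projection_command and its arguments are my own (hypothetical), the flags are the ones from the command in this post, and I am assuming the program is e2project3d.py, as mentioned elsewhere in this thread. It only builds the argument list; it does not run EMAN2.

```python
import os


def build_projection_command(volume, out_stack, delta=None, n=None,
                             sym="c1", threads=None):
    """Return the e2project3d.py argument list (hypothetical helper).

    Exactly one of `delta` (angular step in degrees) or `n` (approximate
    number of orientations) must be given.
    """
    if (delta is None) == (n is None):
        raise ValueError("give exactly one of delta or n")
    spacing = f"delta={delta}" if delta is not None else f"n={n}"
    # Fall back to the machine's core count when no thread count is given.
    threads = threads or os.cpu_count() or 1
    return [
        "e2project3d.py", volume,
        f"--outfile={out_stack}",          # .mrcs gives a stack of 2D images
        f"--orientgen=eman:{spacing}",
        f"--sym={sym}",
        "--projector=standard",
        "--verbose=2",
        f"--parallel=thread:{threads}",
    ]


cmd = build_projection_command("inputVolume.mrc", "yourProjection.mrcs",
                               n=20, threads=4)
print(" ".join(cmd))
```

You could pass the resulting list to subprocess.run(cmd, check=True) in an environment where EMAN2 is on your PATH.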

To view the projections, use the command:

e2display yourProjection.mrcs

You may need to middle-click on the projections and adjust the brightness/contrast to make the projections look like the 2D class averages.

With this in hand, you can go back to your 2D class averages and have more confidence in deciding if a particular class is junk, or is a legit view you didn’t recognize before.



Junk, junk, junk. It’s always good to get rid of junk particles.

Best not to pick them in the first place, but that’s a separate discussion…

After a few rounds of 2D classification, my favorite tool to get rid of junk particles is a multi-class ab initio refinement. For me, the magic number is four. Set the ab initio to have four classes. Then clone that job three times and run them all (our nodes have four GPUs each, so we can run four jobs simultaneously). At the beginning stages this will usually give you one good class and three junk classes. You get to decide.

It’s the nature of the math that four identical ab initio jobs, with the same input particle stack, will not give identical results.

Then you pick the good classes from all four jobs, and put them all into a single homogeneous refinement. This will not create duplicate particles (provided there were no duplicates in the first place).

If you think about the logic, this is what you are doing. You are keeping any particle that any one of the four ab initio jobs sorted into a good class. And you are discarding all the particles that all four jobs agreed are junk particles. This is a conservative way to discard junk particles without throwing away good particles.
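To make that logic concrete, here is a toy sketch in plain Python sets. The particle IDs and the per-job "good" sets are made up for illustration, and the function names are hypothetical; this is not how cryoSPARC stores particles, just the keep/discard rule described above.

```python
def merge_good_particles(good_sets):
    """Keep any particle that at least one job put in a good class (union)."""
    kept = set()
    for good in good_sets:
        kept |= set(good)
    return kept


def discarded_particles(all_ids, good_sets):
    """Discard only particles that every job left out of its good classes."""
    return set(all_ids) - merge_good_particles(good_sets)


# Toy input stack of ten particles, identified by integer ID (made up).
all_ids = set(range(10))

# Good-class picks from four ab initio jobs on the same stack (made up).
job_good = [
    {0, 1, 2, 3},
    {1, 2, 3, 4},
    {0, 2, 3, 5},
    {2, 3, 4},
]

kept = merge_good_particles(job_good)
junk = discarded_particles(all_ids, job_good)
print(sorted(kept))
print(sorted(junk))
```

Because the kept particles are a set union, a particle picked by several jobs appears only once, which matches the point about not creating duplicates.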

The homogeneous refinement allows you to evaluate the results.

Repeat as many times as you have patience for, with the output of the homogeneous refinement going into four new four-class ab initio refinements.

When I do this, I find that the first few rounds will improve the resolution of the refinements. In subsequent rounds the ab initio refinements may start to give two or three “good” classes with minor differences. It’s up to you what to keep and discard. Further rounds of this technique may not change the resolution of the homogeneous refinements, or may even make the stated resolution worse by 0.1-0.2 Å; however, a careful look at the maps shows them to actually be getting better, and more homogeneous (IMO).

Stop when you’ve hit diminishing returns or are genuinely making things worse.

Have you tried turning off force max over poses and shifts in 2D? If not, try:
force max over poses and shifts - OFF
batch size per class - 300
num iterations - 35-40
num classes - 100-200
This will take much longer than usual, but still just a few hours for a large dataset that is reasonably binned, in my experience.

Is there a functional difference between e2project3d and the Create Templates function built into cryoSPARC?