Forced symmetry in 2D classification

mwaxham · April 5, 2020, 6:12pm

I have a negative stained data set of particles that seem to have mixed 6- and 7-fold symmetry through the preferred axis. Is there a way to force symmetry on the 2D classification to see if I can falsely bias the classification to test the question of 6- or 7-fold symmetry?

apunjani · April 6, 2020, 2:59pm

Hi @mwaxham, this is an interesting problem - to clarify, you believe there may be a mixture of 6- and 7-fold symmetric configurations, and want to separate them?

2D classification does not support any symmetry (since symmetry really only works in 3D, where you can have an axis etc.), so the best way to deal with this would be to move on to ab-initio reconstruction with multiple classes. Even 2 classes may be enough to separate the two symmetries if both are present.

mwaxham · April 6, 2020, 3:25pm

Yes, this is the issue. In this case, unfortunately for a 3D effort, the particles take a strong preferred orientation, so nearly every particle presents looking down this 6- or 7-fold axis. It would be an interesting test to force either a D6 or D7 symmetry (these are dodecamers or tetradecamers) onto the data in the 2D classification routine and see how the particles distribute in each case.

I will try your suggestion of using 2 classes in the ab-initio reconstruction routine.

rj.edwards · April 7, 2020, 4:25am

I’ve started calling it restricted or limited orientation, rather than preferred orientation. Because it’s not preferred. It’s exactly what I don’t prefer.

I do a lot of 2D class averages with negative stain data. If the stain, the microscopy, and the particle picking are good you should be able to distinguish 6-fold from 7-fold with unsupervised 2D class averaging, no problem. That is, if your dataset is large enough; say at least 10,000 particles, but 100,000 is much better.

You can use the occupancy of each class as a rough estimate of the ratio of 6-fold to 7-fold. But this is only a rough number and I would never try to put much emphasis on it. The class averaging is by nature stochastic. I’ve made a semi-systematic study of it, and it can vary by ~5%, in my experience. Plus great looking class averages can contain lots of particles that are junk (why particle picking is important).
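As a rough illustration of the occupancy estimate described above, here is a minimal Python sketch. The class names, particle counts, and labels are all hypothetical placeholders, not data from this thread; it assumes you have already inspected each class average by eye and labelled it as 6-fold, 7-fold, or junk.

```python
def occupancy_fractions(counts, labels):
    """Fraction of non-junk particles assigned to each symmetry label.

    counts: {class_name: particle_count}
    labels: {class_name: "6-fold" | "7-fold" | "junk"} (assigned by eye)
    Classes without a label are treated as junk.
    """
    totals = {}
    for cls, n in counts.items():
        label = labels.get(cls, "junk")
        if label == "junk":
            continue  # junk classes are excluded from the ratio
        totals[label] = totals.get(label, 0) + n
    kept = sum(totals.values())
    if kept == 0:
        raise ValueError("no particles left after removing junk classes")
    return {label: n / kept for label, n in totals.items()}


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    counts = {"class_01": 3000, "class_02": 1000, "class_03": 500, "class_04": 500}
    labels = {"class_01": "6-fold", "class_02": "7-fold",
              "class_03": "6-fold", "class_04": "junk"}
    for label, frac in sorted(occupancy_fractions(counts, labels).items()):
        print(f"{label}: {frac:.1%}")
```

Given the roughly 5% run-to-run variation mentioned above, it is probably worth repeating the classification a few times and looking at the spread of these fractions rather than trusting a single run.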

Dare I say it on this forum? I prefer Relion for 2D classification. I think maybe for the sole reason I can do it slow (all particles on first iteration) or fast (subset of particles for initial iterations), whereas cryoSPARC is always fast. In my experience the fast protocol gives more class collapse (everything in a few classes) and the slow version spreads the particles across classes more evenly. The experts (which I’m not) would have to chime in about the detailed differences between the two packages in how they handle the classification. Something about marginalizing.

BTW, don’t try to do the 2D classification with 2 classes. Better to have 20 classes and see which ones are 6- versus 7-fold. That also gives you room (classes) for the junk to sort into.
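If you want a less subjective way to "see which ones are 6- versus 7-fold", one option is a rotational harmonic analysis of each class average. This is not something proposed in this thread, and the function names below are hypothetical. The sketch samples the image on a circle about its centre and compares the amplitude of the 6th and 7th angular harmonics. It assumes the class average is well centred and that the chosen radius passes through the subunits.

```python
import cmath
import math


def ring_profile(image, radius, n_samples=360):
    """Sample a square image (list of rows) on a circle about its centre (nearest pixel)."""
    centre = (len(image) - 1) / 2.0
    profile = []
    for k in range(n_samples):
        theta = 2.0 * math.pi * k / n_samples
        x = int(round(centre + radius * math.cos(theta)))
        y = int(round(centre + radius * math.sin(theta)))
        profile.append(image[y][x])
    return profile


def harmonic_amplitude(profile, order):
    """Amplitude of the given angular harmonic (one DFT coefficient of the profile)."""
    n = len(profile)
    coeff = sum(v * cmath.exp(-2j * math.pi * order * k / n)
                for k, v in enumerate(profile))
    return abs(coeff) / n


def best_fold(image, radius, candidates=(6, 7)):
    """Return the candidate symmetry order with the strongest angular harmonic."""
    profile = ring_profile(image, radius)
    return max(candidates, key=lambda order: harmonic_amplitude(profile, order))


def synthetic_ring(size, n_fold, radius, sigma=3.0):
    """Toy test image: n_fold Gaussian blobs evenly spaced on a ring."""
    centre = (size - 1) / 2.0
    blobs = [(centre + radius * math.cos(2.0 * math.pi * j / n_fold),
              centre + radius * math.sin(2.0 * math.pi * j / n_fold))
             for j in range(n_fold)]
    return [[sum(math.exp(-((x - bx) ** 2 + (y - by) ** 2) / (2.0 * sigma ** 2))
                 for bx, by in blobs)
             for x in range(size)]
            for y in range(size)]


if __name__ == "__main__":
    for n_fold in (6, 7):
        image = synthetic_ring(64, n_fold, radius=20.0)
        print("synthetic", n_fold, "-fold ring scored as", best_fold(image, 20.0))
```

Real negative-stain class averages are noisier than this toy image, so treat the result as a hint to check by eye, not as a classifier.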

apunjani · April 7, 2020, 3:18pm

Hi @rj.edwards, thanks for your detailed comments and insights!
I think you are right about how to handle this case of distinguishing different symmetries in 2D classification.

In terms of “slow” vs. “fast” classification: you are also right that by default, cryoSPARC is tuned to be as fast as possible. However, this is easy to change by modifying two parameters:

“Batch size per class”: controls the number of particles (multiplied by the number of classes) used in early iterations. Set this to a large number (10000) to slow down classification.
“Force max over poses/shifts”: turn this off to enable marginalization, which helps in some cases to resolve smaller molecules or more subtle differences.
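To make the batch-size arithmetic concrete, here is a small Python sketch. The function name is hypothetical and this is not cryoSPARC code. It only encodes the statement above that the number of particles used in early iterations is the batch size per class multiplied by the number of classes. The cap at the dataset size is an assumption.

```python
def particles_per_early_iteration(batch_size_per_class, n_classes, n_particles):
    """Particles used in one early iteration: batch size per class times classes.

    Assumption: a batch cannot be larger than the dataset, so cap at n_particles.
    """
    return min(batch_size_per_class * n_classes, n_particles)


if __name__ == "__main__":
    n_particles = 100_000  # dataset size suggested earlier in the thread
    n_classes = 20         # class count suggested earlier in the thread
    for batch in (200, 10_000):
        used = particles_per_early_iteration(batch, n_classes, n_particles)
        print(f"batch size per class {batch}: {used} particles per early iteration")
```

With 20 classes and 100,000 particles, a batch size per class of 10000 already covers the whole dataset, which corresponds to the "slow" behaviour of using all particles from the first iteration.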

Even with these “slow” parameter changes, I believe cryoSPARC will be substantially speedier (in terms of wall clock time) than other programs, so it may be worth a shot to play with.

PS. You should always dare to say anything on this forum - we are always looking for information and pointers to help us improve cryoSPARC!

rj.edwards · April 7, 2020, 7:22pm

@apunjani: Thanks for the batchsize tip! And I knew it had something to do with marginalizing. Unfortunately, that’s all I knew, since I don’t know what that means. If you’re willing, could you expand a bit on what marginalizing is and what the Force max over poses/shifts is doing?

Like most groups probably, we use a variety of packages for analysis. Our NSEM workflow developed within a Relion framework, but is starting to shift towards cryoSPARC as our datasets grow ever larger, for exactly the reason you cite: it’s faster. Our cryoEM mostly runs through cryoSPARC.

Love the responsiveness of the cryoSPARC community!

DanielAsarnow · April 7, 2020, 9:35pm

I also like to increase online-EM iterations to 40 or even 60. With batchsize 200 and 60 iterations, you will get excellent classes. This approach doesn’t slow down classification much, because a lot of time is always spent in the final full iteration. Limiting the alignment resolution to 12 Å is also a good idea.

apunjani · April 9, 2020, 3:32pm

Thanks @DanielAsarnow for the tip!

@rj.edwards glad we can help.
Marginalization is a generic concept. When we perform inference of an unknown target variable (the 2D class density images in this case) while there is also another unknown latent variable (the 2D pose and shift of each particle), marginalization means that instead of estimating just a single value of the latent variable (pose), we keep track of every possible value and how likely it is given the data. In 2D classification this corresponds to keeping a probability distribution over possible 2D angles for each image. Every angle gets a probability value (e.g. 0.1, 0.2, etc.), and when we reconstruct the target variable (the 2D density image for each class) we combine the experimental images by averaging them over poses, weighted by the probability of each pose.
So without marginalization (i.e. with “Force max over poses/shifts” on), we only keep track of the single maximum-probability pose for each image, and the 2D class density image is just the average of all particles in the class, each from a single pose.
With marginalization, we “blur” every image by weighted-averaging it over several poses, and then add all those averaged images together to get the reconstructed 2D class image.
So the “max” is really an approximation to marginalization (which is the more theoretically correct operation to perform) - we are replacing the full probability distribution with a point estimate. But in practice, “max” saves a lot of time and can actually be beneficial in many cases where the “width” of the probability distribution in marginalization is mis-estimated and too much blurring would happen.
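The difference between the two modes can be sketched in a toy 1D setting. This is only an illustration, not cryoSPARC's implementation. It assumes each "image" is a ring profile, each "pose" is a cyclic shift, the noise is Gaussian with a known `sigma`, and all poses are equally likely a priori. All function names are hypothetical.

```python
import math
import random


def roll(signal, shift):
    """Cyclically shift a list to the right by `shift` positions."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


def pose_posterior(image, reference, sigma):
    """Probability of every cyclic shift under Gaussian noise and a flat prior."""
    log_like = []
    for shift in range(len(reference)):
        aligned = roll(image, -shift)  # undo the candidate shift
        sq_err = sum((a - r) ** 2 for a, r in zip(aligned, reference))
        log_like.append(-sq_err / (2.0 * sigma ** 2))
    peak = max(log_like)  # subtract the peak for numerical stability
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    return [w / total for w in weights]


def aligned_max(image, posterior):
    """'Force max' behaviour: keep only the single most probable pose."""
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return roll(image, -best)


def aligned_marginal(image, posterior):
    """Marginalization: average the image over all poses, weighted by probability."""
    out = [0.0] * len(image)
    for shift, p in enumerate(posterior):
        for i, v in enumerate(roll(image, -shift)):
            out[i] += p * v
    return out


def class_average(aligned_images):
    """Average the per-particle contributions into one class profile."""
    n = len(aligned_images)
    return [sum(column) / n for column in zip(*aligned_images)]


if __name__ == "__main__":
    random.seed(0)
    reference = [math.exp(-((i - 8) ** 2) / 8.0) for i in range(24)]
    noise_sigma = 0.3
    particles = [
        [v + random.gauss(0.0, noise_sigma)
         for v in roll(reference, random.randrange(24))]
        for _ in range(200)
    ]
    posteriors = [pose_posterior(p, reference, noise_sigma) for p in particles]
    avg_max = class_average([aligned_max(p, q) for p, q in zip(particles, posteriors)])
    avg_marg = class_average([aligned_marginal(p, q) for p, q in zip(particles, posteriors)])
    print("peak of max-pose average:     %.3f" % max(avg_max))
    print("peak of marginalized average: %.3f" % max(avg_marg))
```

Passing a `sigma` that is much too large to `pose_posterior` makes the distribution nearly flat, and `aligned_marginal` then smears the image over almost every pose. That is the over-blurring from a mis-estimated width described above.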

Hope that helps