3D Classification strategy: Classify until exhaustive?

Hi everyone!

Today I would like to ask about your strategy for using 3D Classification to peel many conformations out of a dataset of a highly dynamic protein.

I have been using 3D Classification on 800k particles (selected through many rounds of 2D classification to remove junk), splitting them into 4 classes of 180k, 250k, 198k, and 203k particles. I then realized that each class can be split into 2-3 further classes. For one of those subclasses of 40k particles, when I ran 3D Classification again into 2 classes, I got back one class of ~39k particles and another class of junk. This is what I would consider “exhaustive classification”: the point where I cannot split a class any further.

The reason I have been using 3D Classification more actively is that the same job used to take more than a week on our old workstation. After migrating to a faster workstation, we can run it in 2 days, and on smaller particle sets in under 2 hours. So although I have been building models into the maps from the initial 4-class job, I have kept running more 3D Classification and keep discovering smaller, more intricate movements of my protein that were averaged out (and hence of lower quality) in the first 4 maps.

My question for the community: is it better to run 3D Classification until exhaustion (i.e. until classes cannot be split any further, or split only into homogeneous classes, confirmed by NU-Refine and close inspection of the maps), or is there a better way to classify the highly dynamic states of the protein captured by cryo-EM? I tried 3D Variability Analysis, but it has not really helped me separate the classes.

Moreover, should I be doing a “double-confirm classification”, where I rerun the 3D Classification job with the same parameters to see whether I get the same results? One way I found to make the 3D Classification job reproducible is to decrease the Convergence criterion (%) to a much lower value, such as 0.001, and increase the maximum number of F-EM iterations (up to 50-100), which gives very stable classes with few particles shifting between rounds of classification. But this is only practical on very fast workstations.
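For what it’s worth, beyond comparing class sizes, one way to quantify how reproducible two cloned classification runs are is to compare the per-particle class assignments directly. Here is a minimal sketch, assuming you have exported the class labels from both runs into NumPy arrays aligned on the same particle UIDs (the file names are hypothetical):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Hypothetical inputs: per-particle class labels from two cloned
# 3D Classification runs, aligned on the same particle UIDs/order.
run_a = np.load("run_a_class_labels.npy")  # shape: (n_particles,)
run_b = np.load("run_b_class_labels.npy")  # shape: (n_particles,)

# Contingency table: how particles from run A's classes redistribute in run B.
# Class labels are arbitrary between runs, so look at the pattern, not the order.
print(confusion_matrix(run_a, run_b))

# Adjusted Rand index: 1.0 = identical partitions, ~0 = no better than chance.
print("ARI:", adjusted_rand_score(run_a, run_b))
```

If the agreement stays close to 1 across clones (and the class sizes match), that is stronger evidence of stable classes than particle counts alone.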

Looking forward to hearing your thoughts!

1 Like

Hi @lecongmi001,

That’s an interesting topic. I think the strategy you described could be successful, especially with a highly dynamic, conformationally heterogeneous protein.

I do have a question, though: since 3D classification in cryoSPARC requires you to provide as many models as the number of classes you want, could this strategy introduce a bias? Unlike RELION, which produces as many classes as you ask for from the particles and only one reference map.

All my attempts to distinguish conformational states with the 3D classification job were unsuccessful, perhaps because my input models were too close to each other; as a result I always got classes that looked exactly the same (the input particles could produce a 3-3.5 Å map with NU refinement). What initialization mode do you use for the 3D classification job in cryoSPARC?

On my side, I use ab-initio jobs to separate the different conformational states, especially at low resolution. After that, I progressively un-bin my particles until they reach Nyquist, and sort them with heterogeneous refinement jobs. The multi-body refinement algorithm in RELION can also produce nice results for assessing dynamic states, but it has limits, especially regarding the size of your protein/complex.

Best,
Kevin

1 Like

3D classification does not require input models - by default models will be generated from reconstructions of subsets of the input particle set.

If you are seeing that all output volumes look identical, I would test switching on force hard classification, and also experiment with using different values for the target resolution.

Cheers
Oli

2 Likes

Hi Kevin,

Thank you for your input and for sharing your experience! For this type of classification, I usually run 3D Classification without any initial model (Initialization mode: simple) so that the job starts as randomly as possible, hoping to avoid any input bias.
So far, for some groups of particles, I was able to get two distinct classes with particle numbers that were consistent across different runs of the same Classification job. It’s a bit like running triplicates: when the particle numbers across all classes come out the same with exactly the same parameters (cloned jobs), I consider the classes to be “true” conformers. For some Classification jobs, the particle numbers kept shifting by ~1k between runs, so I tried reducing the number of classes; it turned out I had been “forcing” the particles into too many classes when in reality two of the classes belonged together as one.

Cheers,
Khoa.

Thanks Oli and Khoa for your experience, appreciate it!

Do you usually run 3D classification with low-resolution particles (let’s say 3 or 4 Å/pix), or with high-resolution particles (after NU refinement, for instance)?

Best,
Kevin

It very much depends on what you are looking for (the scale of the heterogeneity, and the resolution at which it becomes apparent).

But in any case you will want to run a refinement of some sort first, as 3D classification uses the input alignments, even if you downsample the particles prior to classification (which may be advantageous for reasons of speed)
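For deciding how far to downsample, here is a tiny sanity-check sketch (the 0.85 Å/pix raw pixel size and the 4× Fourier crop are hypothetical example numbers): whatever target resolution you give the classification job should be no finer than the Nyquist limit of the downsampled particles.

```python
def nyquist_resolution(pixel_size_A: float) -> float:
    """Best resolution representable at a given pixel size (Nyquist = 2 * pixel size)."""
    return 2.0 * pixel_size_A

original_apix = 0.85                    # hypothetical raw pixel size (Å/pix)
downsampled_apix = original_apix * 4    # 4x Fourier cropping quadruples the pixel size

print(nyquist_resolution(original_apix))     # 1.7 Å
print(nyquist_resolution(downsampled_apix))  # 6.8 Å -- still plenty for large-scale heterogeneity
```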

1 Like

As often, the answer is “it depends”. It depends primarily on which type of heterogeneity you are facing. If it is purely discrete heterogeneity (compositional heterogeneity is always discrete; conformational heterogeneity can be discrete), and at a scale that allows discriminating classes, then heterogeneous refinement and/or 3D classification should in principle let you completely resolve it. But when faced with continuous conformational heterogeneity, classification approaches won’t work, because they would need an infinite number of classes to model the data correctly. In such cases, trying to classify exhaustively will only lead to more and more classes, each less populated than the last. Map quality will improve over the first few rounds, because the most different conformations start separating into different classes, but eventually it will degrade as the number of particles per class drops below a usable number and there is no longer enough accumulated signal for a good reconstruction.
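To put rough numbers on why map quality eventually degrades, here is a purely illustrative sketch (the 800k starting stack and 4-way splits are just example numbers) of how quickly the average per-class particle count shrinks under repeated splitting:

```python
# Illustrative only: repeated k-way splitting of a particle stack.
# With continuous heterogeneity there is no "correct" number of classes,
# so every extra round simply divides the available signal further.
n_particles = 800_000   # example starting stack
k = 4                   # classes per round of splitting

for round_idx in range(1, 6):
    per_class = n_particles / k**round_idx
    print(f"round {round_idx}: ~{per_class:,.0f} particles per class on average")
```

After only three or four rounds you are down to a few thousand particles per class, which is usually too little accumulated signal for a good reconstruction.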

Continuous heterogeneity is a difficult problem, and in practice you often encounter all kinds of heterogeneity (discrete and continuous), so you need to address all of them either one by one (for instance, separating different species by heterogeneous refinement and/or 3D classification, then resolving conformations of each single species using 3DVA or FlexRefine) or all at once (cryoDRGN is good at doing this!).

I worked on a case like this a couple years ago, for which exhaustive classification was leading nowhere. What eventually worked was 3DVA and cryoDRGN. It’s here if you’re interested in reading about it: https://doi.org/10.7554/eLife.71420

5 Likes

To second @olibclarke’s point, see cryoSPARC’s tutorial: https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/tutorial-3d-classification

The per-particle class ESS histogram displayed in 3D Classification (≥v4.0) can be a good indicator of whether you have particles with a high probability of belonging to more than one class.

I had this happen in one of my recent attempts:

Per-particle class ESS histograms (screenshots): Force hard classification = OFF vs. Force hard classification = ON
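For anyone curious about the numbers behind these histograms, here is a minimal sketch using one common definition of a per-particle class ESS (the inverse of the sum of squared class posteriors, i.e. the “effective number of classes” a particle occupies). cryoSPARC’s exact implementation may differ, so treat this as illustrative; it also shows why forcing hard classification pins the ESS at 1, since the posterior becomes one-hot.

```python
import numpy as np

def class_ess(posteriors: np.ndarray) -> np.ndarray:
    """Effective number of classes per particle.

    posteriors: (n_particles, n_classes) array of class probabilities,
    each row summing to 1. Uses ESS = 1 / sum_k p_k**2 (illustrative definition).
    """
    return 1.0 / np.sum(posteriors ** 2, axis=1)

# A particle split 50/50 between two of three classes -> ESS = 2
print(class_ess(np.array([[0.5, 0.5, 0.0]])))  # [2.]

# Hard classification: one-hot posterior -> ESS = 1 for every particle
print(class_ess(np.array([[1.0, 0.0, 0.0]])))  # [1.]
```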

1 Like

Is there a particular refinement job that works best? I can run 3D Classification right after Ab-Initio Reconstruction by dragging and dropping the 3D alignments, but would it be better to refine further via NU or heterogeneous refinement first? Would further refinement improve the quality of the 3D alignments used by the 3D Classification job?

In the same vein, would making these refinement jobs multi-class, based on a rough knowledge of the ratios making up the heterogeneity present in the dataset, help the alignments?

New to this, thanks!

Hi, and welcome to the forum! I would definitely do some refinements after Ab Initio, because the ab initio map can be really poor. Moreover, ab initio reconstruction is usually one of the earliest steps in the 3D processing pipeline - have you cleaned up junk particles yet? I’d recommend Oli’s decoy classification method using Heterogeneous Refinement. You can find some resources about it here:

Once you’ve sorted out the junk particles, the next step would be to get a higher-resolution consensus reconstruction, typically using jobs like Homogeneous Refinement, Non-Uniform Refinement, or Local Refinement. From there, you can use 3D Classification. Because 3D classification doesn’t do any alignment, it’s usually best to provide the best alignments possible, so that the different 3D classes represent distinct compositional/conformational states - otherwise, the classification might be dominated by subtle differences between improperly aligned particles.

To answer your last question, if you have prior knowledge of what the distribution of your heterogeneity looks like, you can absolutely use that to inform how many classes to use. For example, if your protein is known to exist in two different states and each state represents 1/2 of the particles, then two classes is a good place to start. However, if one of the two states only represented 1/10th of the data, then I might use 10 classes. I also usually run another 3D classification job with the same parameters, but with Force Hard Classification turned on - this often helps with finding rarer classes. Hope this helps!
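As a rough illustration of that rule of thumb, here is a hypothetical helper (the function name and the rounding are mine, not a cryoSPARC feature): pick enough classes that the rarest expected state could roughly fill one class on its own.

```python
import math

def suggested_class_count(rarest_state_fraction: float) -> int:
    """Rule-of-thumb class count: enough classes that a state occupying
    `rarest_state_fraction` of the particles could fill about one class."""
    return math.ceil(1.0 / rarest_state_fraction)

print(suggested_class_count(0.5))  # two equally populated states   -> 2 classes
print(suggested_class_count(0.1))  # rarest state ~1/10 of the data -> 10 classes
```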

Best,
cbeck

This is incredibly informative. Thank you so very much!

This got me over a hump I had been stuck on for some time, so I’m deeply grateful. Is there a minimum number of particles you would recommend for 3D classes? I know you said to base the number of classes on the compositional/conformational heterogeneity, but I’m curious whether you have a general number for how many particles you would typically want in a 3D class. Obviously it will be variable, but let me know if you have a basic rule of thumb. The same goes for 2D classes, although as I understand it these are only projections, so it’s not as relevant (for example, a class could technically contain only 5 or 6 particles and still project as a semi-okay-looking class average). Am I correct, or am I misunderstanding?

Thanks and all the best!

I have played with this as suggested by you and @cbeck, but I have a question someone here might be able to answer.

Won’t it always show an ESS of 1 when hard classification is turned on, since each particle is assigned entirely to its best class?

If so, could this not still result in heterogeneous particles within a given class? For example, if I were hypothetically classifying into 15 3D (or 2D) classes and 4 are great, some okay, and some bad, is it not possible that “good/okay” particles end up in a less desirable class because hard classification was used?

For 2D classification, the general rule of thumb I’ve been taught is to have at least 1000 particles (on average) per class, but this starts to break down with larger datasets. I routinely use 200 2D classes for 2 million particles, which gives me good results. I don’t think many people use much more than 200 classes, because the job starts to run extremely slowly.

For 3D classification, I usually shoot for 40-50k particles per class. However, some of my colleagues who work with large, rigid proteins have used as few as 10-20k particles per class.

For both 2D and 3D classification, it’s important to note that the “batch size per class” parameter is distinct: it controls how many particles are used for the initial iterations, and it can be really important to tune for difficult datasets. The fewer particles you use for the initial iterations, the faster the job will run, but if you don’t use enough, the classification might be unstable, so I’ve never gone below the default batch sizes. For 2D classification, people on this forum have recommended batch sizes from 200-1000; I’ve personally had a lot of success with a batch size of 400 for a dataset with preferred orientation. For 3D classification, a user in this thread once used a batch size of 10,000.
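To make the particles-per-class rules of thumb concrete, here is a quick check one can do before launching a job (the ~40-50k target is just the rule of thumb above, and the example numbers mirror the ones in this thread):

```python
def avg_particles_per_class(n_particles: int, n_classes: int) -> float:
    """Average particles per class assuming an even split."""
    return n_particles / n_classes

# 2D example from above: 2 million particles over 200 classes -> 10k per class
print(avg_particles_per_class(2_000_000, 200))

# 3D example: an 800k stack over 4 classes -> 200k per class,
# comfortably above a ~40-50k-per-class rule of thumb
print(avg_particles_per_class(800_000, 4))
```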

As for your other question, yes, the ESS will always be 1 when hard classification is turned on. Hard classification has worked wonders for me in the past, but unfortunately, I can’t really explain why it works so well. @Mark-A-Nakasone, do you have a better understanding of why hard classification can give better results?

Best,
cbeck

1 Like

Hi @cbeck, no idea about force hard classification = on, but we do tend to use the per-particle class ESS as a diagnostic and compare multiple 3D Classification jobs, downstream in the SPA pipeline after NU or Homogeneous Refinement.

Not that it is necessarily worth the time, but one could explore this by finely sampling the intermediate results (“Keep results of every F-EM iteration” = true). We usually keep that off to save space.

Could you go into some detail about how you use the ESS as a diagnostic? I probably haven’t been using it as effectively as I could be. Based on the tutorial, an ESS > 1 means that there’s uncertainty in the classification and that it can be helpful to increase the number of O-EM epochs. Has this generally been helpful for you?