Ground truth classification

Consider the scenario of 100,000 particles with 2 discrete conformations, A and B. I am looking for the most accurate way to separate and quantify the populations with CryoSPARC. Some instances (datasets) contain 5% conformation A, whereas some contain 50% conformation A. Heterogeneous refinement seems to NEED to put ~10k particles into every class, so the outcome is a product of the input: 1A + 1B + 1junk is not equal to 2A + 2B + 2junk, and if I give 5 junk references then half my particles are mis-classified as junk, since those classes must be populated. Heaven forbid I give 1A + 3B; it will suggest the data is predominantly B form. Is there a parameter/toggle that allows a multi-reference input to ignore, or appropriately populate, the input classes?
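To make the worry concrete, here is a toy back-of-the-envelope sketch (plain Python; all numbers hypothetical): if K references are each forced toward ~1/K occupancy, the reported A fraction tracks the reference mix rather than the data.

```python
# Toy illustration (hypothetical numbers): if K classes are forced toward
# roughly equal occupancy, the apparent A fraction is set by how many A
# references were supplied, not by the true composition of the stack.
true_frac_A = 0.05  # one dataset; another might be 0.50

reference_mixes = {
    "1A + 1B + 1junk": (1, 3),
    "2A + 2B + 2junk": (2, 6),
    "1A + 3B": (1, 4),
}

for label, (n_A_refs, n_refs) in reference_mixes.items():
    apparent_A = n_A_refs / n_refs  # forced-equal occupancy assumption
    print(f"{label}: apparent A = {apparent_A:.0%} vs true {true_frac_A:.0%}")
```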

FWIW, if I try 3D Classification (BETA) with PCA initialization (no reference model), I get wacky numbers that over-represent class B, as I can clearly tell from 2D. Multi-class ab initio only works well with huge stacks or large particles, and 3DVA is impractical here. Is there another strategy I'm missing? Hard classification seems to have no effect.

I am also considering some kind of 2D template-matching strategy, since the states are known at high resolution; I'd be interested in suggestions here.

RELION will gladly give a 3D class with 0.1% of the particles if there are truly no particles of that type; this is the behavior I'm trying to replicate.

@olibclarke my boss told me to write you directly :wink:

Seems like a generally useful discussion, as was also mentioned here: Equal distribution of particles in ab initio/3D classes

We have had mixed experience with this: some samples behave as you describe, where it seems like they want to split into even populations no matter what, while others behave quite well, allowing minor populations to be resolved fairly reproducibly. We certainly don't see what you describe with regard to junk classes; usually we provide 7 or 8 junk classes initially, and after a couple of rounds these are all usually populated at <1%.

I will say that for small populations a kind of hybrid strategy has recently proved useful: we first run a round of heterogeneous refinement to assign orientations for the major distinct classes, then take all of the output particles and feed them into a round of deep 3D classification with many classes (say 50-80), using a target resolution commensurate with the scale of the expected structural variability.

This has the advantage that even if there is a bias towards equally populating the classes, it doesn't matter too much, because you have greatly overestimated the number of classes: you can simply combine the groups of ~identical classes afterwards and sort it out that way (see the sketch below). This strategy has allowed us to isolate minor classes in samples where we were not able to identify them any other way.
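As a sketch of the bookkeeping for that last step, assuming you have exported per-particle class labels from the deep classification (e.g. via cryosparc-tools; the file name, label encoding, and group assignments below are all placeholders):

```python
import numpy as np

# Hypothetical input: one integer class label (0..59) per particle, exported
# from the deep 3D classification job. How you export this will depend on
# your tooling (cryosparc-tools, csparc2star.py, etc.).
labels = np.load("class_labels.npy")

# After visually inspecting the 60 class volumes, assign each populated class
# to a conformation. These groupings are illustrative only.
groups = {
    "A": [0, 3, 7, 12],
    "B": [1, 2, 5, 9, 14, 22],
    "junk": [4, 6],
}

total = len(labels)
for name, members in groups.items():
    n = int(np.isin(labels, members).sum())
    print(f"{name}: {n} particles ({100 * n / total:.1f}%)")
```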

Awesome, thank you! What do you use for junk? I always provide 1 or 2 junk classes, where the junk is an early iteration or an improper model from ab initio. And you must be using pristine 2D-cleaned particles; I'm trying both rough sets and cleaned sets, and the junk classes populate accordingly.

I guess with the hybrid strategy you have the advantage of approximately assigning Euler angles with decent model independence, and then allowing huge room for model re-formation from there. That sounds really ideal, neat trick! Will give it a go and report back. I had always used high-resolution refinements as input for 3D classification, worried that slight changes in shifts/angles would be the only thing classified, but I guess that is not the case.

For junk, we usually use either completely random density (which you can make by killing an ab initio job during the first iteration), or models of specific contaminants (empty nanodiscs, micelles, edges) generated by running ab initio on selected 2D classes.
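If you'd rather synthesize a featureless junk reference directly instead of killing an ab initio job, here is a minimal sketch with numpy + mrcfile (box size and pixel size are placeholders; match them to your real references):

```python
import numpy as np
import mrcfile

# Featureless "junk" reference: random density under a soft spherical mask.
# Box size and pixel size below are placeholders.
box, apix = 128, 1.05
rng = np.random.default_rng(0)
vol = rng.normal(size=(box, box, box)).astype(np.float32)

# Soft spherical falloff so the density doesn't fill the box corners
z, y, x = np.indices((box, box, box)) - box // 2
r = np.sqrt(x * x + y * y + z * z)
vol *= np.clip((0.4 * box - r) / (0.1 * box), 0.0, 1.0).astype(np.float32)

with mrcfile.new("junk_reference.mrc", overwrite=True) as mrc:
    mrc.set_data(vol)
    mrc.voxel_size = apix
```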

We are not using pristine 2D classes, by the way; we usually do hetero cleanup against an almost-raw particle stack (although usually picked with Topaz or similar), with just the high-contrast junk removed.

BTW, for the het-to-3D-class trick to work the way I envision, the heterogeneous refinement should be done with references on a common axis. I now have all conformation-B particles separated into some number of classes and all conformation-A particles separated into others, with no particles shared between the two since they were aligned in different orientations, so it's effectively two small 3D classification jobs. Strangely, all conformation-B outputs are indecisive between A/B, whereas all conformation-A outputs are well defined; this is on me to pin down further, ongoing.

Target resolution changes the ability to parse particles into classes, but does not in any way (almost down to the particle!) change the total number of particles observed in each conformation. Changes to the other 3 parameters (0.9 instead of 0.75, 500 instead of 100, 0.1 instead of 0.2) have no noticeable effect on the time the job takes to run or on the % conformations, but the number of populated classes dropped from 14/60 to 7/60.

Adding or not adding a junk class also does not change the relative populations of A vs B, only the number of particles that remain in each; the % junk is pulled equally out of each conformation.
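A quick sanity check of that proportionality, with hypothetical counts from runs with and without a junk reference:

```python
# Hypothetical counts: if junk is pulled out proportionally, the A:B ratio
# should be ~unchanged whether or not a junk class is present.
no_junk = {"A": 5_100, "B": 94_900}
with_junk = {"A": 4_800, "B": 89_300, "junk": 5_900}

print(f"A:B without junk = {no_junk['A'] / no_junk['B']:.4f}")
print(f"A:B with junk    = {with_junk['A'] / with_junk['B']:.4f}")

for conf in ("A", "B"):
    lost = 1.0 - with_junk[conf] / no_junk[conf]
    print(f"{conf}: {lost:.1%} absorbed by the junk class")
```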

In het refinement, any excess of model references in either direction does seem to affect the populations, suggesting bias, which implies an inherent bias in the other direction if you add only one of each.

Right, yes, the input models should be in the same orientation.