Advice on Number of Classes

Hello,

I am currently running 2D classification on ~2-4 million particles. We expect considerable heterogeneity in the particles due to the nature of the protein complex we’re working with. Given such a large number of particles and that much heterogeneity, would it make sense to use, say, 1,000-2,400 classes, or would that be overkill? Would I also need to increase the batchsize per class for such settings?

I was also wondering: what would increasing the number of iterations do in practice? What effect would that have on the 2D classes?

Thanks in advance.

I routinely use 100-200 classes for datasets with 2-3 million particles. I think 1,000+ would be far too many classes in almost all cases. Not only would the job take much longer to run, but I also don’t know how anyone could visually sort that many class averages into groups by compositional or conformational heterogeneity.
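To put those numbers in perspective, here is a quick back-of-envelope sketch (plain Python, no CryoSPARC calls; the 2.5 million particle count is just an assumed midpoint of the 2-4 million range) of how the dataset gets spread as the class count grows:

```python
# Rough particles-per-class arithmetic for different class counts.
n_particles = 2_500_000  # assumed midpoint of the 2-4 million range

for n_classes in (100, 200, 400, 1000, 2400):
    per_class = n_particles // n_classes
    print(f"{n_classes:>5} classes -> ~{per_class:,} particles per class on average")
```

Even at 2,400 classes each class still averages around a thousand particles, so the practical cost is less about starving the classes and more about run time and the impossibility of visually triaging that many class averages.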

Instead, I only use 2D classification for two reasons: 1) to get a quick sense of the overall quality of my dataset, and 2) to get rid of the most obvious, high-contrast junk. I do the bulk of my classification in 3D with ab initio reconstruction, heterogeneous refinement, and 3D classification. This thread has some helpful pointers.

In general, I don’t try to capture all the heterogeneity in my data from the start. Since compositional heterogeneity is usually much more discrete than conformational heterogeneity, I classify based on composition first, just to get quickly to a first set of consensus volumes (using ab initio and heterogeneous refinement). Next, for each consensus volume and its associated particle stack, I classify based on conformational heterogeneity (using 3D classification and 3D variability analysis).
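As a rough illustration of that staged approach (the stage names and descriptions below are mine, and the job names only loosely mirror CryoSPARC job types; this is an outline, not an API call):

```python
# Staged classification workflow from the post above, as a runnable outline.
workflow = [
    ("triage",       "2D Classification",
     "assess dataset quality and drop obvious high-contrast junk"),
    ("composition",  "Ab-Initio Reconstruction + Heterogeneous Refinement",
     "separate the more discrete compositional states into consensus volumes"),
    ("conformation", "3D Classification / 3D Variability Analysis",
     "resolve conformational heterogeneity within each consensus particle stack"),
]

for stage, job, goal in workflow:
    print(f"{stage:>12}: {job} -- {goal}")
```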

Edit to add: I believe increasing the number of iterations gives more time for the classification to converge and assign particles to the correct class. This appears to be helpful for small, low-SNR particles. This thread has a helpful discussion on some of these parameters.
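To make the iteration/batch-size trade-off concrete, here is a tiny illustrative sketch; the keys are descriptive placeholders rather than the exact CryoSPARC parameter names, and the values are made up for illustration:

```python
# Illustrative 2D classification settings (placeholder names and values, not
# CryoSPARC defaults). More iterations give the class assignments more passes
# over the data to converge; a larger batch per class means each class average
# is estimated from more particles per iteration, which tends to help for
# small, low-SNR particles.
settings_baseline = {"classes": 200, "iterations": 20, "batch_per_class": 100}
settings_low_snr  = {"classes": 200, "iterations": 40, "batch_per_class": 200}

for name, settings in (("baseline", settings_baseline), ("low-SNR", settings_low_snr)):
    print(name, settings)
```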


I agree. With so many particles, 3DVA should do a nice job. You just have to get rid of the junk first, with 2D classification and hetero refinement (force hard classification ON), and start by cutting the resolution to 12-18 angstroms. There is no single solution for this, but the most important motto is “don’t go too fast”.
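A minimal sketch of that junk-removal setup, written as plain data (the keys are descriptive placeholders, not the exact CryoSPARC parameter names, and the value is illustrative):

```python
# Junk-removal pass sketched as plain data; placeholder names, illustrative values.
junk_removal = {
    "job": "Heterogeneous Refinement",
    "force_hard_classification": True,  # each particle is assigned to exactly one class
    "starting_resolution_A": 15,        # keep it coarse, somewhere in the 12-18 A range
}
print(junk_removal)
```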


The time-per-iteration increases dramatically above 400 classes, so I usually stop there even with very heterogeneous, high-particle-count samples. You’ll also need to tweak the batchsize per class if you push the class count that high.

One or two rounds of 2D classification should suffice, followed by multiple rounds of heterogeneous refinement with perhaps 10-15 “good” target volumes and 5 “junk” volumes of different shapes. “Junk” volumes can be generated by running an ab initio with maybe 10 classes and killing it after the first 100-200 iterations, once some rough shapes come out of the data; just pick the worst ones to act as junk-sinks. Good models can (hopefully) be acquired by allowing ab initio to run to completion. Start heterogeneous refinement at maybe 8 Å for a couple of rounds, then push it up to 6 Å, then 4 Å. Unless the complex is very large, I usually find that starting finer than 8 Å actually hurts rather than helps.
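Laid out as a rough schedule (plain Python; the volume counts and resolutions follow the numbers above, while the structure and variable names are illustrative only):

```python
# Multi-round heterogeneous refinement schedule sketched from the recipe above.
n_good_volumes = 12       # "10-15 good target volumes" -> pick something in range
n_junk_volumes = 5        # junk-sinks taken from an early-terminated ab initio run

resolution_schedule_A = [8, 8, 6, 4]  # a couple of rounds at 8 A, then 6 A, then 4 A

for rnd, res in enumerate(resolution_schedule_A, start=1):
    print(f"Round {rnd}: {n_good_volumes} good + {n_junk_volumes} junk classes "
          f"at a classification resolution of {res} A")
```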

Depending on how stable the “core” of the heterogeneous complex is, it can then be worth doing a homogeneous refinement (into one class, obviously) and then pulling the sub-species out by 3D classification (into as many classes as the system can cope with) at 4-6 Å.

Something else I’ve taken to doing recently: for any heterogeneous refinement class I’m suspicious of, I run a quick 2D classification (into 20-30 classes) to check particle quality. If it looks OK, I’ll include it in further processing.
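That sanity check can be phrased as a tiny decision helper (a sketch only; the function and its boolean inputs are hypothetical stand-ins for what is really a visual judgement call in the GUI):

```python
def keep_hetero_class(looks_suspicious: bool, quick_2d_ok: bool) -> bool:
    """Keep a heterogeneous-refinement class unless a quick 2D check fails it.

    A "quick 2D" here means re-classifying just that class's particles into
    20-30 2D classes and eyeballing the averages for junk.
    """
    if not looks_suspicious:
        return True
    return quick_2d_ok

# Example: a questionable class whose quick 2D run came back clean is kept.
print(keep_hetero_class(looks_suspicious=True, quick_2d_ok=True))  # True
```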
