Option to do reconstruction of each class in 3D classification beta?

Hi,

I have been running some tests on the new 3D-classification (beta) job, and I am trying to understand what is going on.

What I am finding is that when I run 3D classification using a mask, on a system where I have characterized the heterogeneity - two well-defined states separable by either heterogeneous refinement or 3D-VA - I am seeing much less pronounced heterogeneity in the output classes than I expect.

Basically they all look the same, with minor differences in the regions where I know there is heterogeneity. Having said that, the class distribution is uneven, and reproducibly so, and the minor differences are muted “ghosts” of what we see in 3D-VA, heterogeneous refinement, or Relion classification without alignments.

Then I decided to RTFM, and saw the attached paragraph. I then looked at the log and realized that the “dataset-wide mean class ESS” is in the 9-16 range at the end of classification (for a 20-class job).

Does this mean that the output volumes are not reflective of a reconstruction of the particle set assigned to each class, or am I misunderstanding? If this is the case, would it be possible to add an option to calculate a naive reconstruction of each class in the final step, without considering the contribution of every particle in every other class?

Also out of curiosity, does heterogeneous refinement use weighted back projection too? Or is this a new thing in 3D classification (beta)?

Cheers
Oli

UPDATE:
Calculating per-class reconstructions using the Homogeneous Reconstruction job type gives much more sensible and interpretable results, but is rather a pain to do for 20 classes, let alone 100. Would be great to have an option for this integrated into the job!


Also - Target class ESS fraction - just to understand, is this effectively setting a convergence criterion? So by default, the number of iterations will be set such that the mean ESS is ~75% of the number of classes?

But then later in the docs it states that the classification is converged when this value reaches 1, and in my hands it seems to be still 9+ at the end of classification… am I doing something wrong? Or does this just mean that the nominal number of classes is likely greater than the actual number of classes in the data?

Is there any way to see the per-class ESS values? If I am understanding correctly, this would be helpful for getting a sense of variability - if one class has a low ESS and the others have high ESS, does this mean that the low-ESS class is distinct, i.e. that its particles do not have significant probabilities of assignment to other classes?



Thanks for all this feedback @olibclarke! Responses below:

Does this mean that the output volumes are not reflective of a reconstruction of the particle set assigned to each class, or am I misunderstanding?

Yep, that is correct.

If this is the case, would it be possible to add an option to calculate a naive reconstruction of each class in the final step, without considering the contribution of every particle in every other class?

Thanks for this suggestion! We discussed it internally and came up with the idea to create a new ‘multi-class’ reconstruct-only job for any ‘all_particles’-type outputs that include class assignments. This should be in a release soon!

Also out of curiosity, does heterogeneous refinement use weighted back projection too? Or is this a new thing in 3D classification (beta)?

Heterogeneous refinement does indeed use weighted backprojection – the job I mentioned above will be useful to get single-class reconstructions from hetero_refine as well.
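For intuition, the difference between the two schemes can be sketched as a toy accumulation step. Everything here is an illustrative assumption (the array names, and the omission of poses, CTF, and Fourier-space weighting), not CryoSPARC's actual implementation:

```python
import numpy as np

def accumulate_class_volumes(particle_contribs, posteriors, hard=False):
    """Toy accumulation of K class volumes.

    particle_contribs: (N, V) back-projected contribution of each particle
                       (a real backprojection also handles poses, CTF, and
                       Fourier-space weighting, all omitted here)
    posteriors:        (N, K) class probabilities per particle
    hard=True emulates a naive per-class reconstruction, where each
    particle contributes only to its most probable class.
    """
    if hard:
        w = np.zeros_like(posteriors)
        w[np.arange(len(posteriors)), posteriors.argmax(axis=1)] = 1.0
    else:
        # Weighted backprojection: every particle contributes to every
        # class, weighted by its posterior probability.
        w = posteriors
    return w.T @ particle_contribs  # (K, V) unnormalized class volumes
```

With diffuse posteriors (high class ESS), the weighted volumes are near-identical mixtures of all particles - consistent with the muted “ghost” differences described above - while hard assignment recovers distinct per-class maps.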

Also - Target class ESS fraction - just to understand, is this effectively setting a convergence criterion? So by default, the number of iterations will be set such that the mean ESS is ~75% of the number of classes?

This does not set a convergence criterion; it (optionally) allows the job to tune the initial class similarity (which in turn adjusts the initial class distribution). When the class similarity (‘Beta’) is 0, the class assignments are defined by the standard maximum-likelihood optimization. To ensure that the initial volumes don’t inadvertently skew the initial class assignments such that most particles belong to only a few classes, we define this constant, ‘Beta’, and increase it until the probability mass is diffuse enough that the empirical class ESS is at least the target class ESS (and hence each particle has a non-negligible probability of coming from any of the classes). Beta is then annealed down to 0 throughout the classification to ensure the final class assignments are indeed the maximum-likelihood classes under the resulting volumes.
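As a rough illustration of that mechanism - where the tempering form `log_likelihoods / (1 + beta)`, the step search, and all names are my own assumptions for the sketch, not the actual implementation:

```python
import numpy as np

def class_probabilities(log_likelihoods, beta):
    # Higher beta ("class similarity") flattens the posterior;
    # beta = 0 recovers the standard maximum-likelihood posteriors.
    scaled = log_likelihoods / (1.0 + beta)
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scaled)
    return p / p.sum(axis=1, keepdims=True)

def mean_class_ess(p):
    # Per-particle class ESS = 1 / sum_k p_k^2: equals 1 when all mass is
    # on one class, and K when mass is uniform over K classes.
    return float((1.0 / (p ** 2).sum(axis=1)).mean())

def tune_beta(log_likelihoods, target_ess_fraction, n_classes, step=0.5):
    # Increase beta until the empirical mean class ESS reaches the
    # target fraction of the number of classes (capped for safety).
    beta = 0.0
    while (mean_class_ess(class_probabilities(log_likelihoods, beta))
           < target_ess_fraction * n_classes and beta < 1e4):
        beta += step
    return beta
```

This also shows why a target class ESS fraction of 0.75 with 20 classes gives an initial mean ESS around 15, which then falls as Beta anneals to 0.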

But then later in the docs it states that the classification is converged when this value reaches 1, and in my hands it seems to be still 9+ at the end of classification… am I doing something wrong? Or does this just mean that the nominal number of classes is likely greater than the actual number of classes in the data?

Is this consistent across different hyper-parameters? Have you tried increasing the number of epochs or changing the batch size per iteration? Does the class histogram change much in the last few iterations? This could mean that there are 9+ very similar classes that are difficult to differentiate between, but this seems unlikely since you are seeing appropriate heterogeneity when you reconstruct the single-class particle sets. Thus there is probably a single mode among these 9+ classes that is indeed the correct class. Perhaps this is a sign that the algorithm simply has not converged, but it’s tough to say.

Is there any way to see the per-class ESS values? If I am understanding correctly, this would be helpful for getting a sense of variability - if one class has a low ESS and the others have high ESS, does this mean that the low-ESS class is distinct, i.e. that its particles do not have significant probabilities of assignment to other classes?

Ah yep, that is indeed what this would mean and we do actually report this in the heterogeneous refinement job! In the next release, I can add this to the 3D Classification to go along with the Class Distribution histogram (which should be similar to per class ESS if most particles have a class ESS of 1).
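One way such a per-class diagnostic could be computed - using an assumed definition for illustration (the mean per-particle ESS among particles whose most probable class is k), which may differ from what the job actually reports:

```python
import numpy as np

def per_class_mean_ess(posteriors):
    """posteriors: (N, K) class probabilities per particle."""
    ess = 1.0 / (posteriors ** 2).sum(axis=1)  # per-particle class ESS
    labels = posteriors.argmax(axis=1)         # hard class assignment
    n_classes = posteriors.shape[1]
    # Mean ESS within each class; NaN for empty classes. A value near 1
    # suggests the class's particles are confidently (distinctly) assigned.
    return np.array([ess[labels == k].mean() if (labels == k).any()
                     else np.nan for k in range(n_classes)])
```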


Thanks @vperetroukhin this is a very helpful explanation!
Re this:

Is this consistent across different hyper-parameters? Have you tried increasing the number of epochs or changing the batch size per iteration? Does the class histogram change much in the last few iterations?

I haven’t tried tweaking these parameters yet - was just trying with more or less default parameters on some different well-characterized datasets to get a feel for what I need to tune. If the mean ESS can be used to assess convergence, perhaps it would be useful to print a graph of the per-epoch mean ESS to the log? I can certainly try increasing the number of epochs and increasing the batch size and see what happens… will try and let you know if I see a difference!

UPDATE:
Following up on this - increasing the batch size per class reduced the final ESS slightly, but not by much: from 9 to 6 for a 10-class job. Seeing if increasing the number of epochs helps now.

UPDATE:
Increasing the number of epochs did not help - the ESS is still high, and the classes still look much more similar to one another than the corresponding per-class reconstructions do.

Cheers
Oli


One more query @vperetroukhin:

Regarding the PCA which is optionally used for initializing classification, am I understanding this correctly:

  1. Select random subsets of particles (many more subsets than desired classes)
  2. Reconstruct each subset
  3. Cluster the subset-reconstructions using PCA (similar to the approach used in 3D-VA display for clustering…?). Average “superclusters” to give desired number of classes.

If this is right, when is the initial lowpass filter applied? Is it applied right at the end to the averaged clusters, or initially, to the reconstruction of each subset?

Also, how many reconstructions should we use? The default is 100, but I guess that won’t be appropriate if you want 100 output classes - approximately how many subset reconstructions should one have per output class?

Cheers
Oli

Hi

I am also testing new 3D Classification (beta) with my dataset.
As Oli mentioned above, the output volumes didn’t reflect the actual particle distributions (all outputs looked the same, even for classes of only 5~7k particles).
For some reason, the output particles seem to have lost their original locations, and homogeneous and NU-refinements using the output volumes sometimes gave horrible results. Refinements using the input volume gave me reasonable results.

The class with the most particles was not necessarily the most solid class for me. In some cases, the largest classes were the worst.

I also compared heterogeneous refinement (with several identical input volumes) against the new 3D classification (beta). For some reason, the outputs from 3D classification lost the particles associated with certain views.
In particular, when I used masked 3D classification, the major class (containing 60~70% of the particles) was missing particles from rare and minor views (my dataset consists of ~80% front views and ~20% top and profile side views). I don’t know why.

If anyone knows about answer, please let me know.

Thanks,

Jinseo

Hey @olibclarke,

Yep – this is basically right! Regarding step 3:

  1. Cluster the subset-reconstructions using PCA (similar to the approach used in 3D-VA display for clustering…?). Average “superclusters” to give desired number of classes.

We apply PCA directly on the space of voxels after applying a mask to each reconstruction (either an auto-generated mask based on the average of all reconstructions, or the one supplied as input) and use the same clustering procedure as the 3D Variability Display job, namely a Gaussian Mixture Model. One more subtlety is that a simple outlier-rejection procedure is applied prior to clustering (based on percentiles of the coordinate along the first PCA dimension), to reduce the chance of single-volume clusters.
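A rough, numpy-only sketch of that initialization pipeline, with flattened volumes as rows. To keep it self-contained I use PCA via SVD and a simple k-means as a stand-in for the job's Gaussian Mixture Model; the percentile scheme and all parameter names are assumptions, not the actual implementation:

```python
import numpy as np

def initial_class_volumes(volumes, mask, n_classes, n_components=4,
                          outlier_pct=(1.0, 99.0)):
    """volumes: (n_recon, n_voxels) flattened subset reconstructions;
    mask: (n_voxels,) weights in [0, 1]."""
    X = volumes * mask                  # restrict analysis to the masked region
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA on voxel space
    Z = Xc @ Vt[:n_components].T        # (n_recon, n_components) coordinates

    # Outlier rejection on the first PCA coordinate (assumed percentile
    # scheme, to reduce the chance of single-volume clusters).
    lo, hi = np.percentile(Z[:, 0], outlier_pct)
    keep = (Z[:, 0] >= lo) & (Z[:, 0] <= hi)
    Zk = Z[keep]

    # k-means with farthest-point init, standing in for the GMM clustering.
    centers = [Zk[0]]
    for _ in range(1, n_classes):
        d = np.min([((Zk - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Zk[d.argmax()])
    centers = np.stack(centers)
    for _ in range(50):
        d = ((Z[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_classes):
            if (labels == k).any():
                centers[k] = Z[labels == k].mean(axis=0)

    # Average each cluster to form the initial class volumes.
    return np.stack([volumes[labels == k].mean(axis=0)
                     for k in range(n_classes)]), labels
```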

If this is right, when is the initial lowpass filter applied? Is it applied right at the end to the averaged clusters, or initially, to the reconstruction of each subset?

The latter. For each reconstruction, we only solve for Fourier components up to the lowpass filtering threshold.
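Concretely, a hard lowpass at threshold `lowpass_A` amounts to only keeping (or solving for) Fourier components inside a sphere of radius `1 / lowpass_A` - an illustrative sketch, not the actual solver:

```python
import numpy as np

def lowpass_mask(box_size, pixel_size_A, lowpass_A):
    # Boolean mask over a 3D Fourier grid: True for components at
    # resolution coarser than lowpass_A, i.e. |freq| <= 1 / lowpass_A.
    freqs = np.fft.fftfreq(box_size, d=pixel_size_A)  # cycles per Angstrom
    fx, fy, fz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    return np.sqrt(fx**2 + fy**2 + fz**2) <= 1.0 / lowpass_A
```

Solving only inside this sphere keeps the number of unknowns per subset reconstruction small, and keeps the reconstructions comparable at low resolution.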

Also, how many reconstructions should we use? The default is 100, but I guess that won’t be appropriate if you want 100 output classes - approximately how many subset reconstructions should one have per output class?

Yes, that’s right. In my testing I’ve used at least 3-5 reconstructions per output class (e.g., 400 reconstructions for 100 classes). Thinking about this, perhaps we should actually change this to a normalized ‘reconstructions per class’ parameter in the future for a more robust default set.


Yes, that’s right. In my testing I’ve used at least 3-5 reconstructions per output class (e.g., 400 reconstructions for 100 classes). Thinking about this, perhaps we should actually change this to a normalized ‘reconstructions per class’ parameter in the future for a more robust default set.

Yes - or maybe just have it auto-adjust, in the same way that the search range in local refinement auto-adjusts when you adjust the Gaussian prior.


Hi @Jinseo,

If you have a second, can you expand more about this part:

For some reason, the output particles seem to have lost their original locations, and homogeneous and NU-refinements using the output volumes sometimes gave horrible results. Refinements using the input volume gave me reasonable results.

Is the resolution very poor or does the refinement fail in some other way?

It seems that @user123 also reported similar results in a separate thread. I’ve tried to reproduce this on my end but can’t seem to spot anything obvious.

Unfortunately, I can’t share my full processing data - just some partial results:

From 3D classification against the entire volume:

Initial real-space slices from NU-refinement of the best class (446K particles)

After 1 iteration

After a few more iterations, the volumes become normal.

But when I masked part of my volume:

Even at the final iteration, the volume wasn’t recovered.

Slices from final iteration (223K particles).

The resolution was stuck at 6 Å.

But when I used the input volume of the 3D classification job, everything was normal; the final resolution was 3.5 Å.

To compare classification of the entire volume using heterogeneous refinement versus the new 3D classification (beta): both major classes had similar particle numbers (~660K), but the class from 3D classification lost some views. The nominal resolutions are the same, but the quality of the 3D classification result is much worse than the heterogeneous refinement result.

From Hetero-
P10_J188_viewing_direction_distribution_iteration_007

From 3D Classification
P10_J203_viewing_direction_distribution_iteration_009

Jinseo

So if we are looking for subtle differences, and starting from a high-resolution refinement, would it be a valid strategy to start with a much less aggressive lowpass filter than the default, say 4.5 or 5 Å, to get a set of structures with more variability in the way of small helix movements etc., as opposed to larger domain motions?


If you had to say what the new 3D classification is good for, given the current limitations, what would you say? Thanks!

@marino-j so far it has been useful for what you would expect - classification of conformational/compositional changes in local regions. It is just that for analysis, I have found it more informative to calculate per-class reconstructions, rather than the weighted backprojections used by default.

Having said that, I still don’t understand why the ESS values remain so high for all my “test” classifications - it would be useful to have more details about exactly what parameters were used for the examples in the tutorial.


@olibclarke Happy to provide those! (we tried to add all the salient ones to the tutorial)

EMPIAR 10077

EMPIAR 10256


I can’t get Homogeneous Reconstruction jobs to work as a follow-up after 3D classification (IndexError: list index out of range). Homogeneous refinement jobs run fine, but the no-alignment classification is intentionally performed on masked regions that are too small to align well. I tried on multiple processing nodes, with or without a mask, supplied either from the 3D classification job or from the original mask creation job.

@CryoEM1 – this occurs because the GS split information is currently not being passed through to the class sets appropriately. We’ll have a fix for this in the next patch. For now, you can get around this by turning on Force re-do GS split in the reconstruction advanced options.


Thank you, this worked.


I have not tried this myself, but it certainly makes sense to me!

@Jinseo I think the issue here is that your output volumes from 3D classification are locally masked, and therefore not appropriate as volume inputs for NU-refine. That is why the input volume to 3D classification works, but these do not.

Cheers
Oli


FYI everyone: the latest patch includes two fixes that address issues raised in this thread:

  1. GS split should now be properly passed through to particle subsets
  2. Class volumes are now unmasked, so they should work with downstream refinement jobs.