I’ve posted about this before but I’m still puzzled by it - often, in 3D-VA, a mode whose particle population is roughly Gaussian-distributed along the latent coordinate will give a clearly bimodal class distribution when the same particles are put through 3D classification.
E.g. consider the attached - this is a continuous transition between two flexible states of a large protein.
Ran 3D-VA, used 3D-VA display in intermediates mode to reconstruct bins of equal population along the mode (see the sketch at the end of this post for what I mean by that). Used the resulting volumes as initial volumes for 3D-classification, with a low learning rate (0.01) so as not to allow the volumes to drift too much in any one iteration.
This results in a clear bimodal distribution in class population - I tend to believe this more than the implied distribution from 3D-VA (which would indicate that the transition state is the most populated), but I am trying to understand why this occurs.
Is there some prior towards a Gaussian distribution along the 3D-VA mode causing this behavior? If so, is there any way to tune this?
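To be concrete about what I mean by bins of equal population, here is a minimal numpy sketch - the z values are simulated here, but in practice they would be the per-particle latent coordinates output by 3D-VA:

```python
import numpy as np

# Hypothetical per-particle latent coordinates along one 3D-VA component
# (simulated here; in reality taken from the 3D-VA job outputs).
rng = np.random.default_rng(0)
z = rng.normal(size=100_000)

n_bins = 10
# Quantile edges give bins of (roughly) equal particle population, so each
# reconstruction is built from the same number of particles.
edges = np.quantile(z, np.linspace(0.0, 1.0, n_bins + 1))
bin_idx = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)

for b in range(n_bins):
    sel = bin_idx == b
    print(f"bin {b}: n = {sel.sum():6d}, z range = [{z[sel].min():+.2f}, {z[sel].max():+.2f}]")
```

Note that with a ~Gaussian distribution the bins at either end of the axis cover a much wider z range than the central ones, even though every bin holds the same number of particles.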
Following this with interest. I had naively assumed 3DVA-modelled distributions to be somewhat diagnostic. Grateful to be shown otherwise. I’m looking forward to finding out what exactly is happening under the hood in this scenario.
I’ve been wondering about this myself ever since I read the 3DVA paper in more detail:
Inspired by Roweis (1998), we formulate 3DVA as a form of Probabilistic PCA, assuming that data are drawn from a high dimensional Gaussian distribution, with Gaussian observation noise in Eq. (2) and a Gaussian prior over latent coordinates.
The example datasets in the paper don’t really follow a bimodal distribution and are more like a Gaussian with a long tail towards one end. Also, if I’m understanding the algorithm correctly, 3DVA generates a series of volumes by adding a difference volume to the consensus, scaling the difference volume by the latent coordinate along the axis. Does this mean that the particles assigned a latent coordinate of 0 closely resemble the consensus volume? Or does it just mean that these particles aren’t well described by adding the difference volume to the consensus?
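To make that last question concrete, this is how I picture the linear model (just a sketch of my own understanding, not cryoSPARC’s actual implementation):

```python
import numpy as np

def volume_at(z, consensus, components):
    """Sketch of my reading of the 3DVA linear model: the volume for a
    particle with latent coordinates z is the consensus plus each
    variability component scaled by the corresponding coordinate."""
    v = consensus.copy()
    for z_k, comp_k in zip(z, components):
        v = v + z_k * comp_k
    return v

# Under this picture a particle at z = 0 reproduces the consensus exactly -
# but that could mean either "this particle looks like the consensus" or
# "this particle isn't well explained by the component", which is my question.
```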
I’ve also been wondering if it’s safe to interpret the latent space produced by 3DVA as a conformational landscape. I’ve been learning how to use cryoDRGN recently, and their documentation implies that the PCA it runs on the generated volumes can be interpreted as a conformational landscape, but I don’t know if this applies to the PCA that 3DVA uses.
The thing is though - 3D-VA is capable of resolving bimodal/multimodal distributions, e.g. Fig. 9a of the 3D-VA paper:
The case in Fig. 9 is compositional heterogeneity (where the total mass changes), though. I wonder if that makes a difference.
I know that even for transitions where I expect a bimodal distribution (and see one, using Class3D), like open/closed states of an ion channel, 3D-VA tends to generate a ~Gaussian predicted population distribution.
Maybe I’m misinterpreting those plots, but wouldn’t those data look mostly Gaussian if you projected the particles onto either the x or y axis? The clusters only seem to be obvious when you plot one component against another.
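A quick toy example of the projection point (simulated points, nothing to do with the real data in the paper):

```python
import numpy as np

# Two clusters separated along the diagonal of components 1 and 2.
rng = np.random.default_rng(1)
a = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(5000, 2))
b = rng.normal(loc=[+1.0, +1.0], scale=1.0, size=(5000, 2))
pts = np.vstack([a, b])

hist_x, _ = np.histogram(pts[:, 0], bins=40)
hist_y, _ = np.histogram(pts[:, 1], bins=40)
# The 2D scatter shows two clear clusters, but each 1D histogram comes out
# as a single broad hump - the per-axis separation is only ~2 sigma, so the
# bimodality largely vanishes when you look at one component at a time.
```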
Oli, when you reconstruct equally-spaced bins of particles along the principal component axis (3DVA mode 1 from your original post), do the maps show a smooth, continuous motion in your protein?
For my own data, I did something similar to what you describe: I binned the data into 5 bins of equal population, so that each bin had ~20% of the particles. I was expecting the third bin to have the highest resolution because it spanned the shortest range along the principal component, which I assumed meant that this population of particles would show the least motion. Surprisingly, I got the complete opposite result: the corresponding map (after reconstruction and local refinement) ended up with the worst resolution and the most blurring in the flexible part of the protein, while the maps from bins 1 and 5 had the highest resolution and were the most conformationally homogeneous, despite spanning a much larger range than bins 2-4. I’m curious to know if you observe something similar.
I saw pretty much exactly the same as you describe with 5 bins - but with 10 bins, the reconstructions were comparable in resolution (although still slightly better, and more populated, at either end of the range)
Your particle distribution looks very Gaussian to me in latent space, with reasonable latent coordinate values, so I would expect a sensible description of a movement undergone by your protein.
In my hands, 3DVA is very useful to describe local movement of the protein around a central position, say in response to ligand binding or mutation. It is very useful and powerful when used to compare datasets, provided the sets of particles under scrutiny can be considered similar (similar number of particles, resolution, distribution in latent space).
Larger deformations of proteins can be difficult for 3DVA to capture, and that becomes the limitation of the method; non-linear interpretation of the latent space is probably the way to go there. On one system, I’ve been able to obtain a “half” movement but not the “full” deformation. Sorry I can’t say more here, but the question is how far we can go with this method, and it could be case-specific in terms of system size and the type of deformation expected. Some movements of the spliceosome are very impressive, for example.
For the example in Fig. 9 you show above, 3DVA is good at identifying clusters, which should then be separated and studied individually, but I doubt we can link the sub-states in a sensible movement.
Once you get a good movement by 3DVA, don’t forget to give phenix.varref a go to make more sense of it.
Best of luck.
Vincent
The example in Fig 9 is from the original 3D-VA paper
The example I showed in the original post is a sensible description of movement - my query was specifically regarding the discrepancy between population estimates from 3D-VA and 3D-classification.
Indeed, regarding your original question and the bimodal distribution after 3D classification: we don’t observe that, but rather many different cases where either we have one main species and multiple smaller ones, or equal distributions among species, or something in between. I’ve never seen your classification distribution. But you might be looking at a large complex, so the movements we observe are probably of a different nature, where you might be seeing something more discrete than homogeneous(?).
Have you tried classification with a low learning rate as described in the original post? Using the default learning rate and resolution, I find the initial volumes diverge too much, so it is harder to interpret the class distribution.
I haven’t tried systematically, but I just ran a test on a recent dataset of 98k particles with a moderate but definitely present movement - clearly a continuous deformation.
With learning rate=0.4: