Output particles in "intermediates" mode of 3D-variability display?

Hi,

In the release notes for 2.12, it is stated that the “intermediates” mode of 3D-variability display should output particle subsets for further refinement. It doesn’t right now (though cluster mode does) - would it be possible to add?

Also, it is currently not possible to set the “rolling window” in this mode to 0, which would be useful for evenly splitting up particles along a given component without overlap between groups.
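
(For reference, the even, non-overlapping split would amount to something like this numpy sketch; the group count and stand-in data are hypothetical:)

import numpy as np

x = np.random.randn(100000)  # stand-in for per-particle amplitudes along one component

# Sort by amplitude and split into equal-size, non-overlapping groups
groups = np.array_split(np.argsort(x), 5)  # 5 groups is a hypothetical choice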

Cheers
Oli

Hi @olibclarke, v2.13 now allows width=0 to use tophat (i.e. non-overlapping) windows, but the job still doesn’t output the particle subsets like it was supposed to… this will be added.

In a pinch this can be done by manually creating subsets of the particles.cs file from the 3D var job output, and then importing those using the Import Result Group job.

Thanks @apunjani - I’m still not completely clear on how to “manually create subsets of the particles.cs file from the 3D var job output” - sorry for being dim; do you have an easy-to-follow guide to this?

Cheers
Oli


I was able to do it, though it was sort of tricky to figure out. I have a 1E6-particle dataset that’s going to 2.4 Å. I tried taking the 2**18 (262k) nearest neighbors of the centroid of the embedding and refining those, but it actually got worse. The embedding looks like a big Gaussian, so I have to think some more about why this didn’t work.

I can share a script and details if you don’t want to wait for the next update.

Thanks Daniel, a script would be very helpful! In this case I just want to select particles in a chosen amplitude range along one eigenvector - more for classification of different conformations than for improvement of resolution.

Cheers
Oli

Here’s what I did. In your case you can just use idx = (x > lo) & (x < hi) to index the cs and pt arrays instead of the whole KNN thing I’m doing.

import numpy as np
from scipy.spatial.distance import cdist

# Load the particle metadata and passthrough results from the 3D var job
cs = np.load("P16/J205/cryosparc_P16_J205_particles.cs")
pt = np.load("P16/J205/P16_J205_passthrough_particles.cs")
# Per-particle amplitudes along the three variability components
x, y, z = cs['components_mode_0/value'], cs['components_mode_1/value'], cs['components_mode_2/value']
X = np.array([x, y, z])
# Distance of every particle from the centroid, then its 2**18 nearest neighbors
mu = np.mean(X, axis=1)
vsmu = cdist(np.atleast_2d(mu), X.T)
knn_2p18 = np.argsort(vsmu)[0, :2**18]
# Pass file objects so numpy doesn't append .npy to the .cs names
with open("P16_J205_particles_muknn_2p18.cs", "wb") as f:
    np.save(f, cs[knn_2p18], allow_pickle=True, fix_imports=True)
with open("P16_J205_passthrough_particles_muknn_2p18.cs", "wb") as f:
    np.save(f, pt[knn_2p18], allow_pickle=True, fix_imports=True)

Then cp P16_J205_particles.csg P16_J205_particles_muknn_2p18.csg and edit the new file to point to your new .cs files. You can now import this .csg file as a results group.
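
For the amplitude-range case, the same recipe with a mask instead of the KNN (lo and hi are hypothetical cutoffs you’d pick from the embedding):

import numpy as np

cs = np.load("P16/J205/cryosparc_P16_J205_particles.cs")
pt = np.load("P16/J205/P16_J205_passthrough_particles.cs")

# Keep particles inside a chosen amplitude window along component 0
lo, hi = -2.0, 2.0  # hypothetical cutoffs
idx = (cs['components_mode_0/value'] > lo) & (cs['components_mode_0/value'] < hi)

# Pass file objects so numpy doesn't append .npy to the .cs names
with open("P16_J205_particles_range.cs", "wb") as f:
    np.save(f, cs[idx], allow_pickle=True, fix_imports=True)
with open("P16_J205_passthrough_particles_range.cs", "wb") as f:
    np.save(f, pt[idx], allow_pickle=True, fix_imports=True)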


Fantastic - thank you!

No worries! It took me several tries to get the .csg file edited right, so go back to that if it fails the first time. I just noticed the greater-thans were swapped above and edited that, so also double check those if you copy-and-pasted.

I wanted to follow up on this thread as I’ve been having the same problem. Have you guys found a clear solution to extracting particles that relate to each intermediate? I had some trouble with the script and was wondering if there has been a different way to approach this issue.

With the .csg files? It’s far easier at the moment to filter and re-import a converted .star file than this .csg thing.

from pyem import star

df = star.parse_star("file.star")
# Write the subset under a new name so the original isn't overwritten
star.write_star("file_subset.star", df.loc[<some expression of cs['components_mode_0/value']>])
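
Concretely, assuming file.star was converted from the same job (e.g. with csparc2star.py) so its rows line up with the .cs records, and with hypothetical cutoffs:

import numpy as np
from pyem import star

cs = np.load("cryosparc_P16_J205_particles.cs")
df = star.parse_star("file.star")

# Boolean mask over amplitudes along component 0 (cutoffs are hypothetical)
x = cs['components_mode_0/value']
star.write_star("file_subset.star", df.loc[(x > -2.0) & (x < 2.0)])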

@apunjani also following up on your 3DVar preprint, would you mind sharing some more details on your 1D VAE embedding, especially the loss function? I’m trying the same approach and I get pretty different results with RMS error alone vs. RMS error + KL divergence.


Hi @DanielAsarnow,

We’ll make better docs for using .cs and .csg files soon 🙂

Regarding the VAE: good questions. I mostly did this as a proof of principle - an easy visualization showing that the 3DVA embedding captures many discrete clusters, even more than the number of 3DVA dimensions in this case. I wouldn’t recommend it generally for actually separating clusters - if they really are separable, the GMM operating directly in reaction-coordinate space works better (that’s what cluster mode of the 3D Var Display job does). But for more details:

  • I also found that tweaking the relative weighting between the RMS error and the KL divergence of the VAE latent-space prior made a big difference in the results. Here the latent space is low-dimensional (1 or 2 dims), and we don’t really want the VAE to learn to spread out clusters so that the latent space is “full”/approaching a Gaussian distribution - we actually want the distribution of particles in the latent space to be as multimodal as necessary, rather than unimodal. So I set the relative weight of the KL term to 0.01, and it mostly just serves to keep the VAE latent embeddings bounded (see the loss sketch after this list)
  • Given the amount of data and the very low-dimensional latent space, omitting the KL term entirely also gave good clustering of the conformations, but runaway values of the latent embeddings
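
A minimal sketch of that loss in code (the 0.01 KL weight is the value above; using mean-squared error for the reconstruction term, and the exact function shape, are assumptions):

import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar, kl_weight=0.01):
    # Reconstruction (RMS-type) error in reaction-coordinate space
    rec = F.mse_loss(recon_x, x, reduction='mean')
    # KL divergence of the latent posterior from a unit Gaussian prior;
    # the small weight mostly just keeps the embeddings bounded
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kld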

Here’s the network architecture I used (super simple):

import torch
import torch.nn as nn
import torch.nn.functional as F

n_components = 3  # number of 3DVA components (set to match your job)

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()

        dim = n_components
        D = 1024

        # Encoder: component values -> hidden layer -> 1D latent (mu, logvar)
        self.fc1 = nn.Linear(dim, D)
        self.fc2_mu = nn.Linear(D, 1)
        self.fc2_logvar = nn.Linear(D, 1)
        # Decoder: 1D latent -> hidden layer -> reconstructed component values
        self.fc3 = nn.Linear(1, D)
        self.fc4 = nn.Linear(D, dim)

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc2_mu(h1), self.fc2_logvar(h1)

    def reparameterize(self, mu, logvar):
        # Sample z ~ N(mu, sigma^2) via the reparameterization trick
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        return self.fc4(h3)

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, n_components))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

Trained with Adam at a learning rate of 1e-2 and a training batch size of 1024.
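
A hypothetical end-to-end sketch of those settings, using the VAE class and vae_loss above (the epoch count and the random stand-in data are assumptions; data would really be the per-particle 3DVA component values):

import torch

data = torch.randn(100000, n_components)  # stand-in for the 3DVA embedding

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(data), batch_size=1024, shuffle=True)

for epoch in range(20):  # epoch count is an assumption
    for (batch,) in loader:
        opt.zero_grad()
        recon, mu, logvar = model(batch)
        loss = vae_loss(recon, batch, mu, logvar, kl_weight=0.01)
        loss.backward()
        opt.step()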

It’s not too sensitive to parameters, but each run does result in the clusters being arranged in a different order on the 1D embedding.


@apunjani Thanks a ton, I’ll let you know how it goes for me with tweaks based on what you posted. The reason I’m still interested in this approach is that my cases still have a lot of outlier particles that interfere with GMM-type clustering. I’ll also try a density-based approach (probably OPTICSXi, not DBSCAN because the cluster densities are not constant).

@apunjani Another couple of 3DVar questions.

  1. Can I make a model for a hypothetical particle simply by multiplying the component values by the component volumes, summing them, and adding the result to the map from the 3DVar job?
  2. Do the means of the component values give the relative weight of the components? Is there a way to calculate the % variance explained by each component?

Update

Answer to 1) is yes - the formula is indeed V0 + Σi zi Vi (the base map plus each component volume scaled by its component value), just as in the imaging model from the 3DVar preprint, equation 4 (the part in parens).
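
In code, with hypothetical file names for the base map and component volumes:

import numpy as np
import mrcfile

# Hypothetical output names from the 3D var job
with mrcfile.open("J205_map.mrc") as f:
    v = f.data.astype(np.float32).copy()

# Chosen component amplitudes z_i (values here are hypothetical)
for i, z_i in enumerate([1.5, -0.3, 0.2]):
    with mrcfile.open(f"J205_component_{i}.mrc") as f:
        v += z_i * f.data

with mrcfile.new("synthetic_state.mrc", overwrite=True) as out:
    out.set_data(v)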

I’m not quite sure about 2). The magnitude of the components seems to be the variance of the component values, but I’m not sure how to find the total variance.
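
For what it’s worth, the variances of the solved components can at least be compared against each other - a sketch (this gives the fraction among the computed components only, not of the total data variance):

import numpy as np

cs = np.load("cryosparc_P16_J205_particles.cs")

# Variance of the particle amplitudes along each solved component
v = np.array([np.var(cs[f'components_mode_{i}/value']) for i in range(3)])
print(v / v.sum())  # relative share among the solved components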