Masking out glycans

@lizellelubbe,
Approximately how many particles did you have when you did the ab initio with 4 classes? Also, how many for the final NU refinement?

I had roughly 400k particles before splitting into 4 classes with ab initio and hetero refine. The best class had around 130k particles, and these were used for NU refine. But choosing 4 classes was a guess and I may still need further subclassification. I am hoping to get insight into the degree of heterogeneity remaining within the current classes by doing 3DVA.

Ok!

I tried running NU refine with expanded mask, but unfortunately it did not seem to help.

This is what the ab initio model looks like. To me, it definitely is similar to the crystal structure, which comprises 80% of the structure used for cryo EM. I think especially the beta-supersandwich domain is recognizable. Please let me know what you think.


The unsharpened map after NU refine

The sharpened map after NU refine

I tried playing around with different sharpening values, but when the “artefact shell” starts to disappear, so does the protein core.

I am busy with my first ever cryo-EM structure and am unfortunately not an expert. Have you tried homogeneous refinement instead of NU refine using your current particle stack and initial volume? I am just wondering if the shell you see is specific to the NU job type. Do you know where the glycans are expected?

Are you using tilt, defocus refinement, ctf refinement etc during NU refine?

Well, either way your input is helping out!

I did try homogeneous refinement and the shell is there, but much less clear. But homogeneous refinement also gives substantially worse results and terrible density maps, so it is a bit hard to determine.
[FSC plot: P19_J415_fsc_iteration_005_after_fsc_mask_auto_tightening]

Yes, we have made MS glycopeptide analysis and the position of the “density cones” makes sense.

No, I have not tried any of those settings. I will see if it makes a difference.

Thanks!

I think that enabling those settings may give worse results for smaller proteins so I have disabled them and only refined. But I don’t know where your shell artifact comes from. Your initial model looks good to me

Hi @emil,

Did you ever manage to refine your glycoprotein? You probably finished a long time ago, but I just wanted to mention what worked for me in the end, to help anyone facing a similar problem with glycans.

I first tried dilating the mask by 6 or 8 with padding of 12 during NU Refine. This seemed to work at first, but then I noticed strong streaks of density near N-glycan sites. It seemed like the mask near the glycans caused overfitting there and a decrease in protein resolution. It was even worse after local refinement, even when using a static mask. Dilating more (up to 20+, way past the point of any observable glycan density), different lowpass filters, etc. didn’t help. I then tried varying all the settings in both NUR and local refine again, and the only thing that worked was to use a mask that extended just beyond the protein density (using a dilation of 6, thus cutting through the N-glycans) but then padding by 20 or 30 to have a very, very soft edge. This gave me really nice refinement of the protein and good enough glycan density to allow building of the core fucosylated pentasaccharide. I’m not sure why dilating to cover the glycans caused overfitting streaks while dilating for the protein only and using wide padding for the glycans worked. Maybe someone else can offer an explanation?
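To make the "low dilation, wide padding" geometry concrete, here is a minimal NumPy sketch of a mask with a hard core and a cosine soft edge. It is only an illustration under stated assumptions, not how cryoSPARC builds its masks: the function names `dilate_once` and `soft_mask` are hypothetical, distances are counted in voxels (not Å), and the distance from the thresholded density is a city-block approximation obtained by repeated one-voxel dilation.

```python
import numpy as np

def dilate_once(mask):
    """Grow a boolean 3D mask by one voxel along each axis (6-neighbourhood)."""
    # Pad with False so that np.roll wraps background, not density, around.
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = padded.copy()
    for axis in range(3):
        out |= np.roll(padded, 1, axis=axis)
        out |= np.roll(padded, -1, axis=axis)
    return out[1:-1, 1:-1, 1:-1]

def soft_mask(volume, threshold, dilation, padding):
    """Hypothetical helper: binarise, dilate, then add a cosine soft edge.

    dilation and padding are in voxels (an assumption of this sketch).
    """
    core = volume > threshold
    # Approximate (city-block) distance of every voxel from the thresholded core.
    dist = np.full(volume.shape, np.inf)
    dist[core] = 0.0
    current = core
    for step in range(1, dilation + padding + 1):
        grown = dilate_once(current)
        dist[grown & ~current] = step
        current = grown
    mask = np.zeros(volume.shape, dtype=np.float32)
    mask[dist <= dilation] = 1.0  # hard part of the mask
    edge = (dist > dilation) & (dist <= dilation + padding)
    # Raised-cosine falloff from 1 (at the dilated surface) to 0 (at the far edge).
    mask[edge] = 0.5 * (1.0 + np.cos(np.pi * (dist[edge] - dilation) / padding))
    return mask

# Toy density: a small cube in the middle of a 64^3 box.
volume = np.zeros((64, 64, 64), dtype=np.float32)
volume[30:34, 30:34, 30:34] = 1.0
mask = soft_mask(volume, threshold=0.5, dilation=6, padding=20)
```

With `dilation=6` and `padding=20` the mask is 1 up to 6 voxels from the thresholded density and then decays smoothly to 0 over the next 20 voxels, so density just outside the hard part (here standing in for the glycans) is down-weighted rather than cut off sharply.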

Hi @lizellelubbe,

Thanks a lot for getting back to me! I have actually not come much further and I will definitely try your method. Just to be sure, are you then just changing the “Dynamic mask near” and “Dynamic mask far” to e.g. 6 and 26, respectively? No other non-default settings?

Once I had an ab initio model that looked reasonable, I set up non-uniform refinement as below. It is dependent on the dataset though and I cannot guarantee that it’ll work for you. The padding and threshold had to be altered slightly for some of my other particle stacks (the dataset was heterogeneous) but in general low dilation and high padding worked. I also made sure that the particle stack didn’t have any duplicated particles before doing NU refine, otherwise the FSC curves didn’t drop down to zero. With local refine after this (in case you need it) I used the static mask option and created my own mask as input with similar padding as for NUR. If I used the dynamic mask option in local refine, glycan overfitting was introduced again. My alignment parameters were also set to search locally around the NUR values.
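The duplicate check mentioned above can be pictured as a simple distance filter on particle picks. The sketch below is only an illustration of that idea and not cryoSPARC's duplicate-removal job: the function name, the input layout (one micrograph ID and one x/y coordinate per particle) and the `min_dist` value are all assumptions.

```python
import numpy as np

def keep_unique_particles(micrograph_ids, xy, min_dist):
    """Return indices of particles to keep (hypothetical helper).

    Within each micrograph, a pick is dropped if it lies closer than
    min_dist (in pixels) to a pick that has already been kept.
    """
    micrograph_ids = np.asarray(micrograph_ids)
    xy = np.asarray(xy, dtype=float)
    keep = []
    for mic in np.unique(micrograph_ids):
        kept_here = []
        for i in np.flatnonzero(micrograph_ids == mic):
            # Keep particle i only if it is far enough from every kept pick.
            if all(np.hypot(*(xy[i] - xy[j])) >= min_dist for j in kept_here):
                kept_here.append(i)
        keep.extend(kept_here)
    return sorted(int(i) for i in keep)

# Toy example: picks 0 and 1 sit almost on top of each other in micrograph 0.
ids = [0, 0, 0, 1]
coords = [[10, 10], [12, 11], [100, 100], [10, 10]]
kept = keep_unique_particles(ids, coords, min_dist=20)
```

Pick 3 has the same coordinates as pick 0 but belongs to a different micrograph, so it is kept; only pick 1 is discarded.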

My settings:


Defocus refine and global CTF refine were switched off.

Hope this helps somewhat!

It definitely helps. Thanks!

Can you please also just briefly comment on if/why the following settings helped out: “ignore tilt”, “ignore trefoil”, “ignore tetra” and “minimize over per-particle scale”

I didn’t choose to refine the higher-order aberrations as my particle is small and flexible (tried local CTF refinement before and it didn’t give good results). Tutorial: CTF Refinement - CryoSPARC Guide

Ok, I see. Thanks again!

Hi @lizellelubbe,

In my experience, soft padding is the most important property/parameter of masks used for refinement. It matters most for local refinements, where the mask typically excludes portions of the structure and not just the solvent. When working with small masks, I’ve observed phenomena similar to those pointed out here. We’ve updated our local refinement guide page with some specific notes/suggestions on mask padding for datasets.

I think that, in part, the underlying explanation of your observations is due to signal processing issues. If the volume is thought of as a discrete 3D signal, then the application of a mask to the volume can be thought of as windowing the signal in order to exclude regions that we are not interested in (windows are applied to a signal via multiplication, just like masks). In all refinements that follow the gold-standard FSC method of regularization and resolution assessment, we must assume that the Fourier coefficients with frequency larger than the initial lowpass resolution have shared signal corrupted by independent noise. The problem with masks is that they break that last assumption: using a common mask means that the noise in both half maps (after masking) is not independent. This compromises our ability to separate signal from noise, and hence, to reduce overfitting.
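A toy numerical experiment can illustrate how a shared mask puts common content into high-frequency shells. The sketch below is not cryoSPARC's FSC code and all of its choices are assumptions: two "half maps" share a smooth Gaussian blob (which has no real signal at high frequency) plus independent white noise, and the FSC is measured in a single high-frequency shell without a mask, with a hard spherical mask that cuts through the blob, and with a soft-edged version of the same mask.

```python
import numpy as np

def radial_grid(n):
    """Distance of every voxel from the box centre, in voxels."""
    ax = np.arange(n) - n // 2
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2)

def spherical_mask(r, radius, soft_width):
    """Sphere of the given radius with an optional raised-cosine edge."""
    mask = np.zeros_like(r)
    mask[r <= radius] = 1.0
    if soft_width > 0:
        edge = (r > radius) & (r < radius + soft_width)
        mask[edge] = 0.5 * (1.0 + np.cos(np.pi * (r[edge] - radius) / soft_width))
    return mask

def shell_fsc(a, b, k_lo, k_hi):
    """Fourier shell correlation of two volumes over one shell k_lo <= |k| < k_hi."""
    fa, fb = np.fft.fftn(a), np.fft.fftn(b)
    n = a.shape[0]
    freq = np.fft.fftfreq(n) * n  # integer frequency index along each axis
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    k = np.sqrt(kx**2 + ky**2 + kz**2)
    shell = (k >= k_lo) & (k < k_hi)
    num = np.real(np.sum(fa[shell] * np.conj(fb[shell])))
    den = np.sqrt(np.sum(np.abs(fa[shell])**2) * np.sum(np.abs(fb[shell])**2))
    return num / den

n = 48
r = radial_grid(n)
rng = np.random.default_rng(0)

# Shared smooth "signal" plus independent unit-variance noise in each half map.
signal = 50.0 * np.exp(-r**2 / (2 * 10.0**2))
half_a = signal + rng.standard_normal((n, n, n))
half_b = signal + rng.standard_normal((n, n, n))

hard = spherical_mask(r, radius=14, soft_width=0)  # cuts through the blob
soft = spherical_mask(r, radius=14, soft_width=8)  # same core, soft edge

# High-frequency shell where the unmasked blob itself has essentially no power.
fsc_unmasked = shell_fsc(half_a, half_b, 16, 24)
fsc_hard = shell_fsc(half_a * hard, half_b * hard, 16, 24)
fsc_soft = shell_fsc(half_a * soft, half_b * soft, 16, 24)

print(f"unmasked: {fsc_unmasked:.3f}  hard mask: {fsc_hard:.3f}  soft mask: {fsc_soft:.3f}")
```

In this toy setup the hard-masked half maps correlate strongly in the high-frequency shell even though the only thing they share there is the sharp mask edge cutting through the blob, whereas the unmasked and soft-masked pairs stay close to zero. Real data, box sizes and mask shapes will of course behave less cleanly.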

Based on the convolution theorem, the severity of this violation is directly related to the Fourier-space properties of the mask. In short, the more slowly the DFT of the mask falls off over frequency, the worse the violation will be. For example, a rectangular mask (i.e. one with no soft padding, regardless of dilation) has very slow falloff in Fourier space:

(from Wikipedia). On the other extreme, the Hann window (i.e. a “cosine” window) has much faster falloff:

The closer the mask is to a Hann window (i.e. the softer the falloff in real space), the more the noise in each half-map remains independent after masking, and thus the more reliably we can detect resolution and limit overfitting. In practice, this means that any GSFSC-based method will require trading off precision in real space (how well the mask is focused on the particular domain of interest) and precision in Fourier space (required to prevent overfitting). Heavily prioritizing real-space precision leads to overfitting and artefacts, but heavily prioritizing precision in Fourier space means the refinement is no longer focused on a specific domain of the structure. Right now, this trade-off must be considered for each refinement, but we do have a helpful rule of thumb on the local refinement job page linked above that can be used as a starting point for a good softness level.
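The difference in Fourier-space falloff between a rectangular and a Hann window is easy to check numerically in 1D. The following is a small sketch; the window length, the zero-padding factor and the "far from the main lobe" cutoff are arbitrary choices made for illustration.

```python
import numpy as np

n = 64        # window length in samples
pad = 4096    # zero-padded FFT length, for a finely sampled spectrum

rect = np.ones(n)     # hard-edged window (no soft padding)
hann = np.hanning(n)  # cosine-shaped window

def spectrum_db(window):
    """Magnitude spectrum in dB, normalised to its peak."""
    spec = np.abs(np.fft.rfft(window, pad))
    return 20 * np.log10(spec / spec.max() + 1e-12)

rect_db = spectrum_db(rect)
hann_db = spectrum_db(hann)

# Look only at frequencies well outside the main lobe of both windows.
far = 5 * pad // n
rect_far = rect_db[far:].max()
hann_far = hann_db[far:].max()

print(f"highest far sidelobe, rectangular: {rect_far:.1f} dB")
print(f"highest far sidelobe, Hann:        {hann_far:.1f} dB")
```

Away from the main lobe, the rectangular window's sidelobes stay tens of dB above those of the Hann window, which is the 1D analogue of a hard-edged mask spreading more content across frequencies than a soft-edged one.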

Best,
Michael


Thanks for the very clear and detailed explanation @mmclean and for updating the tutorial page, I really appreciate it!

Hi!

There has been no answer on this topic for a long time, so I will try to bring up the problem of highly glycosylated proteins.

I work with a rather monomeric glycoprotein of ~110-120 kDa with a completely unknown structure (nothing in the PDB). Apart from the amino acid sequence, only the general domain organization at the sequence level is known, i.e. how many domains there should be. Glycans are known to constitute up to 42% of the protein mass, including sialic acid; the exact locations of all glycans and their lengths are unknown.

Evidently, glycans strongly mask the protein core and so far I have not been able to visualize the secondary structures of the protein core. Depending on the 2D Classes, the Ab initio, and Refinement settings, slightly different maps are generated, including some of the settings described here.

Maybe I can take a similar approach here to the one in the paper by Lubbe et al. (congrats, great work!) to deal with the glycans.

EMBO J 2022 41(16):e110550.

doi: 10.15252/embj.2021110550

I’ve been working on it since the beginning of last summer when I got my Krios results.


Hi michpon,

It can definitely be a struggle. What non-default parameters did you use for 2D classification, ab initio and refinement? Can you share examples of 2D classes? Have you tried the new 3D classification job?

Also, can you get any leads from the AlphaFold model? Does it have good confidence metrics?

Hi Emil,

thank you for your quick response. The map images presented in the previous post (J151 & J571) were from the same particle extraction (box 256). The classification leading to J151 (after sharpening from J149) was with default settings, from homogeneous refinement job J149.
J151 sharpen from J149: This is J149:

Several classes were chosen for it.

The J571 (second picture from my previous post) had force non-negative during 2D classification and only one class was chosen. Files are attached for it:






As far as I remember, I did not run the new 3D classification job.

AlphaFold was able to fold only the first 3 N-terminal domains, with several beta-sheets and one alpha helix per domain (attachment; the yellow N-terminal helix is the signal peptide, which is redundant in the native protein). The rest is floating unfolded.

I produced some I-TASSER models of all the separate domains (based on sequence fragments), and for the N-terminal domains D1-D3 they look similar to those from AlphaFold. The remaining C-terminal domains are more disordered in I-TASSER, although some have fragments of secondary structure.
I-TASSER can’t handle the whole protein sequence well enough to give something that looks OK… Good threading templates are missing.

I am writing this now based on the results from cryoSPARC 3.2 saved locally. I generated a few more different maps with the setup attempts described on this topic, but as far as I can remember, none of them look any better than what is shown here.

I will try to delve into the procedure used in lizellelubbe’s publication (EMBO J 2022 41(16):e110550) and recreate it for my protein.

I’m slowly switching to the cryoSPARC 4 versions; maybe there are some cool new options out there.

Best
Michał.

A few things to try that have helped me:

2D classification:
Keep your settings, but turn off “force max over poses/shifts” and “enforce non-negativity”. “force max over poses/shifts” off will make the job run slower, but usually yields better and more diverse classes for small proteins. Non-negativity does not seem to help you in this case, but rather creates artificial classes.
Also try using a circular mask slightly wider than your max particle diameter.

Ab initio:
Increase the maximum and initial resolution to something like 6/15 Å. This will make the job take significantly longer to finish, but is sometimes necessary to yield reasonable ab initio structures of small proteins without larger distinguishable features.
Also increase the initial and final minibatch size to 300/1000. I’ve sometimes increased to as much as 1000/5000, but then the job takes forever. :slight_smile:

Refinement:
Decrease the “initial lowpass filtering” to something like 12 Å to “keep” more features from the ab initio structure.

Good luck!

Thank you, emil, for your advice.

Here is the Chimera-generated map (min/max level) of my predicted protein model, partially based on AlphaFold (3 good domains) and I-TASSER (the remaining domains, which are predicted to have no secondary structure and are the most highly glycosylated, containing most of the known sites described in UniProt). The entire structure should form a beaded necklace because there is a disulfide bridge between domains 1 and 6. I produced it by joining the domain models in Chimera and optimizing the interdomain connections in Foldit Standalone. The Ramachandran plot looks good.



I fitted some of the new cryoSPARC maps made according to emil’s settings, and together it looks like this:

The maps were fitted by shape because the secondary structures are still not visible, even for the domains that are well folded into beta sheets and 1 alpha helix, so there is not much to match exactly. I think I need to filter the particles better, but before that I can redo motion correction with MotionCor2 and start all over again, picking several times to filter the particles. The current results are from patch motion.

The last image is upside down :blush: