How to improve the Ab-initio map quality

JJC · September 1, 2023, 8:59am

Hi everyone!

Here shows my 2D classification results. The quality of 2D seems good, but I can’t get the high quality map from Ab-initio job, the map has may wrong density., like some spike on the protein surface.

So, Is there any better Ab-initio parameters I can use?

Thaks a lot!

Das · September 2, 2023, 10:20am

For ab initio you do not need a high resolution map. Your particles seems to be too small and needs better alignment. You need to be cautious while doing select 2D jobs. If you select bad classes, you tend to get such maps. Better use default parameters and just tune initial resolution to 12-8A. Should work fine.

JJC · September 2, 2023, 11:50am

Thank you for replying this,

The structure has an Alphafold predicted model. I use the predicted model as the good reference and do the hetero-refinement, the particle after hetero then run the 2D classification. I nearly kept all the particles and do the Ab-initio. So I think the alignment might be good because I use the Alphafold predicted model?

And the other question is how to define the “bad classes” in 2D clasification?

Das · September 2, 2023, 2:03pm

Ideally you should not use model as it creates model bias. I would suggest that you use topaz train with only good particles that you have from blob picker. Following you can use topaz extract. After that extract particles in the right box size with fourier cropping and do iterative 2D classification untill you get clean particles classes. After that go for Ab initio with 3 or more classes. Do put initial starting resolution as i advised. At this resolution you’ll be able to get some good initial model and remove junk. Then do a homogeneous refinement using default settings and use the high resolution model for heterogeneous sorting untill there are no bad classes coming up.
Best.

JJC · September 7, 2023, 8:49am

Thanks a lot!

I will try the method you said.

rwaldo · September 27, 2023, 8:56pm

Hi @JJC! Welcome to the discussion forum and thanks for your post! I know this was from a while ago and you may have solved your problem already, but I thought I’d contribute a bit of background information to help you or anyone else who stumbled upon this question.

Those 2D classes do look promising! I wonder if you noticed though that along the edges of the class averages, there are thin spikes coming off your protein — kind of like a 2D version of the spikes you see in your ab initio model. The background region far away from the particles also has a texture to it, kind of like a waviness. Both of these are signs of overfitting.

Overfitting

Overfitting occurs when we don’t have enough true signal in our images, and so the algorithm starts to align noise instead of the particles. This is basically unavoidable — there will always be images or parts of images without signal in them, but the noise is everywhere. For a classic example of what can happen as a result of fitting to noise, check out the seminal paper from Richard Henderson. This is also why @Das recommended you don’t use your alphafold model to align your particles.

So now the question is, how can we be sure that what we see in our cryoEM maps is real, and not Einstein-from-noise? The field has settled on using something called Gold Standard Fourier Shell Correlation. This is quite a mouthful, and so you will often see it abbreviated as GSFSC or simply FSC.

For GSFSC we split the particles into two half sets. These half sets are used to create two copies of the map, but importantly the particles are never mixed up. This means we have two completely independent estimates of the 3D shape of your target.

The clever step is that we can compare these two structures at every resolution (see note at bottom). Provided that our particle stack is clean, we know that each half set should look exactly identical — there’s no reason that half-set A will have a different particle than half-set B. However, since noise is random, we expect the noise to be completely different in the two half-sets. Put another way: true signal should be highly correlated, while noise should be completely uncorrelated.

What we do is we look at the correlation for all of the different resolutions in our map. We then filter out (i.e., remove) the resolutions where correlation dips below some threshold. The choice of this threshold is somewhat arbitrary — the field has settled largely on looking at where the correlation drops below 0.143 and calling that the resolution. If you are interested in why 0.143 was selected, the Appendix of this paper lays out the (somewhat involved) argument.

How is this useful?

I’m telling you all of this because it’s important to know which jobs obey these rules and which jobs don’t. For example, 2D classification and ab initio do not use half-sets! This means they are susceptible to over-fitting and model bias. For this reason, they are generally treated as tools to quickly assess data quality or clean up junk particles, but once you have something that looks okay we recommend that users move on to jobs that do obey the half-set rule.

For instance, in your case, I would recommend you try a Homogeneous Refinement. This is a job that creates a single map of your particles, like ab initio, but it obeys the half-set rules. You might see those spikes disappear, or at least get a lot smaller and less intrusive. Homogeneous Refinement has some other tricks to reduce overfitting and can give quite nice results.

One more recommendation

Overfitting can also often be reduced by cleaning up your particle stack. The Heterogeneous Refinement job is really useful for this task. I bet you’d be pleased with the results of plugging in your particles from the ab initio job and two or three copies of your ab initio map. Hopefully you’ll end up with one cleaner stack of particles and two “junk” stacks which you can discard.

I put this recommendation in a separate section because Heterogeneous Refinement is another job that does not obey the gold standard half-set rules

I hope this all helps! I know that’s a ton of information, but I hope you will find it useful as you keep processing data, and we’re always here and happy to help with questions you have further down the line!

Note: More precisely, we’re comparing spatial frequencies. In 3D Fourier space, a given spatial frequency is represented by a shell. Since we’re calculating the correlation of all of these spatial frequencies, we are calculating the Fourier Shell Correlation — the FSC!

JJC · December 17, 2024, 2:56am

Thanks a lot ! The question has been solved by collecting some tilt 45° movies.