Problem with 150 kDa protein complex cryo-EM data processing

yuzgad · October 23, 2025, 10:21pm

Dear all,

I am writing about a problem we encountered during the reconstruction of a 150 kDa protein complex using CryoSPARC.

In total, around 19,000 movies were collected, and approximately 1.5 million particles were picked using Topaz Train, resulting in the 2D classes shown below. The 2D classes appear to show secondary structure details, but also indicate a preferential orientation of the particles.

The core of our complex is about 120 Å in diameter and consists of a tetramer of one protein, along with up to four molecules of another protein. After 2D classification and ab-initio reconstruction, we performed particle cleanup, resulting in ~500,000 particles after the first particle sorting.

Using these particles, we performed an initial non-uniform (NU) refinement, which produced an FSC curve showing a bump around 4–5 Å.
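For readers who want to inspect such a curve outside CryoSPARC, the following is a minimal sketch of a Fourier shell correlation between two half-maps. It assumes the half-maps are already loaded as cubic NumPy arrays on the same grid (for example from MRC files) and applies no mask; the function names `fsc_curve` and `resolution_at` are illustrative and not part of any CryoSPARC API.

```python
import numpy as np

def fsc_curve(half_a, half_b, pixel_size):
    """Return (spatial frequency in 1/Å, FSC) per Fourier shell for two cubic half-maps."""
    if half_a.ndim != 3 or half_a.shape != half_b.shape or len(set(half_a.shape)) != 1:
        raise ValueError("expected two cubic 3D maps of identical shape")
    n = half_a.shape[0]

    # Centered Fourier transforms: zero frequency sits at index n // 2 on every axis.
    fa = np.fft.fftshift(np.fft.fftn(half_a))
    fb = np.fft.fftshift(np.fft.fftn(half_b))

    # Integer shell index (radius in Fourier pixels) for every voxel.
    grid = np.indices(half_a.shape) - n // 2
    shell = np.rint(np.sqrt((grid ** 2).sum(axis=0))).astype(int)

    n_shells = n // 2  # ignore the corners beyond the Nyquist radius
    inside = shell < n_shells
    idx = shell[inside]

    # Per-shell sums of the cross term and of each half-map's power.
    cross = np.bincount(idx, weights=(fa * np.conj(fb)).real[inside], minlength=n_shells)
    power_a = np.bincount(idx, weights=(np.abs(fa) ** 2)[inside], minlength=n_shells)
    power_b = np.bincount(idx, weights=(np.abs(fb) ** 2)[inside], minlength=n_shells)

    denom = np.sqrt(power_a * power_b)
    fsc = np.divide(cross, denom, out=np.zeros_like(cross), where=denom > 0)
    freq = np.arange(n_shells) / (n * pixel_size)
    return freq, fsc

def resolution_at(freq, fsc, threshold=0.143):
    """Resolution (Å) at the first shell where the FSC drops below the threshold, or None."""
    below = np.nonzero(np.asarray(fsc) < threshold)[0]
    if below.size == 0 or below[0] == 0:
        return None
    return 1.0 / freq[below[0]]
```

Plotting `fsc` against `1 / freq` (skipping shell 0) makes it easy to see where along the resolution axis a bump sits.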

Upon closer inspection, only the central part of the map shows reasonable density, while the left and right parts appear somewhat random, featureless, and fragmented, which prevents reliable model building.

We then attempted further particle sorting using heterogeneous refinement, orientation diagnostics, and orientation rebalancing, but obtained essentially the same result. The density did not improve even when changing the box size (from 300 px to 440 px, pixel size 0.73 Å).
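As a quick sanity check on these numbers, the sketch below converts the two box sizes to Å and compares them with the ~120 Å core diameter quoted above. It uses only values from this post; the helper name `box_summary` is illustrative.

```python
def box_summary(box_px, pixel_size_A, particle_diameter_A):
    """Physical box extent, box-to-particle ratio, and Nyquist limit for one box size."""
    extent_A = box_px * pixel_size_A
    return {
        "box_extent_A": extent_A,
        "box_to_particle_ratio": extent_A / particle_diameter_A,
        "nyquist_A": 2 * pixel_size_A,  # finest resolvable spacing at this sampling
    }

for box_px in (300, 440):
    print(box_px, box_summary(box_px, pixel_size_A=0.73, particle_diameter_A=120.0))
```

At 0.73 Å/px the boxes span about 219 Å and 321 Å, roughly 1.8 and 2.7 times the core diameter, with a Nyquist limit of 1.46 Å in both cases.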

We also tried various NU refinement strategies, including reducing the adaptive window factor (AWF) to 1.5 and enabling or disabling the dynamic mask (with a dynamic mask start resolution of 1 Å), but none of these approaches significantly improved the map quality. Although we know the overall architecture of the complex, the fit to the density is poor, and even recent map–model building tools such as CryoAtom were not helpful.

We suspect that the issue might be related to flexibility between parts of the complex, but we are unsure how to properly validate this hypothesis. We also tried performing a 3DFlex refinement, but it seems that either the motion was not recognized or we may not have set up the job correctly.

We would be very grateful for any suggestions or ideas on how to improve the quality of our map or further diagnose the problem.

Thank you very much in advance for your help!

Best regards,
Yury Zgadzay

CryoEM2 · October 23, 2025, 11:00pm

Those 2D classes do not look too bad: plenty of distinct views. It seems you mostly tried very fancy, extreme, expert tools and parameter changes, but your dataset should be able to give quality reconstructions using a much more standard workflow, many of which are well documented in guides, recent bioRxiv preprints, and on this discussion board.

1) Train different Topaz models for each of the rare views.
2) Use Select 2D to remove some of the single "macaroni elbow" class that is most dominant.
3) If you have 1.5 million particles that reach 3.26 Å resolution, then use only those particles in a 3D classification job with 20 classes, resolution 12, auto solvent mask near 30, auto solvent mask far 50, class similarity 0, save results each F-EM iteration, and convergence by density turned off.
4) You could also run multi-class ab initio with class similarity 1; try 5 or 6 classes to see if there are sub-assemblies.
5) There is an extremely high chance that you can rigid-body fit known structures into the density, so there is no need to try automated model builders. Use the unsharpened map to start this process.
6) Favor 3DVA over 3DFlex.
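The 3D classification settings suggested above can be kept as a small checklist, as sketched below. The dictionary keys are human-readable labels taken from the wording of this post; they are not CryoSPARC parameter names, and the values are copied as quoted, without assuming units. The helper `mean_particles_per_class` is illustrative and simply shows how thinly 20 classes spread the particle stack.

```python
# Human-readable labels only: these are NOT CryoSPARC parameter keys.
suggested_3d_classification = {
    "number of classes": 20,
    "resolution": 12,
    "auto solvent mask near": 30,
    "auto solvent mask far": 50,
    "class similarity": 0,
    "save results each F-EM iteration": True,
    "convergence by density": False,  # "turn off convergence by density"
}

def mean_particles_per_class(n_particles, n_classes):
    """Average class population if particles were split evenly (they rarely are)."""
    return n_particles / n_classes

n_classes = suggested_3d_classification["number of classes"]
print(mean_particles_per_class(1_500_000, n_classes))
```

With 1.5 million particles and 20 classes, an even split would leave 75,000 particles per class.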

jcoleman · October 24, 2025, 4:28pm

These classes and maps look pretty good but I think you can probably still improve them further, especially to resolve the areas which are more flexible.

My suggestion would be to run the CryoSPARC Micrograph Denoiser and then do template picking using templates generated from your map. When you run Inspect Picks, you can be pretty aggressive with the NCC score threshold when you have picked on denoised micrographs.
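The thresholding idea can be illustrated with a toy sketch that filters picks by NCC score and local power. It assumes the per-pick scores are already available as arrays; the function name `filter_picks` and the numbers in the example are made up for illustration and are not recommended thresholds.

```python
import numpy as np

def filter_picks(ncc, power, ncc_min, power_range):
    """Boolean mask of picks that pass an NCC floor and a local-power window."""
    ncc = np.asarray(ncc, dtype=float)
    power = np.asarray(power, dtype=float)
    power_lo, power_hi = power_range
    return (ncc >= ncc_min) & (power >= power_lo) & (power <= power_hi)

# Toy example: three picks with invented scores.
keep = filter_picks(ncc=[0.1, 0.5, 0.9], power=[100, 200, 5000],
                    ncc_min=0.4, power_range=(50, 1000))
print(keep.sum(), "of", keep.size, "picks kept")
```

Raising `ncc_min` is what "aggressive" means here: fewer picks pass, but those that do match the templates more closely.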

I would use the discarded particles from 2D to generate a couple of ab initio maps for decoys. Then feed these decoy maps and the map of your complex into several rounds of heterogeneous refinement. This will allow you to get the maximum number of particles out of your dataset, and we find this is useful for flexible proteins.
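The bookkeeping behind repeated decoy rounds can be sketched as a simple loop: classify, keep only the particles assigned to the real map, and repeat until a round removes almost nothing. In this sketch `assign_class` is a hypothetical stand-in for one heterogeneous refinement run, and the stopping rule (`min_removed_fraction`) is an assumption, not something stated in this post.

```python
import numpy as np

def decoy_cleanup(particle_ids, assign_class, target_class=0,
                  max_rounds=5, min_removed_fraction=0.01):
    """Iteratively keep only particles assigned to the target (non-decoy) class.

    assign_class(ids) must return one class label per particle id; here it stands
    in for a heterogeneous refinement against the target map plus decoy maps.
    """
    kept = np.asarray(particle_ids)
    history = [kept.size]
    for _ in range(max_rounds):
        labels = np.asarray(assign_class(kept))
        survivors = kept[labels == target_class]
        removed_fraction = (kept.size - survivors.size) / max(kept.size, 1)
        kept = survivors
        history.append(kept.size)
        if removed_fraction < min_removed_fraction:
            break  # assumed stopping rule: the last round removed almost nothing
    return kept, history

# Toy stand-in: every tenth particle is "junk" and always lands in decoy class 1.
def toy_assign(ids):
    return np.where(ids % 10 == 0, 1, 0)

kept, history = decoy_cleanup(np.arange(1000), toy_assign)
print(history)
```

In the toy run the first round removes the junk tenth and the second round removes nothing, so the loop stops after two rounds.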

The next step would be to run more refinements and ab initio jobs to characterize the heterogeneity and try to zero in on a subset of particles that have these regions better resolved. Hope this helps.