CryoSPARC Workflow Discussion: Comparing Different Processing Orders

Hi everyone,

I hope this message finds you well. I have a question regarding workflow optimization in CryoSPARC and would greatly appreciate your insights.

In my previous lab, I typically followed the following workflow for single-particle cryo-EM computations:

  1. Ab-Initio Reconstruction: Run Ab-Initio on the particles to generate 3-5 initial volumes.
  2. Hetero-Refine: Use all the particles and all of the volumes generated by the Ab-Initio step as inputs to a Hetero-Refine job, which reassigns classes by leveraging both the particles and the multiple initial reconstructions.
  3. Non-Uniform Refinement (nu-Refine): Merge the best-performing class(es) and input them into a Non-Uniform Refinement job for higher-resolution refinement.

If the results are unsatisfactory, I repeat the cycle of Ab-Initio → Hetero-Refine → nu-Refine. Occasionally, I also include the refined volume from the previous round of nu-Refine as an additional input in the next Hetero-Refine task to improve particle selection.
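For reference, here is a rough cryosparc-tools sketch of that chain; the job-type strings, parameter keys, and output group names below are assumptions on my part and should be checked against the Job Builder in your CryoSPARC version.

```python
# Rough sketch of the Ab-Initio -> Hetero-Refine -> NU-Refine chain using
# cryosparc-tools. Job-type strings, parameter keys, and output group names
# are assumptions -- verify them against the Job Builder before running.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
               email="user@example.com", password="password")
workspace = cs.find_project("P1").find_workspace("W1")   # hypothetical IDs

# 1. Ab-Initio: ask for a handful of initial volumes from the curated stack (J10).
n_classes = 4
abinit = workspace.create_job(
    "homo_abinit",
    connections={"particles": ("J10", "particles")},     # hypothetical source job
    params={"abinit_K": n_classes},
)
abinit.queue("default")
abinit.wait_for_done()

# 2. Hetero-Refine: all particles plus every ab-initio volume as references.
hetero = workspace.create_job(
    "hetero_refine",
    connections={
        "particles": ("J10", "particles"),
        # If your version does not accept a list here, connect the extra
        # volumes one by one in the UI or with Job.connect().
        "volume": [(abinit.uid, f"volume_class_{k}") for k in range(n_classes)],
    },
)
hetero.queue("default")
hetero.wait_for_done()

# 3. NU-Refine: feed the best-looking class (say class 2, after inspection)
#    into Non-Uniform Refinement.
nu = workspace.create_job(
    "nonuniform_refine_new",
    connections={
        "particles": (hetero.uid, "particles_class_2"),
        "volume": (hetero.uid, "volume_class_2"),
    },
)
nu.queue("default")
```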

However, at my new school, I’ve noticed that some colleagues use a different workflow: Ab-Initio → nu-Refine → Hetero-Refine, which is quite different from what I’m used to. This has left me wondering about the rationale behind this approach.

Could anyone who has tried or is familiar with this alternative workflow (Ab-Initio → nu-Refine → Hetero-Refine) share their experiences? Specifically:

  1. Does this workflow have specific advantages under certain conditions?
  2. When might it be more beneficial to run nu-Refine immediately after Ab-Initio, rather than Hetero-Refine?

Any advice or insights would be greatly appreciated! Thank you in advance for your help.

Best regards,
Zhe


Hi zhe,

I don't believe you can define a single workflow that will be optimal for everything. First of all, very often you don't even need ab-initio. You certainly don't need to run ab-initio with many millions of particles from the same sample; just take a fraction of them, say 10k-20k per ab-initio class you are asking for. Ab-initio is very time-consuming and gives the particles complete freedom to move around, so things can get messed up there.

Very often you can give a starting model (anything with a similar structure) as input to a homogeneous refinement from the very beginning (your colleagues use NU-refine, but it will do just about the same, unless your particle set is really clean and there is no conformational heterogeneity). The point of this is to center all the particles in 3D and pack them together before trying to separate them. The refined volume can come out pretty ugly if there is heterogeneity, but that doesn't matter, since the next steps will try to classify the different conformations/proteins. Depending on how it looks, you can then try 3D classification, 3DVA, 3DFlex, or just go straight to hetero refinement, ignoring the reported resolution no matter what. Without that initial refinement (= alignment), programs will spend more time trying to figure out the centering, and might not work at all depending on what you had on the grid.
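If you script your processing, here is a rough sketch of that idea with cryosparc-tools; the job-type strings, parameter keys, and output names below are assumptions on my part, so check them against the Job Builder in your CryoSPARC version.

```python
# Rough sketch of "refine first against any reasonable map" with
# cryosparc-tools. Job-type strings, parameter keys, and output names are
# assumptions -- check the Job Builder in your CryoSPARC version.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
               email="user@example.com", password="password")
workspace = cs.find_project("P1").find_workspace("W1")   # hypothetical IDs

# Import any map with roughly the expected shape (a homolog, another
# conformation, a map simulated from a predicted model, ...).
imp = workspace.create_job(
    "import_volumes",
    params={
        "volume_blob_path": "/data/maps/homolog_lowpass.mrc",  # assumed key
        "volume_psize": 1.06,                                   # assumed key (A/pix)
    },
)
imp.queue("default")
imp.wait_for_done()

# Homogeneous refinement of ALL particles against that single reference.
# The map may come out ugly if the sample is heterogeneous -- the point is
# only to center/align everything in 3D before trying to separate it.
homo = workspace.create_job(
    "homo_refine_new",
    connections={
        "particles": ("J10", "particles"),           # hypothetical particle stack
        "volume": (imp.uid, "imported_volume_1"),    # assumed output name
    },
)
homo.queue("default")
```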
I like 2D classification as well, but I mostly use it to visualize the particle set; only the classes that are clearly background or ice blobs get eliminated. If the separation strategy works, the 2D classes will look nicer and nicer - this is a good control IMO.
In the first rounds of hetero refinement, we often add a few junk volumes as inputs; the aim is to eliminate junk particles. Looking at the final class percentages helps you decide whether you still need to repeat that round.
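If it helps, here is a rough cryosparc-tools sketch for pulling those percentages out of a finished hetero refinement job; the output group names are an assumption, so check the job's Outputs tab.

```python
# Rough sketch: pull per-class particle counts out of a finished hetero
# refinement job with cryosparc-tools, to decide whether another junk-removal
# round is needed. The output group names ("particles_class_k") are an
# assumption -- check the job's Outputs tab.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
               email="user@example.com", password="password")
job = cs.find_project("P1").find_job("J25")   # hypothetical hetero refine job

n_classes = 4                                 # however many volumes were connected
counts = [len(job.load_output(f"particles_class_{k}")) for k in range(n_classes)]
total = sum(counts)
for k, n in enumerate(counts):
    print(f"class {k}: {n:8d} particles ({100 * n / total:5.1f}%)")
# If the junk classes still absorb a sizeable percentage, run another round.
```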
Well… you've got pretty much all my secrets, and I'm used to dealing with heterogeneous, flexible stuff. Other cases might be much easier.
I believe the most difficult decision in SPA is whether to stop processing or not. I always think I can improve the map somehow…

Luck!


Thank you so much for your detailed response, Carlos! I completely agree with your points, especially that no single workflow can be applied universally to all datasets. It truly depends on the specific characteristics of the data and the experimental goals.

In my work, I have also found that each dataset requires a slightly different approach. For instance, I typically use particles from bin2 or bin3 for 2-3 rounds of 2D classification because I find 2D classification to be more efficient for rapid screening and visualization. If the data contains significant heterogeneity, I tend to be more cautious about whether to proceed with 3D classification to avoid over-screening and losing useful information.
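For what it's worth, here is a rough cryosparc-tools sketch of that bin2 screening step; the job-type strings and parameter keys are assumptions and should be verified in the Job Builder.

```python
# Rough sketch of the bin2 screening step with cryosparc-tools: Fourier-crop
# the particles, then run a quick 2D classification on the smaller images.
# Job-type strings and parameter keys are assumptions -- confirm them in the
# Job Builder for your CryoSPARC version.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
               email="user@example.com", password="password")
workspace = cs.find_project("P1").find_workspace("W1")   # hypothetical IDs

# Fourier-crop e.g. a 400 px box to 200 px ("bin2") for fast screening.
down = workspace.create_job(
    "downsample_particles",
    connections={"particles": ("J12", "particles")},      # hypothetical extract job
    params={"box_size_out": 200},                         # assumed key
)
down.queue("default")
down.wait_for_done()

# Quick 2D classification on the binned stack, mainly for visual triage.
c2d = workspace.create_job(
    "class_2D_new",
    connections={"particles": (down.uid, "particles")},
    params={"class2D_K": 100},                            # assumed key: class count
)
c2d.queue("default")
```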

Regarding ab-initio, I agree that it is time-consuming, but in my experience it is still an essential step: it plays a crucial role in generating initial models, especially when dealing with complex or flexible particles. I often adjust certain parameters to make the calculation more "strict." While this increases the runtime (sometimes to more than 24 hours), it also helps me generate higher-quality volume maps.
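As an example, a "stricter" ab-initio run could be set up with cryosparc-tools roughly like this; the parameter keys are my guesses at the internal names of the corresponding UI fields, so please verify them in the Job Builder before use.

```python
# Rough sketch of a "stricter" ab-initio run via cryosparc-tools. The
# parameter keys below are my guesses at the internal names of the UI fields
# (resolution limits, iteration counts) -- verify them in the Job Builder.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
               email="user@example.com", password="password")
workspace = cs.find_project("P1").find_workspace("W1")   # hypothetical IDs

abinit = workspace.create_job(
    "homo_abinit",
    connections={"particles": ("J15", "particles")},      # hypothetical input
    params={
        "abinit_K": 4,                   # number of initial volumes
        "abinit_max_res": 7,             # assumed key: tighter resolution limit (A)
        "abinit_num_init_iters": 300,    # assumed key: more initial iterations
        "abinit_num_final_iters": 500,   # assumed key: more final iterations
    },
)
abinit.queue("default")
```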

In practice, my workflow is as follows:

  1. After completing 2D classification, I proceed directly to ab-initio to generate several initial model classes.
  2. Then, I perform hetero-refine on each class to gradually remove noisy particles and heterogeneous species while optimizing the map quality.
  3. After a few rounds of “ab-initio → hetero-refine,” I select the best-performing class and move on to nu-refine.

This process has become somewhat of a habit for me. Following these steps, I typically proceed with 3D classification, 3DVA, 3DFlex, etc., to separate different dynamic or conformational states and try to push the resolution further. Of course, local refinement is also a critical step in this workflow.
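As one example of that follow-up stage, here is a rough cryosparc-tools sketch of launching a 3D Variability job on the refined particles; the job type, input names, and parameter key are assumptions to be checked in the Job Builder.

```python
# Rough sketch of one follow-up heterogeneity step with cryosparc-tools:
# a 3D Variability job on the refined particles with a solvent mask. The job
# type ("var_3D"), input names, and parameter key are assumptions -- check
# the Job Builder in your CryoSPARC version.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
               email="user@example.com", password="password")
workspace = cs.find_project("P1").find_workspace("W1")   # hypothetical IDs

va = workspace.create_job(
    "var_3D",
    connections={
        "particles": ("J30", "particles"),   # hypothetical NU-refine job
        "mask": ("J31", "mask"),             # hypothetical mask from Volume Tools
    },
    params={"var_K": 3},                     # assumed key: number of modes
)
va.queue("default")
```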

However, I recently noticed that a new colleague follows a slightly different approach: after 2D classification, he goes straight to ab-initio and then applies nu-refine to each class before proceeding with hetero-refine. This is quite different from what I have learned so far, and it makes me wonder what other workflows people use for data processing.

In my view, the key to successful data analysis lies in understanding the function of each step and applying it flexibly based on the specific characteristics of the dataset and the experimental goals. No single method can guarantee perfect results for every dataset, so some trial and error is inevitable. These are just some of my current thoughts and experiences; I'd love to hear more insights and suggestions from everyone!

Cheers!
Zhe


Hi @zhe

If you have the time, this case study by Oliver Clarke was pretty helpful and lets you test different processing perspectives: https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/case-study-exploratory-data-processing-by-oliver-clarke

Sometimes decoy analysis can help, as can experimenting with different filtering settings and the O-EM learning rate in 3D classification.

It is hard to get the details of SPA processing, even from published papers, but this community tries. I also found https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/case-study-end-to-end-processing-of-an-inactive-gpcr-empiar-10668 to be extensively detailed.

Remember you can always export data out of CryoSPARC to try other programs as well.

Best of luck to you.


Thank you Mark, those are very useful links!


Hi Mark-As-Nakasone,
Thank you so much for your detailed reply and for sharing these fantastic CryoSPARC tutorials. I've come across them before, but it's wonderful to see how comprehensively they cover topics like exploratory data analysis and end-to-end processing. Oliver Clarke's example especially stands out; his insights on the discussion forums have been incredibly helpful to me in the past.

As Carlos mentioned, "the most difficult decision in SPA is whether to stop processing or not. I always think I can improve the map somehow…" This sentiment really resonates with me when working with single-particle data. It is also interesting that many recent papers present quite simplified data-processing workflows, likely because of the complexity involved when a dataset can require hundreds or even thousands of jobs.

From my own experience, alternating between RELION and CryoSPARC has become common practice for optimizing results. Now that I'm working on some cryo-ET projects, combining multiple software packages is more essential than ever, though I wish CryoSPARC were available for this as well. Still, having such an open and active community like CryoSPARC's is truly rare and invaluable.

I will continue to learn from the resources you’ve shared and apply these insights to my work. Thank you again for your guidance and for fostering a collaborative environment where we can all grow.

Best regards,

Zhe


Great to hear @zhe

The only other "controversial" advice I have picked up here…

If you have collected in .eer format, yes, you can experiment with different numbers of fractions - but the interesting one was upsampling the .eer x2 (4k => 8k). Even if the FSCs are clearly not hitting Nyquist, this can improve the map and FSC in many of the cases people here have tried - if you have the disk space and VRAM.
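To make the trade-off concrete, here is a small back-of-the-envelope calculation; the 0.73 Å physical pixel is just a hypothetical value, so substitute your own calibration.

```python
# Back-of-the-envelope numbers for EER upsampling, assuming a hypothetical
# 0.73 A physical pixel on a 4k sensor -- substitute your own calibration.
physical_pixel = 0.73      # A/pix at 4096 x 4096
upsample = 2               # render the EER frames at 8192 x 8192

super_res_pixel = physical_pixel / upsample
print(f"super-res pixel : {super_res_pixel:.3f} A/pix")
print(f"Nyquist at 4k   : {2 * physical_pixel:.2f} A")
print(f"Nyquist at 8k   : {2 * super_res_pixel:.2f} A")
# Even if the FSC never approaches the 8k Nyquist, the finer sampling can
# still help -- at the cost of ~4x the pixels per frame (disk space and VRAM).
```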

Hi Carlos
You mention in your reply to Zhe that "very often you can give a starting model (anything with a similar structure) as input to a homogeneous refinement from the very beginning." I'm simply curious what this process looks like. Do you input a volume map of a similar structure?

Indeed, we currently have a cryo-EM microscope equipped with a Falcon 4i camera. Our data processing typically involves EER upsampling followed by binning to bin2 during MotionCorr. I'm not entirely clear on the reasoning behind this approach. I've also tried skipping the upsampling and using bin1 during MotionCorr instead, but I'm not sure what difference it makes.


In my opinion, there are numerous ways to generate this model. You can create it from scratch with ab-initio methods, download the map of a similar homologous protein from online resources, use the same protein with or without ligands, convert a PDB model file into a map, or use many other available sources. I don't think there is any need to be overly concerned about minor structural differences in these models, since low-pass filtering is applied during the computation. Any suitable initial model can therefore be used, based on its availability and relevance.

The question was how to input a PDB into CryoSPARC. I found the answer myself: convert the PDB to a volume using Chimera and then input the volume map to the refinement.
Your answer was more about how to do homology modeling.
But thanks,
P

Dear P

I don't think CryoSPARC recognizes PDB files; as you said, the conversion from PDB to MRC needs to be done in Chimera.
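For reference, here is a rough sketch of scripting that conversion by driving ChimeraX headlessly from Python; the exact molmap/save command syntax can vary between versions, so treat it as a starting point rather than a definitive recipe.

```python
# Rough sketch: convert a PDB model to an MRC map by driving ChimeraX in
# headless mode from Python (classic Chimera has an equivalent molmap
# command). The exact molmap/save syntax can differ between versions --
# check "help molmap" before relying on this.
import subprocess

pdb_in = "model.pdb"        # hypothetical input model
mrc_out = "model_20A.mrc"   # hypothetical output map
resolution = 20             # a low-resolution map is fine as an initial reference

commands = (
    f"open {pdb_in}; "
    f"molmap #1 {resolution}; "    # simulate a density map from the model
    f"save {mrc_out} models #2; "  # the simulated map opens as model #2
    "exit"
)
subprocess.run(["chimerax", "--nogui", "--cmd", commands], check=True)
```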

Best
Zhe


By chance I’m coming back here… sorry it took a month!

Yes, by "model" I meant a map. The nomenclature overlaps sometimes.

The map doesn't need to be very, very similar either; it can even be a different conformation or a homolog structure, just something with the expected shape. People use AlphaFold models for this, and even when the prediction is full of spaghetti it works, provided it is not completely off.