Difficulty with automated picking for small particles? (Topaz?)

I have found previous posts decently helpful in analyzing my SPA dataset (a 150 kDa protein complex with expected C1 symmetry). The protein has 3 domains, which I’ve confirmed with SEC-MALS, SDS-PAGE, and other methods. At first, my results looked promising after a round of manual picking and 2D classification: I got 3 domains as expected. (2 of the domains are obligately bound, while the 3rd may come off.) However, when I use my manual picks to train a Topaz model for automated particle picking, I start to lose 1 of the domains. This is plausible, since that domain is reversibly bound, but I’m wondering whether it might instead be an artifact of overfitted particle picking and/or alignment issues. I’m attaching some images of the workflow below.

First round of manual picking:

First round of 2D classification with purely manually picked particles, which showed promising 3-domain classes (2,344 particles):

First topaz model’s predictive performance:

2 representative micrographs that show this Topaz picking model’s strategy:

First round of 2D classification with automated particle picks (via Topaz):

At this point, it looked somewhat promising, though not as good as manual picking. However, I start encountering issues when I select some 2D classes (from the above) to then re-train a new Topaz model. I can’t seem to get a better Topaz model.

Additionally, according to my 3D volumes from ab initio reconstruction, many particles may lack the 3rd domain: out of 3 initial volumes, only roughly 1/3 of the particles fall into the one that shows it. (This is also concerning in its own right, since I would have expected some skewing of the distribution.)

Are there any strategies to increase the Topaz model’s particle-picking accuracy, and workflows to improve detection of that 3rd domain? My models max out at ~0.4 AUC-PR, which seems to be on the lower end. I am also wondering if there’s some odd alignment issue because of one of my protein’s domains. My hypothesis is that the 2 obligately bound domains have a very distinct “backwards S” shape that allows them to be easily classified in 2D classification. However, the full trimeric complex might look more globular and lose this “backwards S” shape, meaning that noise can easily end up in the same 2D classes that correspond to the full trimeric complex. I’m trying to be unbiased when I look at my micrographs, but I see too many 3-domain particles to believe that the full complex is only a minor species… (As you can see in the manually picked particles that were 2D classified, the densities are weak with only ~2k particles, but there is clearly a bias toward seeing all 3 domains…)
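For anyone else calibrating what a “~0.4 AUC-PR” means: it is essentially the average precision of the picker’s score-ranked picks. A minimal, self-contained sketch of the standard average-precision formulation (the labels and scores below are made up for illustration and are not Topaz’s internal implementation):

```python
# Toy illustration of average precision (AUC-PR) for score-ranked picks.
# Labels/scores are invented — not from this dataset or from Topaz itself.

def average_precision(labels, scores):
    """Average precision: mean of the precision measured at each
    true-positive rank, over all true positives (standard AP formula)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[order[rank - 1]] if False else labels[i]:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / n_pos

# toy example: 3 true particles among 5 picks, ranked by score
print(average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.5]))  # ≈ 0.806
```

A perfect picker scores 1.0; a picker that ranks noise above real particles drops toward the fraction of true particles in the candidate pool, which is why sparse or heterogeneous targets tend to sit at lowish AP even when the good picks are usable.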

So far, the thing that has helped the most is to set “force max poses” off. I appreciate those who have suggested that in previous posts — thank you!

An update and good-ish news! By setting 2D classification’s maximum alignment resolution to 3 Å, it seems like these extra features are starting to appear. Interestingly, my particles’ diameters might be more around ~80 Å, compared to what I expected at around ~110 Å. I wonder if this biases the particle picking in any way? This discrepancy is also interesting because when I perform template picking and inspect those picks, the inspector draws circles that are 136 Å in diameter, and it seems to fit the particles well. Maybe I’m losing information somehow when I go to 2D classification? E.g. one side of my protein gets aligned to the wrong side and “subtracts” density due to misalignment?

I think I will really need to play around with the “maximum alignment resolution” and “maximum resolution” parameters. Does anyone have good starting points? E.g. start at 3 Å and then slowly decrease with subsequent 2D classifications? Should I take a similar strategy when performing ab initio reconstructions and refinements?
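If I do end up sweeping this parameter round by round, one way to generate the values to try is a geometric coarse-to-fine schedule. The endpoints and round count below are placeholders, not recommendations from this thread:

```python
# Hypothetical coarse-to-fine sweep of an alignment-resolution limit (Å)
# across successive 2D classification rounds. Endpoints are placeholders.

def resolution_schedule(start_A=12.0, stop_A=3.0, n_rounds=4):
    """Step the resolution limit geometrically from coarse to fine."""
    ratio = (stop_A / start_A) ** (1 / (n_rounds - 1))
    return [round(start_A * ratio ** k, 1) for k in range(n_rounds)]

print(resolution_schedule())  # [12.0, 7.6, 4.8, 3.0]
```

A geometric (rather than linear) step keeps the relative change per round roughly constant, which matches how resolution limits act in Fourier space.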

Wanted to clarify what I meant by losing information due to misalignment. I read the homogeneous refinement documentation. Looking at the results for EMPIAR-10256 (Dang et al. 2019), when symmetry relaxation is applied, you recover the asymmetry of the whole protein complex. This is very similar to my dataset, in which the obligate dimer is symmetric but an asymmetric unit may bind to either side. I wonder if these asymmetries can affect processes even upstream of homogeneous refinement, or is this something to concern ourselves with only later in processing? Thanks!

What size box are you using for extraction? At first glance, it looks too small, which could be affecting class quality.

Hi @olibclarke, thanks for commenting. I’m currently extracting with a box size of ~167 Å and downscaling by a factor of ~0.53. Am I correct in assuming that, when I import with an “upscale” of 2 for EER files, I have to double the pixel box size? If yes, then it should be ~167 Å. I based this estimate on your previous posts about starting at about ~1.5× the expected particle size and going up to ~2–3× later in the processing pipeline, since high-frequency information is delocalized. Maybe my particle-size estimate is too large? My particle diameters do look relatively okay when I inspect picks in the micrographs, but maybe that’s not a good way of assessing particle diameter?
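For my own sanity, here is the box-size arithmetic I’m using, as a sketch. The pixel size and diameter are placeholder values (my real calibrated numbers differ), and the even-rounding is just an FFT-friendliness convention:

```python
import math

# Sketch of the box-size rule of thumb discussed above.
# diameter_A and pixel_size_A are placeholders — substitute calibrated values.

def box_size_px(diameter_A, pixel_size_A, factor=1.5, eer_upsample=1):
    """Box edge in pixels: factor × particle diameter, converted to pixels
    at the (possibly EER-upsampled) pixel size, rounded up to an even number."""
    eff_pixel = pixel_size_A / eer_upsample  # upsampling by 2 halves the pixel size
    px = factor * diameter_A / eff_pixel
    return math.ceil(px / 2) * 2  # even box edges are friendlier for FFTs

# e.g. a 110 Å particle at 1.0 Å/px, EER upsampling factor 2:
print(box_size_px(110, 1.0, factor=1.5, eer_upsample=2))  # 330
print(box_size_px(110, 1.0))                              # 166 (no upsampling)
```

This makes the doubling explicit: upsampling by 2 halves the effective pixel size, so the same physical box needs twice the pixels.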

You are correct, yes.

I ran into something similar while tuning Topaz on a few datasets. I could be wrong, but if your average precision is getting worse over epochs, I wouldn’t rely on the final model. I’m not sure which version you’re using, but if this is with newer cryoSPARC/Topaz setups, one thing that helped in my case was tuning “num_particles” and radius together rather than using defaults. In several datasets, lower num_particles (e.g. <200) and smaller radius values (1-2 instead of 3) gave more stable training behavior. Also, I found it useful to compare a few short runs rather than pushing a single training too far, and to select the best epoch instead of the last one if the curve starts to decline.
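The best-epoch point is easy to operationalize if you log the per-epoch metric. A minimal sketch, with invented (epoch, AUC-PR) pairs just to show the idea of taking the argmax rather than the final epoch:

```python
# Pick the best epoch from a training log instead of the last one.
# The (epoch, auprc) pairs are invented for illustration.

def best_epoch(metrics):
    """Return the epoch number with the highest AUC-PR."""
    return max(metrics, key=lambda m: m[1])[0]

log = [(1, 0.28), (2, 0.35), (3, 0.41), (4, 0.39), (5, 0.33)]  # declining tail
print(best_epoch(log))  # 3
```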

Thanks for the suggestion! I’ll give this a shot and will update the forum if it ends up working. Interestingly, I also had some success boosting the AUC-PR by playing around with the loss function. For this dataset, GE-KL seems to outperform GE-binomial.