Why is the GSFSC curve strange when the analysis includes repeated particles?

Hi everyone!

I used the T20S proteasome movie data to practice data analysis.

I picked the same batch of movies twice with different parameters (such as maximum and minimum particle diameter), and then ran 2D classification and Select 2D several times for each set.

Then I combined the two Select 2D particle sets for homogeneous refinement, which means duplicate particles remained. The resulting GSFSC curve is quite strange: it decreases only slowly toward higher spatial frequencies, so the reported resolution comes out higher.

I wonder: ① Is this allowed? ② Why does the presence of repeated particles change the curve like this? ③ Is the resulting map trustworthy? (The maps differ slightly between the two runs; the 2.37 Å one shows more side-chain detail.)

Thanks a lot~~

Use remove duplicates first.

The presence of duplicate particles will cause the artifacts you are seeing - if you have identical particles in both half sets, they will produce spurious correlations (which are reflected in the FSC).


To expand on this - if the same particle images are present in the dataset, both the signal AND noise will correlate, which is why the FSC can’t go to zero. Even at very high frequencies, that noise will correlate.
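To make that concrete, here is a minimal numpy sketch (purely illustrative, not anything CryoSPARC does internally): treat each half-map at very high spatial frequency as an average of pure-noise particle images, and see what happens to the cross-correlation when a fraction of those particles is shared between the half-sets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_half = 4096, 500      # pixels per image, particles per half-set
n_dup = 50                     # particles duplicated into BOTH half-sets (10%)

# At very high spatial frequency there is essentially no signal left,
# so model each particle's contribution as pure noise.
shared = rng.normal(size=(n_dup, n_pix))            # duplicated particles
own_a  = rng.normal(size=(n_half - n_dup, n_pix))   # unique to half-set A
own_b  = rng.normal(size=(n_half - n_dup, n_pix))   # unique to half-set B

# Fully independent half-sets: cross-correlation ~ 0, so the FSC can fall to zero.
half_a_indep = np.concatenate([rng.normal(size=(n_dup, n_pix)), own_a]).mean(axis=0)
half_b_indep = np.concatenate([rng.normal(size=(n_dup, n_pix)), own_b]).mean(axis=0)
print(np.corrcoef(half_a_indep, half_b_indep)[0, 1])

# Half-sets sharing 10% of their particles: correlation ~ 0.1, and it never
# decays, no matter how high the spatial frequency.
half_a_dup = np.concatenate([shared, own_a]).mean(axis=0)
half_b_dup = np.concatenate([shared, own_b]).mean(axis=0)
print(np.corrcoef(half_a_dup, half_b_dup)[0, 1])
```

Even a small duplicated fraction leaves a residual correlation that does not decay with frequency, which is exactly the slowly sagging FSC tail in the plot.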


Thanks a lot for your kind help~ I actually tried Remove Duplicates several times with different minimum separation distances, but all of the results still showed the strange FSC curve. I guess Remove Duplicates could not remove the duplicates completely. In that case, I would like to know whether the extra density in the map is reliable, since combining the two sets contains more genuinely different particles than either set alone, even though the duplicates are hard to remove.

Thanks for your kind answer~ That makes sense. So does this mean that a resolution calculated with duplicates present does not reflect reality? In other words, perhaps we cannot claim we obtained a higher resolution.

Yes. When you refine a particle, you are counting on the fact that the particles are roughly identical, but the noise is random. Every time you add two particles together, the signal from the particle gets stronger, while the noise (being random) tends to cancel and becomes relatively weaker. Imagine a pixel at x=100, y=100 that is just noise off toward the edge of the image: in some particles its intensity might be 1, in others -1, so it tends toward 0 in the averaged structure. The parts of the image that are protein, by contrast, will always have the same (or nearly the same) value, because they are real signal rather than noise, and their intensities are not random.
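The 1/√N behaviour is easy to see in a toy numpy example (nothing cryo-EM specific is assumed here, just a fixed 1-D "signal" buried in random noise):

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles, n_pix = 1000, 256

signal = np.sin(np.linspace(0, 8 * np.pi, n_pix))                   # fixed "protein" signal
noisy  = signal + rng.normal(scale=3.0, size=(n_particles, n_pix))  # noise dominates each image

for n in (1, 10, 100, 1000):
    avg = noisy[:n].mean(axis=0)        # average n independent noisy observations
    residual = avg - signal             # what is left is purely the averaged noise
    print(f"N={n:5d}  residual noise std = {residual.std():.3f}")   # shrinks ~ 1/sqrt(N)
```

The signal term survives the averaging unchanged, while the residual noise drops roughly as 1/√N, but only because every observation's noise is independent.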

What a GSFSC is telling you is the resolution at which you can no longer distinguish signal from noise. This relies on the assumption that noise cancels as you average particles. However, this only works if the particles are independent observations. If they aren’t independent (even just a fraction of them), the noise will no longer cancel out when you add it together. Therefore, you aren’t learning what the resolution is anymore, you’re just observing a correlation between noise.
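For reference, the FSC is just the normalised cross-correlation between the two half-maps, computed shell by shell in Fourier space. A rough sketch is below (cubic maps assumed, shells binned out to the corner of the Fourier cube for simplicity; this is not CryoSPARC's implementation). With independent noise the high-frequency shells hover around zero; if some of the noise is shared between the halves, they stay elevated:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fsc(map_a, map_b, n_shells=16):
    """Fourier Shell Correlation between two cubic half-maps (illustrative sketch)."""
    fa, fb = np.fft.fftn(map_a), np.fft.fftn(map_b)
    freqs = np.fft.fftfreq(map_a.shape[0])
    fx, fy, fz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    r = np.sqrt(fx**2 + fy**2 + fz**2)                      # spatial frequency of each voxel
    shells = np.minimum((r / r.max() * n_shells).astype(int), n_shells - 1)
    out = []
    for s in range(n_shells):
        m = shells == s
        num = np.real(np.sum(fa[m] * np.conj(fb[m])))
        den = np.sqrt(np.sum(np.abs(fa[m])**2) * np.sum(np.abs(fb[m])**2))
        out.append(num / den)
    return np.array(out)

rng = np.random.default_rng(0)
signal = gaussian_filter(rng.normal(size=(48,) * 3), sigma=3)   # smooth "structure": low frequencies only
own_a, own_b, shared = (0.02 * rng.normal(size=signal.shape) for _ in range(3))

# Independent half-maps: FSC drops to ~0 once the smooth signal runs out.
print(np.round(fsc(signal + own_a, signal + own_b)[-4:], 2))

# Half-maps whose noise is partly shared (as with duplicated particles):
# the high-frequency shells stay well above zero.
half_a = signal + np.sqrt(0.7) * own_a + np.sqrt(0.3) * shared
half_b = signal + np.sqrt(0.7) * own_b + np.sqrt(0.3) * shared
print(np.round(fsc(half_a, half_b)[-4:], 2))
```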

Look at where the curve drops off - I’d estimate its true resolution is somewhere in the ~2.8 Å range.

I’m not totally sure how CryoSPARC handles particles that were picked separately but correspond to the same XY location. You could re-extract the particles using both particle location lists as input to the same job, and have it only extract them if they’re a certain distance from each other.
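As a sketch of that distance-threshold idea (this is not CryoSPARC’s actual implementation; the coordinates and the `merge_picks` helper are made up for illustration), merging two pick lists for one micrograph and dropping any pick in the second list that lands within a minimum distance of a pick in the first could look like:

```python
import numpy as np
from scipy.spatial import cKDTree

def merge_picks(picks_a, picks_b, min_dist_angstrom):
    """Merge two pick lists (N x 2 arrays of x, y in Angstroms for one micrograph),
    keeping picks from list B only if no pick in list A lies within min_dist_angstrom."""
    tree = cKDTree(picks_a)
    dists, _ = tree.query(picks_b, k=1)          # nearest list-A pick for each list-B pick
    keep_b = picks_b[dists > min_dist_angstrom]  # drop likely re-picks of the same particle
    return np.vstack([picks_a, keep_b])

# hypothetical coordinates on one micrograph, in Angstroms
picks_blob     = np.array([[1000.0, 1200.0], [3000.0, 500.0]])
picks_template = np.array([[1010.0, 1195.0], [5000.0, 4000.0]])  # first one duplicates a blob pick
print(merge_picks(picks_blob, picks_template, min_dist_angstrom=20.0))
```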


What range of values did you try for the minimum separation distance in Remove Duplicates, compared to the size of your particle? I suspect something must have gone wrong somewhere.


What’s interesting is that the FSC does fall to zero eventually - duplicate particles often mean it fails to reach zero at all. Still, the slow tail-off does indicate a problem.

Duplicates are the first thing to check; the second is whether the beam tilt estimates are extreme.


Could this be from only a fraction of the particles being duplicate?

I’ve never seen 100% of a particle set duplicated, although I’ll admit I’ve never tried the strategy the OP describes.

It’s tickled my curiosity, because the T20S tutorial dataset is pretty forgiving. Gimme a bit, I’m going to see if I can replicate this myself.

I have seen similar FSCs caused by a small percentage of duplicates; that would be my first guess. There’s nothing wrong with the strategy of trying multiple picking approaches and then combining and removing duplicates, but if the duplicates are not completely removed, this is what it can look like.


Yes, I’m rather OCD about making sure there aren’t any duplicates… :sweat_smile:

It’s one reason I prefer slightly lower concentrations with some dispersal, over highly concentrated, tightly packed samples (but they can give great results).

I saw something similar once with a dataset I know didn’t have duplicates (because it was fully manually picked) and the beam tilt estimate was orders of magnitude off, which is why I mentioned that.


Thanks~ Now I understand. “You could re-extract the particles using both particle location lists as input to the same job, and have it only extract them if they’re a certain distance from each other.” That sounds like the Template Picker, which has a “Min. separation dist (diameters)” parameter. The Remove Duplicate Particles job can also remove duplicates based on the distance between two particles. But it’s hard to remove exactly the right picks - no more and no fewer - especially when the particles are crowded. Of course, knowing how CryoSPARC handles the particle location information would help a lot.

I tried 20 and 0.5. With 20, the FSC curve is normal, but the resolution is lower (2.91 Å). With 0.5, the FSC curve is strange and the resolution is 2.57 Å. That makes sense: 20 removes a lot of particles, including genuinely different ones, while 0.5 cannot remove all the duplicates. But I don’t know how to determine the boundary value. I found that more high-quality particles give more side-chain information in the map (I’m not sure yet - the D7 symmetry of T20S may also affect the result, I guess), so I want to pick as many single particles as possible.

20 Å will not remove different particles given the size of your particle (the proteasome) - it should exclusively remove duplicates. This is consistent with what you observe: when you use 20 Å, the FSC is normal; when you use 0.5, the FSC is pathological due to retained duplicates.


This (full) dataset (EMPIAR-10025) was originally published at 2.8 Å; the 20 micrographs from the tutorial can match that with modern improvements to image processing and electron-optical parameter refinement. Which shows how good the software developers have become at estimating and correcting for electron-optical aberrations!

2.3-2.2 Å should be possible with the full dataset without too much trouble if using a hybrid processing scheme (Bayesian polishing).

Perhaps where some confusion occurred is that CryoSPARC measures the distance between picks as multiples of the particle diameter, but filters duplicates in Angstroms.
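That distinction matters for the numbers above. A quick sanity check of the units (the ~110 Å T20S diameter is the approximate value quoted later in this thread):

```python
particle_diameter = 110.0        # Å, approximate T20S barrel diameter

# Template Picker "Min. separation dist" is given in particle diameters:
print(0.5 * particle_diameter)   # 0.5 diameters = 55 Å between pick centres

# The remove-duplicates threshold, by contrast, is given directly in Å:
print(0.5)                       # 0.5 Å - far too small to catch any re-picks
print(20.0)                      # 20 Å - still well under one particle diameter,
                                 # so it can only remove duplicate picks
```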

Yes, I’m sure there are duplicates. The reason I said that is that my procedure, simply put, is the workflow described above; the aim is to miss as few high-quality particles as possible. Based on the particle numbers, I found that if identical particles come from the same picking job, CryoSPARC removes the duplicates automatically. If they come from different picking jobs, CryoSPARC cannot recognise them automatically - it seems a Remove Duplicate Particles job is needed.

Yes - identical particles with the same unique identifier (UID) will be de-duplicated automatically. Otherwise, Remove Duplicates is necessary.

Sorry, maybe I misunderstand. My understanding is that all particles carry location information, and 20 Å means that if the distance between two particles is less than 20 Å, they are marked as identical and one of them is removed. So in this case, 20 Å seems too large for the Remove Duplicate Particles job, because as you can see in the image, most particles are very close to each other. Then not only identical picks but also genuinely different particles would be removed - and I want to keep those for the high-resolution calculation.

T20S is approximately 110 x 160 Å (diameter x length), so 20 Å will only remove duplicate picks, not independent particles.

Unless the particles are overlapping, which you don’t want anyway.