Remove duplicates in Class 2D (query)


There is an option in Class2D to remove duplicate particles. Does this happen on the fly, in each iteration of 2D, and take into account offsets from alignments.2d? Or only at the start, using the raw x-y coordinates?


I think I’ve answered my own question: running Remove Duplicates with the same minimum distance after a 2D job in which duplicate removal was enabled still removes a lot of duplicates, suggesting that on-the-fly duplicate removal does not take offsets into account.

I think it probably should though? To be consistent with the standalone remove duplicates tool, if nothing else?
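
To make the distinction concrete, here is a minimal Python sketch of how applying alignment shifts can change which particles count as duplicates. The function names, the toy coordinates, and the sign convention (refined centre = picked location minus shift) are all assumptions for illustration; this is not CryoSPARC's actual implementation.

```python
import math

def refined_center(pick_xy_A, shift_px, psize_A):
    """Return a shift-corrected particle centre in angstroms.

    Assumption: the refined centre is the picked location minus the
    alignment shift, after converting the shift from pixels to angstroms.
    The real sign convention may differ.
    """
    return (pick_xy_A[0] - shift_px[0] * psize_A,
            pick_xy_A[1] - shift_px[1] * psize_A)

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Two hypothetical picks of the same particle, 30 A apart before alignment.
pick_a, shift_a = (100.0, 100.0), (-6.0, 0.0)
pick_b, shift_b = (130.0, 100.0), (6.0, 0.0)
psize_A = 2.0      # hypothetical pixel size of the particle images
min_sep_A = 20.0   # minimum separation distance

raw_sep = distance(pick_a, pick_b)
refined_sep = distance(refined_center(pick_a, shift_a, psize_A),
                       refined_center(pick_b, shift_b, psize_A))

print("raw coordinates: duplicate?", raw_sep < min_sep_A)
print("with 2D shifts:  duplicate?", refined_sep < min_sep_A)
```

In this toy case the two picks are only flagged as duplicates once the shifts are applied, which is the behaviour one would expect from a shift-aware check.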

Relatedly, in Remove Duplicates, the fields one can use to reject individual duplicates are as follows:

What is the difference between error and error_min for alignments2D and alignments3D?

It must be doing something other than using the raw coordinates, though, since the default distance is much smaller than typical minimum distances during picking?

Not necessarily - particles often come from re-extraction with recentering, or from combinations of multiple particle sets, both of which can introduce duplicates.

It would also be useful to have an option for duplicate removal in extract from micrographs, particularly when extracting with recentering, although maybe that adds too much computational overhead?

Duplicate removal is very cheap if implemented right, so I don’t think that would be an issue.
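
As a rough illustration of why this can be cheap, here is a minimal Python sketch of greedy duplicate removal using a spatial hash grid, so that each particle is compared only against neighbours in adjacent cells rather than against every other particle. The function name and the tuple layout `(x, y, error)` are hypothetical, and this is not necessarily how CryoSPARC implements it.

```python
from math import floor

def remove_duplicates(points, min_dist):
    """Greedy duplicate removal on one micrograph.

    points   : list of (x, y, error) tuples; a lower error is treated as better.
    min_dist : minimum allowed separation, in the same units as x and y.
    Returns (kept_indices, rejected_indices), both sorted.
    """
    # Visit particles from best to worst so the best one in a cluster survives.
    order = sorted(range(len(points)), key=lambda i: points[i][2])
    grid = {}  # maps a grid cell (cx, cy) to indices of kept particles in it
    kept, rejected = [], []
    min_dist_sq = min_dist * min_dist

    for i in order:
        x, y, _ = points[i]
        cx, cy = floor(x / min_dist), floor(y / min_dist)
        is_dup = False
        # With a cell size of min_dist, any kept particle closer than
        # min_dist must lie in one of the 3x3 surrounding cells.
        for nx in (cx - 1, cx, cx + 1):
            for ny in (cy - 1, cy, cy + 1):
                for j in grid.get((nx, ny), ()):
                    dx = x - points[j][0]
                    dy = y - points[j][1]
                    if dx * dx + dy * dy < min_dist_sq:
                        is_dup = True
                        break
                if is_dup:
                    break
            if is_dup:
                break
        if is_dup:
            rejected.append(i)
        else:
            kept.append(i)
            grid.setdefault((cx, cy), []).append(i)

    return sorted(kept), sorted(rejected)

# Toy example: two pairs of nearby picks, 20-unit minimum separation.
picks = [(0.0, 0.0, 0.5), (5.0, 0.0, 0.1), (100.0, 100.0, 0.3), (110.0, 100.0, 0.2)]
kept, rejected = remove_duplicates(picks, 20.0)
print(kept, rejected)
```

The sort dominates the cost, and the neighbour lookups are close to constant time per particle for typical picking densities. In practice the function would be run once per micrograph, since particles on different micrographs cannot be duplicates of each other.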


I think removing duplicates in 2D may simply be broken. I tried an extreme example: I set the extraction radius to 1 in Topaz and extracted 8.5M particles from 400 micrographs.

After 2D, zero particles rejected with remove duplicates activated and minimum distance set to 20 Å.

Running Remove Duplicates on all the particles after that round of 2D, with Shift Key set to alignments2D and minimum separation distance set to 20 Å, 8M particles are rejected, 0.5M kept. Even with Shift Key set to None, 8.2M duplicates are rejected, with 0.3M kept.

In other runs the on-the-fly duplicate removal rejected a few hundred particles during 2D, so it is clearly doing something, but I think it is broken somehow.


Hey @olibclarke, remove duplicates in 2D class only runs at the end of the last iteration, and as far as I can tell it does take shifts into account. It also uses pick_stats/ncc_score as the error field. However, there is a bug: if particles also have alignments3D connected, it’ll use the 3D shift field instead of the 2D one. We’ll fix that in a future update.

For particles without alignments3D, though, I’ve been unable to replicate this bug. 2D class with particle sets from different blob/template pickers rejected the same particles as the standalone job (same 20 Å separation distance, pick_stats/ncc_score as the error field, and using the particles + particles_rejected outputs from the 2D class job, which carry the same shifts as used in 2D class). Running remove duplicates on only the particles that 2D class output removed nothing.

Could you check that the 2D class particles have pick_stats/ncc_score and don’t have alignments3D, and rerun the standalone remove duplicates with pick_stats/ncc_score? If this persists, I’d love to get the particle .cs file / job log / stream log from your 2D class job to investigate.

Hi @kwang,

The particles going into 2D definitely don’t have alignments3D, as they came directly from Topaz Extract → Extract from Micrographs → 2D classification.

But now that you mention it, they probably don’t have an NCC score, as they came from Topaz! In that case, shouldn’t the on-the-fly duplicate removal default to using alignments2D/error or error_min for rejection, rather than failing silently? Wouldn’t that be the more universal parameter to choose in any case, given that (1) there is no guarantee that input particles will have an NCC score and (2) using the NCC score may not be sensible if the 2D job has multiple particle inputs generated using different picking approaches?

Happy to provide any files needed to reproduce the bug.



The NCC score is likely not the issue, as running Remove Duplicates standalone with NCC score as the error field works fine… And it is not specific to Topaz, as I see the same thing with the output of Blob Picker (very few duplicates rejected during 2D, but many rejected with standalone remove duplicates using same settings).

Here is an example:

From 2D log:

[CPU:  19.97 GB  Avail: 221.43 GB]
Done Full Iteration 49 took 3479.426s for 482032 images
[CPU:  19.97 GB  Avail: 221.43 GB]
Outputting results...
[CPU:  20.29 GB  Avail: 221.11 GB]
Removed 678 duplicate particles using 20 A minimum separation distance
[CPU:  20.29 GB  Avail: 221.11 GB]
Output particles to J204/J204_049_particles.cs
[CPU:  20.29 GB  Avail: 221.11 GB]
Output class averages to J204/J204_049_class_averages.cs, J204/J204_049_class_averages.mrc
[CPU:  20.29 GB  Avail: 221.11 GB]
Full class2D run took 48063.101s

From Remove Duplicates job using the output from 2D:

[CPU:  313.9 MB  Avail: 240.80 GB]
Loaded particle stack with 481354 items
[CPU:  314.1 MB  Avail: 240.81 GB]
Using alignment shifts from alignments2D.
[CPU:  314.1 MB  Avail: 240.81 GB]
Duplicates with worse alignments2D/error scores will be dropped.
[CPU:  314.1 MB  Avail: 240.81 GB]
Looking for duplicate particle locations...
[CPU:  701.2 MB  Avail: 240.42 GB]
275711 duplicates were found and rejected.

So 2D rejected 678 particles, while Remove Duplicates using the same settings rejects an additional 275,711.

Job report for Class 2D:

Job report for Remove Duplicates:

Hi @olibclarke,

I think we may have found the cause of this discrepancy. Based on the pixel sizes in the 2D class event log you attached, it looks like downsampling was likely used. Unfortunately, the implementation of remove duplicates in 2D Classification uses an incorrect pixel size (in the case of downsampled data) when converting particle coordinates into physical angstrom distances, and this would cause it to reject way fewer particles than expected. We’ll aim to fix this in an upcoming release, and thank you for your detailed reports as always!
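
As a toy illustration of how a pixel size mix-up can have this effect, here is a minimal Python sketch. The function names and numbers are hypothetical, and the direction of the mismatch (raw-micrograph pixel coordinates multiplied by the larger, downsampled pixel size) is an assumption chosen to match the observed symptom, not a description of the actual code.

```python
import math

def separation_angstrom(p_px, q_px, pixel_size_A):
    """Convert a separation measured in pixels to angstroms."""
    return math.hypot(p_px[0] - q_px[0], p_px[1] - q_px[1]) * pixel_size_A

def is_duplicate(p_px, q_px, pixel_size_A, min_sep_A):
    """True if the two positions are closer than the minimum separation."""
    return separation_angstrom(p_px, q_px, pixel_size_A) < min_sep_A

# Two hypothetical picks, 10 pixels apart in raw-micrograph coordinates.
p, q = (100.0, 100.0), (110.0, 100.0)
raw_psize_A = 1.0          # assumed pixel size of the micrograph
downsampled_psize_A = 4.0  # assumed pixel size of the downsampled particles
min_sep_A = 20.0

# Correct pixel size: 10 px * 1.0 A/px = 10 A, below the 20 A threshold.
print(is_duplicate(p, q, raw_psize_A, min_sep_A))
# Wrong pixel size: 10 px * 4.0 A/px = 40 A, so the pair is not flagged.
print(is_duplicate(p, q, downsampled_psize_A, min_sep_A))
```

Under this assumption every distance is overestimated by the downsampling factor, so only picks that are extremely close together would still fall under the threshold.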

