Does cryoSPARC remove duplicate particles?

hi all

I have an aligned map at pretty good resolution. There is symmetry that is not related by a point group. I used signal subtraction to remove each symmetric region, which at times are very close and even overlapping (depending on the view). This means I did 2 signal subtractions, effectively doubling the particle count. However, when I load both of the above stacks into a new refinement, I only end up with half the particles (loading both stacks gives the same # of particles as just one stack). I'm not sure if this is a bug, or if the particles are so close together that cryoSPARC is removing duplicates. Thanks.

I’m fairly certain that it does remove duplicate particles, but it would be very helpful for an expert to weigh in and describe exactly when/where/how it does so.

I’ve used this to advantage by running multiple classifications on a particle set then combining the good classes from each run into a single refinement to maximize the number of good particles. The # of particles output by the refinement is always less than the input #, so I assume it’s deleting duplicates. On the other hand, I do not know for sure, nor do I know on what basis it determines duplicates. I suspect (but don’t know) that the particle sets must all come from a single extraction job for this to work; otherwise, you may run into problems of having true duplicates (and incorrect FSC estimation) if they come from different extraction jobs. Again, advice from the experts would be appreciated.

It just occurs to me that I might have stumbled on a solution to your problem, which is to make sure your two sets of particles come from two different extraction jobs. Or maybe you need to go back further and make them come from two different micrograph imports. I'd worry about the believability of the FSC if I were trying this. YMMV.

Here’s how particles are handled in cryoSPARC internally:

When you extract or import a particle stack, cryoSPARC assigns a Unique Identifier (UID) number to each particle. As you process particle stacks through various jobs in cryoSPARC, the UID remains constant for each particle, even as you find alignments for it or perform signal subtraction.

When you combine two “different” particle groups into a job, the job takes the intersection of the particle groups based on their UIDs: i.e., it only keeps particles whose UID appears in both groups and uses the resulting particles for processing. Therefore the behaviour @orangeboomerang is seeing is intended.
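As an illustration (plain Python sets, not cryoSPARC's actual internals), UID-based intersection behaves like this: two subtracted stacks that carry the UIDs of the same original extraction collapse back to one set when combined.

```python
# Illustrative sketch only -- not cryoSPARC's implementation.
# Combining two particle groups keeps only UIDs present in both.

def combine_particle_groups(group_a, group_b):
    """Intersect two particle groups on their UID field."""
    uids_b = {p["uid"] for p in group_b}
    return [p for p in group_a if p["uid"] in uids_b]

# Both subtracted stacks inherit the UIDs of the original extraction:
stack1 = [{"uid": u, "source": "subtract_A"} for u in (1, 2, 3, 4)]
stack2 = [{"uid": u, "source": "subtract_B"} for u in (1, 2, 3, 4)]

combined = combine_particle_groups(stack1, stack2)
print(len(combined))  # 4, not 8 -- same count as loading a single stack
```

This is why loading both subtracted stacks gives the same particle count as loading one: every UID in stack 2 is already present in stack 1.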

In the future we may provide an option to allow reassignment of particle UIDs. In the meantime, you can try the following workaround to manually reassign all the UIDs:

  1. Export one of the Particle Subtraction jobs from the job sidebar.
  2. Identify the project directory, project ID and job ID for the Particle Subtraction job. For example, if you are on project 3 with subtraction job J42 and your project directory is /home/nick/cryosparc2_projects/P3, the project ID is P3 and the job ID is J42
  3. In a command line, navigate to the cryoSPARC installation directory, then into the cryosparc2_master directory.
  4. Run ./bin/cryosparcm icli to enter cryoSPARC’s interactive CLI mode
  5. Enter the following commands, substituting the PROJECT_DIRECTORY, PROJECT_ID and JOB_ID declarations according to your setup
    # ==== MODIFY THESE DECLARATIONS ACCORDINGLY ====
    PROJECT_DIRECTORY = '/home/nick/cryosparc2_projects/P3'
    PROJECT_ID = 'P3'
    JOB_ID = 'J42'
    # ===============================================

    # Build the path to the exported .cs particle dataset
    full_job_id = '{}_{}'.format(PROJECT_ID, JOB_ID)
    particles_location = '{}/exports/jobs/{}_particle_subtract/{}_particles/{}_particles_exported.cs'.format(PROJECT_DIRECTORY, full_job_id, full_job_id, full_job_id)

    # Load the dataset, assign fresh UIDs and write it back in place
    from cryosparc2_compute import dataset
    particles = dataset.Dataset()
    particles.from_file(particles_location)
    particles.reassign_uids()
    particles.to_file(particles_location)
    
  6. Press control + D to exit and enter y to confirm
  7. Re-import the job. For the above example the job import path is /home/nick/cryosparc2_projects/P3/exports/jobs/P3_J42_particle_subtract
  8. Reconnect the refinement inputs for the newly imported job instead of the previous exported job (J42) and retry the refinement

Let me know if you have any trouble with that.

Nick


Hi @nfrasser, on this topic: does it mean that if you take two particle select jobs which contain an overlapping set of particles, and you use those as input for a 2D classification job, the 2D classification uses only a unique set of particles, or in other words it gets rid of the duplicates, right? Many thanks for your answer!

@marino-j this only works if the particles from the two selection jobs ultimately came from the same initial picking job.

For example, say you run a single “Template Picker” job (the template picker never generates picks that overlap at the same location). You send that output to two “Inspect Particle Picks” jobs with different filters applied. When finished, the outputs of the two jobs have some overlap. You send both outputs to 2D Classification, which filters out the duplicates based on their unique IDs, because the particles were generated by the same Template Picker job.

If, instead, you create two different Template Picker jobs on the same exposures, then extract and send both outputs to 2D Classification, 2D Classification does NOT filter out particles at the same location, because they were generated by independent Template Picker jobs and assigned different random unique IDs.
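A small sketch of this distinction (assumed behaviour modelled with plain Python, not cryoSPARC internals): if each picking job assigns fresh random UIDs to its picks, two independent jobs get disjoint UID sets even for picks at identical locations, so UID-based deduplication cannot merge them.

```python
import random

# Hypothetical model: a picking job assigns a fresh random 64-bit UID
# to each pick it generates.
def run_picker_job(pick_locations):
    return [{"uid": random.getrandbits(64), "xy": xy} for xy in pick_locations]

locations = [(100, 200), (300, 400)]   # same spots found by both jobs
job1 = run_picker_job(locations)
job2 = run_picker_job(locations)

uids1 = {p["uid"] for p in job1}
uids2 = {p["uid"] for p in job2}

# Disjoint UID sets: UID-based dedup would keep all 4 "duplicates".
print(len(uids1 & uids2))  # 0 -- no shared UIDs between the two jobs
```

This is why location-identical picks from two independent pickers survive UID-based filtering, and why the later Remove Duplicate Particles job (which compares positions, not UIDs) is needed in that workflow.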

Hope that makes sense, let me know if there’s anything else I can clarify.


@nfrasser Your explanation does help. Thanks for taking the time to clarify.

RJ

@nfrasser thank you for the clarification, indeed that was the case!

Hi all,

Just to provide an update: in the just-released v3.0, we have exposed a standalone Remove Duplicate Particles job, located under the Utilities section. This can be used to filter out any particles that may have been picked too closely together, or in more advanced workflows such as safely combining particle picks from multiple different pickers (e.g. combining template picks and blob picks from two different jobs). A set of particles can be input to this job, and the output particles will have duplicates removed and can be used for classification or refinement. A few more details about this job are provided in the guide job page.

Best,
Michael

Hi mmclean,

In version 4.3.1, does Remove Duplicates successfully account for different binning factors when calculating separation distances? I know the job runs with inputs that are binned to varying levels, but I’m concerned about what’s happening while it’s running. My situation is that I have multiple particle stacks with different box sizes and binning factors ranging from 2x to 8x. All original pixel sizes are the same, however.

Also, in the next step for me, I want to extract all the non-duplicate particles at the same box size/binning factor, but I think (if memory serves) in the past the Extract from Micrographs job has failed when the inputs had different box sizes. Is there a way for me to modify the particle input so that job will be able to extract them all together, or will I need to extract them all separately and then remove the duplicates after extraction? Being able to extract only the non-duplicates would be very helpful for time/computer memory conservation.

Thanks

it handles binning no problem. it knows the origin/center of each pick and measures the distance in angstroms to flag neighbors that are too close.
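To illustrate why measuring in angstroms makes the check independent of binning, here is a minimal sketch (not cryoSPARC's code; the coordinates, pixel sizes and 20 Å threshold are made-up values): pick centers stored in binned pixels are converted to physical units before comparing, so a bin2 and a bin8 stack end up on the same scale.

```python
import math

# Hypothetical sketch of binning-independent duplicate detection.
# Each pick is (x_px, y_px, pixel_size_A), where pixel_size_A already
# includes the binning factor (e.g. 1.0 A/px raw -> 8.0 A/px at bin8).

def center_in_angstroms(x_px, y_px, pixel_size_A):
    return (x_px * pixel_size_A, y_px * pixel_size_A)

def is_duplicate(p1, p2, min_separation_A=20.0):
    x1, y1 = center_in_angstroms(*p1)
    x2, y2 = center_in_angstroms(*p2)
    return math.hypot(x2 - x1, y2 - y1) < min_separation_A

# Same physical pick, stored at bin2 (2.0 A/px) and bin8 (8.0 A/px):
pick_bin2 = (500.0, 600.0, 2.0)   # -> (1000 A, 1200 A)
pick_bin8 = (125.0, 150.0, 8.0)   # -> (1000 A, 1200 A)
print(is_duplicate(pick_bin2, pick_bin8))  # True
```

In pixel units these two picks look far apart (500 vs 125), but after scaling by each stack's pixel size they land on the same physical point.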

run the “remove duplicates” job type prior to the multiple extraction jobs you will need to run for each of the differently binned inputs. BTW, the extraction job wouldn’t have removed duplicates either, even if you were extracting duplicates. Remember also that extracting particles in a bigger box can result in slightly fewer particles extracted, since the new box runs off the edge of the micrograph. I don’t think that applies here, but it’s an obvious next question.


Thanks CryoEM2, I am trying out a Particle Sets Tool job to see if the intersections between the original stacks and the non-duplicate stack can be used to narrow down the number of particles I need to re-extract (the extracts at bin2 are taking a very long time, and I’m in a bit of a rush lol).

I found that removing the 2D Alignments component of the input particle stacks for the Extract job does resolve the issue of being unable to handle different binning/box sizes, but it’s not worth the loss of alignment information at my stage so I’ll be extracting them separately one way or another.

For anyone in the same situation, this works and will save me a little time.

Basically I have 4 particle stacks of differing box sizes/binning factors that need to be re-extracted. Duplicates have already been removed within each single stack, but there are some particles represented in multiple stacks that I don’t want to waste time or space extracting.

I ran a Remove Duplicates job first on all the particle sets together, which found ~150k duplicates. Then I ran 4 Particle Sets jobs, each with one of the full particle stacks in slot A and the list of particles represented in multiple stacks in slot B, with the action option set to ‘intersect’. The “A minus B” output is the list of particles in the input stack that are not present in any of the other stacks, and now I can extract only those, instead of extracting the full stack including duplicates and removing duplicates from the combined 4 sets later.
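The bookkeeping above can be sketched with plain Python sets (the stack names, UID values and two-stack simplification are made up for illustration; "dups" stands in for the duplicates found by Remove Duplicates run on all stacks together):

```python
# Hypothetical UIDs per stack; two stacks instead of four for brevity.
stacks = {
    "bin2": {1, 2, 3, 10},
    "bin4": {3, 4, 5, 10},
}
dups = {3, 10}  # particles flagged as represented in multiple stacks

# One "A minus B" per stack: keep only particles unique to that stack,
# so each re-extraction job processes no cross-stack duplicates.
to_extract = {name: uids - dups for name, uids in stacks.items()}
print(to_extract)  # {'bin2': {1, 2}, 'bin4': {4, 5}}
```

Note that this difference drops every flagged particle from every stack, so one copy of each cross-stack particle would need to be added back (e.g. from the Remove Duplicates "kept" output) if those particles are still wanted downstream.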

brilliant. don’t forget to benchmark the speed of the various extraction jobs (CPU version, GPU version with 0 GPUs, GPU version with 1 or more GPUs). for us the CPU job type is faster than the GPU job type. Just queue them all at the same time, see which makes it to 100 micrographs first, then kill the others and you have your best strategy going forward.
