Several extraction jobs (e.g. downsampling) have an option to select the number of particles to be extracted. When this number is set to less than the possible total, how is the selected group chosen?
Is it chosen randomly, with an equal number taken from each of the two independent datasets used to calculate the GSFSC?
I have a massive number of particles (several million after symmetry expansion) and would like to use this option to reduce the computational load.
The selected group is chosen by the order of particles in the particles.cs file (i.e. particles 1 → X), not randomly. Particles are assigned their order in the file the first time they are picked/extracted, and this is normally micrograph index order (generally the order in which the micrographs were read into CS during import).
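As a toy illustration of what "first X in file order" means in practice, here is a minimal sketch in plain Python. It is not CryoSPARC code: the `particles` list, `take_first`, and `take_random` are hypothetical names made up for this example, and the random variant is only there for contrast.

```python
import random

# Hypothetical toy dataset: 5 micrographs with 4 particles each, stored in
# micrograph index order to mimic the ordering described above.
particles = [(mic, idx) for mic in range(5) for idx in range(4)]


def take_first(items, n):
    """Keep the first n items in stored order (the behaviour described above)."""
    return items[:n]


def take_random(items, n, seed=0):
    """Keep n items chosen at random, for contrast only (not what the job does)."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(items)), n))
    return [items[i] for i in chosen]


first = take_first(particles, 8)
rand = take_random(particles, 8)

# Which micrographs contribute to each subset?
mics_first = sorted({mic for mic, _ in first})
mics_random = sorted({mic for mic, _ in rand})

print("first-N subset covers micrographs:", mics_first)
print("random subset covers micrographs:", mics_random)
```

In this toy case the first-8 subset only contains particles from the earliest micrographs, which is worth keeping in mind if the early and late micrographs in your dataset differ in quality.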
In terms of GSFSC splits, picking/extraction jobs are not concerned with maintaining proper splits. Any time you run a job type that requires a split for GSFSC calculations, you will see a message at the top of the log detailing the split. For example:
====== Gold Standard Split ======
Particles have input alignments3D connected, so reusing pre-existing split
Set A is greater than set B by 59 particles (0.00604 percent difference relative to the total dataset).
Split A has 488808 particles
Split B has 488749 particles
In the event that the split differs by more than 2%, you will see the following message in the log along with a warning appearing on the job card:
====== Gold Standard Split ======
Particles have input alignments3D connected, so reusing pre-existing split
Set A is greater than set B by 49998 particles (100 percent difference relative to the total dataset).
This is a difference of greater than 2%.
If equally-sized Gold Standard splits are desired, please use the
'Balance half-sets' mode in Particle Sets Tools.
Alternatively, 'Force re-do GS split' may be enabled, but this might not preserve Gold Standard
independence.
Split A has 49999 particles
Split B has 1 particles
In this event, I would recommend using the 'Balance half-sets' mode in Particle Sets Tools to rebalance the split.
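The percentages in the two log excerpts above are consistent with taking the size difference between the half-sets relative to the total particle count. Here is a minimal sketch of that arithmetic. The function names are hypothetical, the three-significant-figure formatting is inferred from the excerpts, and applying the 2% threshold to this same quantity is an assumption rather than something confirmed from the CryoSPARC source.

```python
def split_difference_percent(n_a, n_b):
    """Size difference between half-sets as a percentage of the total dataset."""
    return abs(n_a - n_b) / (n_a + n_b) * 100.0


def exceeds_threshold(n_a, n_b, threshold_percent=2.0):
    """Assumed check: does the imbalance exceed the 2% quoted in the log?"""
    return split_difference_percent(n_a, n_b) > threshold_percent


# Half-set sizes taken from the two log excerpts above.
balanced = split_difference_percent(488808, 488749)
unbalanced = split_difference_percent(49999, 1)

# Three significant figures, matching how the excerpts appear to be printed.
print(format(balanced, ".3g"), exceeds_threshold(488808, 488749))
print(format(unbalanced, ".3g"), exceeds_threshold(49999, 1))
```

You can plug in your own half-set counts to see whether a subset would be likely to trigger the warning before deciding to rebalance.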