2D classification becomes very slow in the last iteration

Particles: 2,300,000
Number of 2D classes: 100
GPU: 8 × 2080 Ti
Use SSD: yes

From iteration 1 to iteration 19, each iteration took about 4–5 minutes, but the last iteration has now been running for over 3 hours and the job is still not finished.
Does anyone know why? And how to speed up the job?

Hi @dingwei,

The final iteration of 2D classification by default processes the full set of particles input to the job (2.3M here), whereas every other iteration processes a much smaller subset, equal to the number of 2D classes multiplied by the “batchsize per class” parameter. The default batchsize per class is 100, so each earlier iteration was only processing 10,000 particles out of the full 2.3M, which explains why the last iteration is over two orders of magnitude slower.

You can turn off the final full-dataset iteration by setting the “Number of final full iterations” parameter to 0; however, in your case most of the dataset would then never be seen during the 2D classification, which is probably not desirable.

If you have multiple GPUs available, another workaround to speed it up is to use the Particle Sets Tool job to split your particle stack into batches and run a separate 2D classification on each batch. You can then use Select 2D to keep the good classes from each job, combine them, and re-run a 2D classification on the resulting “good” particle stack.
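To make the arithmetic concrete, here is a rough illustration using the numbers from your job (illustrative only; actual timings depend on box size, hardware, and I/O):

```python
# Rough arithmetic for why the final iteration is so much slower.
n_particles = 2_300_000        # full input stack
n_classes = 100                # "Number of 2D classes"
batchsize_per_class = 100      # default "batchsize per class"

per_iteration = n_classes * batchsize_per_class   # 10,000 particles per online iteration
final_iteration = n_particles                     # full stack in the final iteration

# ~230x more particles in the final iteration, so an iteration that took
# ~4-5 minutes can easily stretch to many hours.
print(final_iteration / per_iteration)
```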

Best,
Michael

Thank you very much for your detailed explanation. I will try splitting my particles and rerunning the job!
Best!
Wei

Hi @mmclean - one query about this: in the initial iterations, is each subset randomly chosen and unique? So if I have 20 initial iterations each seeing 10,000 particles, has the classification “seen” 200k particles? Or do the randomly chosen subsets overlap with one another?

Hey @olibclarke,

The order of the particle stack is randomized once at the start of the job, and then it is looped through sequentially in batches with no overlap (unless the end of the stack is reached, in which case every particle has been seen and it wraps around to the first batch). So yes – all particles should be seen once before any are seen twice.
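In pseudocode the batching scheme is roughly the following (a minimal sketch, not the actual implementation):

```python
import numpy as np

def batch_indices(n_particles, batch_size, n_iterations, seed=0):
    """Sketch of the batching scheme: shuffle the stack once, then walk it
    sequentially in non-overlapping batches, wrapping around to the start
    only once every particle has been seen."""
    order = np.random.default_rng(seed).permutation(n_particles)
    pos = 0
    for _ in range(n_iterations):
        if pos + batch_size > n_particles:
            pos = 0  # end of stack reached: all particles seen, wrap around
        yield order[pos:pos + batch_size]
        pos += batch_size

# e.g. 20 iterations of 10,000 particles see 200,000 distinct particles,
# as long as the stack holds at least that many.
```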

Best,
Michael

Ok - so in that case, is there an advantage in doing a final full iteration if all particles have been seen at least once in the initial iterations?

(I’m thinking specifically about cases with low SNR particles where we often perform more initial iterations with larger batch sizes)

In general the biggest benefit is probably in classification accuracy – the particles seen in the first 1-5 iterations were probably classified rather poorly, since the classes often haven’t converged at that stage. Seeing all particles at the end ensures that each particle is classified against the current best references, which is likely important if you are using 2D classification to filter junk.

Best,
Michael

Hi @mmclean - if this is the case then I’m not sure the numbers (“particles classified”) reported by Live are correct. I just ran a streaming refinement with Live using 180 classes, batchsize 400, and 40 iterations. In this case the number of particles classified should be ~2.8M, right? But it only reports 720k particles as having been classified.
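For reference, the back-of-the-envelope arithmetic I’m using (assuming each iteration classifies a full, non-overlapping batch of classes × batchsize particles, as in regular 2D classification):

```python
# Expected "particles classified" under the batching described above.
classes, batchsize, iterations = 180, 400, 40
per_iteration = classes * batchsize           # 72,000 particles per iteration
expected_total = per_iteration * iterations   # 2,880,000 (~2.8M), not 720k
```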

Hey @olibclarke,

The logic in streaming 2D classification is substantially different from that in standard 2D classification, partly because we’re continually checking for new particles. If you could send us a screenshot of the sidebar of the Live session (including the 2D class stats and the number of extracted particles) where you’re seeing this discrepancy, we can investigate further.

Best,
Michael