2D classification becomes very slow in the last iteration

Particles: 2,300,000
Number of 2D classes: 100
GPU: 8 × 2080 Ti
Use SSD: yes

From iteration 1 to iteration 19, each iteration took about 4–5 minutes, but the last iteration has now been running for over 3 hours and the job is still not finished.
Does anyone know why? And how to speed up the job?

Hi @dingwei,

The final iteration of 2D classification by default processes the full set of particles input to the job (2.3M here), whereas every other iteration processes a much smaller subset, equal to the number of 2D classes multiplied by the “batchsize per class” parameter. The default batchsize per class is 100, so each earlier iteration was only processing 10,000 particles out of the full 2.3M, which explains why the last iteration is over two orders of magnitude slower.

You can turn off the final full-dataset iteration by setting the “Number of final full iterations” parameter to 0; however, in your case most of the dataset would then never be seen during the 2D classification, which is probably not desirable.

If you have multiple GPUs available, another workaround to speed it up is to use the Particle Sets Tool job to split your particle stack into batches and run a separate 2D classification on each batch. You can then use Select 2D to keep the good classes from each job, combine them, and re-run a 2D classification on the resulting “good” particle stack.
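To make the arithmetic concrete, here is a rough illustration using the numbers from your job (illustrative only; actual timings depend on box size, hardware, and I/O):

```python
# Rough arithmetic for why the final iteration is so much slower.
n_particles = 2_300_000        # full input stack
n_classes = 100                # "Number of 2D classes"
batchsize_per_class = 100      # default "batchsize per class"

per_iteration = n_classes * batchsize_per_class   # 10,000 particles per online iteration
final_iteration = n_particles                     # full stack in the final iteration

# ~230x more particles in the final iteration, so an iteration that took
# ~4-5 minutes can easily stretch to many hours.
print(final_iteration / per_iteration)
```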

Best,
Michael

Thank you very much for your detailed explanation. I will try splitting my particles and rerunning the job!
Best!
Wei

Hi @mmclean - one query about this: in the initial iterations, is each subset randomly chosen and unique? So if I have 20 initial iterations each seeing 10,000 particles, has the classification “seen” 200k particles? Or do the randomly chosen subsets overlap with one another?

Hey @olibclarke,

The order of the particle stack is randomized once at the start of the job, and then it is looped through sequentially in batches with no overlap (unless the end of the stack is reached, in which case every particle has been seen and it wraps around to the first batch). So yes – all particles should be seen once before any are seen twice.
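In pseudocode the batching scheme is roughly the following (a minimal sketch, not the actual implementation):

```python
import numpy as np

def batch_indices(n_particles, batch_size, n_iterations, seed=0):
    """Sketch of the batching scheme: shuffle the stack once, then walk it
    sequentially in non-overlapping batches, wrapping around to the start
    only once every particle has been seen."""
    order = np.random.default_rng(seed).permutation(n_particles)
    pos = 0
    for _ in range(n_iterations):
        if pos + batch_size > n_particles:
            pos = 0  # end of stack reached: all particles seen, wrap around
        yield order[pos:pos + batch_size]
        pos += batch_size

# e.g. 20 iterations of 10,000 particles see 200,000 distinct particles,
# as long as the stack holds at least that many.
```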

Best,
Michael

Ok - so in that case, is there an advantage in doing a final full iteration if all particles have been seen at least once in the initial iterations?

(I’m thinking specifically about cases with low SNR particles where we often perform more initial iterations with larger batch sizes)

In general the biggest benefit is probably in classification accuracy – the particles seen in the first 1-5 iterations were probably classified rather poorly, since the classes often haven’t converged at that stage. Seeing all particles at the end ensures that each particle is classified against the current best references, which is likely important if you are using 2D classification to filter junk.

Best,
Michael

Hi @mmclean - if this is the case then I’m not sure the numbers (“particles classified”) reported by Live are correct. I just ran a streaming refinement with Live using 180 classes, batchsize 400, and 40 iterations. In this case the number of particles classified should be ~2.8M, right? But it only reports 720k particles as having been classified.
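For reference, the back-of-the-envelope arithmetic I’m using (assuming each iteration classifies a full, non-overlapping batch of classes × batchsize particles, as in regular 2D classification):

```python
# Expected "particles classified" under the batching described above.
classes, batchsize, iterations = 180, 400, 40
per_iteration = classes * batchsize           # 72,000 particles per iteration
expected_total = per_iteration * iterations   # 2,880,000 (~2.8M), not 720k
```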

Hey @olibclarke,

The logic in streaming 2D classification is substantially different from that in standard 2D classification, partly because we’re continually checking for new particles. If you could send us a screenshot of the sidebar of the Live session (including the 2D class stats and the number of extracted particles) where you’re seeing this discrepancy, we can investigate further.

Best,
Michael