Preload size nearly the same after selecting ~1/6 of the particles

Hey, I just noticed some (to me) strange behaviour with the preloading of particles onto an SSD.

I did a 2D classification with around 600,000 particles. It preloaded 1,740,649 MB, as expected. Then I ran a Select job, chose 115,000 particles, and started an Ab-initio, again with preloading to SSD. That job was sent to another node, so the particles had to be read in again, and this time the preload was 1,733,542 MB. The particles were unbinned from the start with a box size of 352 px.
I then went back to a lot of older jobs from different projects, only to notice the same behaviour: after 2D classification, when selecting only a part of the particles, the following job reads in nearly the same stack size.

How can it be that, with only a sixth of the particles, the preloaded stack size is nearly the same?

CryoSPARC version 3.3.1, Ubuntu 20.04 LTS


Hi @KiSchnelle,

The behaviour you’re seeing is due to how particles are organized within files. After extraction from micrographs, all particles from one micrograph are written out to a single MRC file. When 2D classification and Select 2D are run, even though you end up keeping only 1/6th of the particles, you’ll still likely select a nonzero number of particles from nearly every micrograph (unless some micrographs had mostly junk or very few picks, none of which made it into one of the good classes). When you then run a downstream reconstruction/refinement, the cache system looks up which files are referenced by the current set of particles, and unless the Select 2D job happened to exclude all particles from a particular micrograph, that file still needs to be cached in full. This is why the total file size is nearly the same before and after Select 2D.
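To see why nearly every file survives the selection, here is a small back-of-the-envelope simulation (not CryoSPARC's actual code; particle counts and file layout are hypothetical, roughly matching the numbers in this thread). The cache works at file granularity, so any micrograph file containing at least one selected particle must be copied to the SSD in full:

```python
import random

random.seed(0)
N_MICROGRAPHS = 3000
PARTICLES_PER_MICROGRAPH = 200       # ~600,000 particles total
BYTES_PER_PARTICLE = 352 * 352 * 4   # unbinned 352 px box, float32 pixels

# Each particle records which per-micrograph MRC file it lives in.
particles = [mic for mic in range(N_MICROGRAPHS)
             for _ in range(PARTICLES_PER_MICROGRAPH)]

# Select roughly 1/6 of the particles at random (standing in for Select 2D).
selected = random.sample(particles, len(particles) // 6)

# File-granularity caching: one selected particle is enough to pull in
# the whole file for that micrograph.
files_needed = set(selected)
file_size = PARTICLES_PER_MICROGRAPH * BYTES_PER_PARTICLE
cached_bytes = len(files_needed) * file_size
total_bytes = N_MICROGRAPHS * file_size

print(f"files still referenced: {len(files_needed)} / {N_MICROGRAPHS}")
print(f"cache size: {cached_bytes / total_bytes:.1%} of original")
```

With ~200 particles per micrograph, the chance that a random 1/6 selection misses every particle in a given micrograph is about (5/6)^200, i.e. essentially zero, so the cache ends up at ~100% of the original size even though 5/6 of the particles were discarded.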

You can work around this behaviour by re-writing only the desired particles out into separate MRC files. This is most easily done via the “Downsample Particles” job: run it with the Select 2D job’s particle outputs, keeping the box size parameter the same as before (so that no downsampling actually happens). In downstream jobs you should then see that the total file size cached is much smaller than before.
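As a rough sanity check on what the rewritten stack should occupy (a back-of-the-envelope estimate assuming float32 pixels and ignoring MRC headers), the size now scales with the selected particle count rather than with the original micrograph files:

```python
# Rough estimate of stack size once only the selected particles are
# written out (float32 pixels; MRC headers and padding ignored).
n_selected = 115_000
box = 352  # px
bytes_per_particle = box * box * 4

stack_bytes = n_selected * bytes_per_particle
print(f"~{stack_bytes / 1e9:.0f} GB")  # → ~57 GB, versus ~1.7 TB when
                                       # full micrograph stacks are cached
```

The real number will be somewhat larger once headers and any extra selected particles are included, but it is tens of GB rather than terabytes.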

Hope that helps,


Hi @mmclean ,

Thanks a lot! That of course makes total sense now. I tried both, normal re-extraction and downsampling with the same box size, and since extraction was much faster, I guess that’s the way to go for us.

Of course this trades preloading speed for capacity on the main storage (at least if you don’t delete the old extraction), but I think it’s worth it, considering that in this case I was apparently still loading the stack from the very first extraction, and re-extracting reduced 1.7 TB to 85 GB :D. So the gained speed definitely outweighs the lost storage.