Why is it that, for multiple job types (2D classification, 3D refine, etc.), selecting fewer particles or splitting a particle stack does not change the amount of SSD space required by the downstream job?
I can provide specific examples if required. I’m running v3.1.
Thanks for any suggestions!
Nick
When you say “Splitting a particle stack” do you mean using the Particle Sets Tool? I’m fairly sure that 2D and 3D classification and the particle sets tool all just select a subset of particles to work from, but leave your particle stack unchanged. To use less space on your SSD, you’ll have to re-extract your selected/split particles, creating a new stack that’s only got what you want.
The Extract job writes one .mrc file (.mrcs in RELION) per micrograph, containing all the particles picked from that micrograph; basically, a stack of 2D particle images in a single .mrc file. As @posertinlab said, a simple selection job will not change those particle stacks. Instead, it creates a .cs file (the equivalent of a .star file in RELION) whose metadata points to the locations of the selected (or unselected) particles, something like 1@micrograph01.mrc, 15@micrograph01.mrc, 3@micrograph02.mrc, etc., where the numbers 1, 15, 3 designate each particle's position within its stack. It does not create new stack files, so a subsequent refinement job will need to load the entire stack even though only some of its particles were selected for refinement. I am not sure what happens to stack files from which no particles were selected.
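If you want to see this for yourself, you can open a selection job's .cs file with numpy. Here is a minimal sketch, assuming the metadata fields are named blob/path and blob/idx (that's from memory, so check dset.dtype.names on your own file) and using a placeholder path:

```python
import numpy as np

# A .cs file is a numpy structured array saved to disk. Field names below
# (blob/path, blob/idx) are from memory -- check dset.dtype.names if they differ.
dset = np.load("path/to/selected_particles.cs")  # placeholder path

paths = dset["blob/path"]  # which extracted stack file each particle lives in
idxs = dset["blob/idx"]    # the particle's position within that stack

print(f"{len(dset)} particles selected")
print(f"referencing {len(np.unique(paths))} original stack files")
# Roughly the idx@path bookkeeping described above (the index here is
# 0-based, unlike the 1-based RELION convention, if I remember right).
print(f"e.g. {idxs[0]}@{paths[0].decode()}")  # paths are stored as bytes in the files I've seen
```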
Alpay
Thanks @posertinlab and @alburse, that makes more sense to me now.
In my case, I had run an extract job with ~3.5M particles and I was indeed hoping to split them up into smaller batches using the Particle Sets Tool. Simply splitting the particle stack into smaller batches did not reduce the stack size enough to fit on the SSD; however, I am able to run an Ab-Initio job telling it to only use 300k particles (from the 3.5M stack) and it runs fine on the SSD. This makes me think it should be possible to have an option in the Particle Sets Tool to 'truly' split the particle stack, rather than having to re-extract and use up more disk space.
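For what it's worth, here is the rough back-of-the-envelope check I did on one of the splits. The .cs field names (blob/path, blob/shape) are assumptions based on what @alburse described, and the paths are placeholders:

```python
import os
import numpy as np

# Why splitting alone doesn't help: the SSD cache still has to hold every
# stack file the split references, not just the selected particles.
dset = np.load("path/to/split_0_particles.cs")  # placeholder path
project_dir = "path/to/project"                 # placeholder project directory

unique_files = {p.decode() for p in np.unique(dset["blob/path"])}
cache_bytes = sum(os.path.getsize(os.path.join(project_dir, f)) for f in unique_files)

# What a truly consolidated stack of just these particles would take,
# assuming square float32 boxes (blob/shape assumed to hold the box dimensions).
box = int(dset["blob/shape"][0][0])
ideal_bytes = len(dset) * box * box * 4

print(f"cache needed for this split as-is:  {cache_bytes / 1e9:.1f} GB")
print(f"if the split were truly re-stacked: ~{ideal_bytes / 1e9:.1f} GB")
```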
Maybe someone on the CryoSPARC @team can comment on this possibility?
Nick
@nschnick You should be able to re-consolidate each particle sub-stack through Downsample Particles. Just make sure not to apply cropping or resolution filters.
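For anyone who wants to confirm the consolidation afterwards: the output .cs should reference new stacks written by the Downsample job rather than the original extraction. A quick sketch, with the field name and paths assumed as earlier in the thread:

```python
import numpy as np

# After Downsample Particles (no cropping or resolution filters), the output
# should point at freshly written stacks in that job's own directory.
# blob/path is an assumed field name; the .cs path below is a placeholder.
out = np.load("path/to/downsampled_particles.cs")
for p in sorted({x.decode() for x in np.unique(out["blob/path"])})[:5]:
    print(p)  # expect paths under the Downsample job, not the original Extract job
```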
Thanks @leetleyang! I had missed that this issue is also mentioned here: https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/tutorial-ssd-particle-caching-in-cryosparc#tips-and-tricks
@team It might be nice if 3D refine and other job types that use the SSD had an option to split the particle stack into batches based on available SSD space directly within the job, but ultimately I'm not sure how practical/useful this would be.
Ugh, same problem when proceeding with a single class from a heterogeneous refinement job. Since our cache wipes regularly, I'm having to rewrite the entire particle set just to perform NU-refinement on a small subset. The workaround is "Downsample Particles" without actually downsampling. But that is a fresh extraction, right? Duplicating the data, and requiring a slow GPU-intensive job?
@CryoEM1 That's right, it is a new extraction, which duplicates the data.
I hope the CS team will consider adding a feature to 'truly' split a particle stack rather than having to re-extract. @spunjani
Hi @nschnick,
Thanks for the suggestion. We’ve added this feature to our tracker. For the time being, the Downsample Particles job is the correct tool to use in this situation.
We’ve considered the idea of splitting up a particle stack without having to duplicate data, but this can turn into a more complex process that might end up introducing more bugs and slowing down load times during caching.
Hi @stephan, OK, great, thanks for taking a look! I figured it probably wasn't so straightforward, or you all would've already done it!
Good to know there could be other issues associated with this process. I don't know how much you're thinking it could slow down the caching, but it would likely still be much better than having to run jobs without the SSD or having to re-extract first. For me, re-extracting tends to run into disk-space constraints when I have to do it for many datasets.