Hard Drive Usage For Jobs

magdanowski · October 21, 2020, 9:40pm

Forgive me if this has been addressed in the past; if it has, please just direct me to that thread. I’m a new cryoSPARC user and I have a question for the community.
I collected a dataset on the Titan Krios with a K3 in super resolution mode (3500 movies, 36 frames and 3 sec each), but when I ran my motion correction job (default patch correction parameters), it crashed because it filled up the entire 2TB hard drive. We were able to swap in a 14TB drive and continue the job.
My question is: will further jobs continue to increase in size? Will I run into this problem later on with CTF estimation, 2D averages, etc.? I know I imported a lot of data for cryoSPARC to process, but it doesn’t seem practical for the average user if a structure-solving session requires many TBs to complete. Were all the files that filled my drive intermediates that get removed when the job completes?

stephan · October 26, 2020, 10:24pm

Hi @magdanowski

Thanks for asking! Actually what you experienced is an issue that a lot of people in the field have to deal with: cryo-EM data processing creates a lot of data, and storing it takes a lot of space. You have to be intentional with what you choose to keep.

In general, the amount of data you’re describing seems correct: K3 super resolution movies are huge, especially when decompressed (which, in cryoSPARC, happens on-the-fly in memory). The raw data is the first place where you will notice a lot of your storage being consumed.

The next place is during motion correction, as you encountered. Motion correction creates motion aligned, summed micrographs (36 frames --> 1 frame). Specifically, Patch Motion Correction in cryoSPARC will create 2 micrographs per movie (dose-weighted and non dose-weighted) as well as a background estimate (small) and two more small files for motion estimates (at the global and patch level). The micrographs created here are never duplicated in a typical processing pipeline in cryoSPARC; subsequent jobs will always make references to these files (CTF estimation, particle picking). For a 3500 movie K3 Super Resolution dataset, you are looking at about 2.66TB of data created from this job (based on ~380MB/micrograph, and ~5MB for auxiliary data).
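As a rough sanity check on that figure, the arithmetic can be sketched in Python. The sizes are the approximate ones quoted above; treating the ~5MB of auxiliary data as a per-movie total and using decimal units (1 TB = 10^6 MB) are assumptions of this sketch, and the function name is only illustrative.

```python
# Rough estimate of Patch Motion Correction output size.
# Sizes are the approximate figures quoted in the post; decimal units (1 TB = 1e6 MB).
# Assumption: the ~5 MB of auxiliary data is a per-movie total.

def motion_correction_output_tb(num_movies, mb_per_micrograph=380.0,
                                micrographs_per_movie=2, aux_mb_per_movie=5.0):
    """Return (micrograph_tb, auxiliary_tb) written by the job."""
    micrograph_mb = num_movies * micrographs_per_movie * mb_per_micrograph
    aux_mb = num_movies * aux_mb_per_movie
    return micrograph_mb / 1e6, aux_mb / 1e6

micrograph_tb, aux_tb = motion_correction_output_tb(3500)
print(f"micrographs: {micrograph_tb:.2f} TB, auxiliary: {aux_tb:.4f} TB")
# prints "micrographs: 2.66 TB, auxiliary: 0.0175 TB"
```

Under these assumptions the two micrographs per movie account for essentially all of the output, which is why the auxiliary files barely change the total.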

The next big producers of data are the particle extraction jobs (Extract From Micrographs and Local Motion Correction). cryoSPARC creates a single stack of particles for each micrograph, containing all the particles extracted from it. Depending on the concentration of particles in your micrographs and the box size you extract them at, the total size can range anywhere from 250GB to 2TB+ for a 3500 movie dataset.
You can calculate the exact size of a particle dataset with the following formula (assume header_length is 1152):
[Image: particle dataset size formula]
For example:

A 1,000,000 particle dataset with box size 256 will have a total size of 262.1 GB
A 2,000,000 particle dataset with box size 432 will have a total size of 1493 GB
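
Both examples are consistent with 4 bytes per pixel plus a negligible header term, which a short Python sketch can check. This is a reconstruction under assumptions, not necessarily the exact formula from the post: it assumes 32-bit floating-point pixels, one header of header_length bytes per particle stack file (one stack per micrograph, 3500 here), and decimal gigabytes. The function and parameter names are illustrative.

```python
# Hedged reconstruction of the particle dataset size estimate.
# Assumptions: float32 pixels (4 bytes), one header per stack file,
# one stack per micrograph, decimal units (1 GB = 1e9 bytes).

def particle_dataset_size_gb(num_particles, box_size, num_stacks=3500,
                             header_length=1152, bytes_per_pixel=4):
    """Estimate the on-disk size in GB of an extracted particle dataset."""
    pixel_bytes = num_particles * box_size * box_size * bytes_per_pixel
    header_bytes = num_stacks * header_length
    return (pixel_bytes + header_bytes) / 1e9

print(f"{particle_dataset_size_gb(1_000_000, 256):.1f} GB")  # prints "262.1 GB"
print(f"{particle_dataset_size_gb(2_000_000, 432):.1f} GB")  # prints "1493.0 GB"
```

Because the size grows with the square of the box size, extracting at a smaller box (or downsampling at extraction) reduces storage far more than trimming the particle count by a similar fraction.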

The reconstruction phase of processing in cryoSPARC doesn’t create nearly as much data as the motion correction and particle extraction jobs. Depending on the reconstruction box size, refinement jobs create between 2GB and 15GB+ of maps per job.

When space is a concern, the Exposure Curation job in cryoSPARC becomes crucial: you can use it to filter your dataset based on key attributes and delete the files that have been rejected. You can also use the “Clear Intermediate Results” functionality, which deletes intermediate results created by iterative jobs (2D Classification, Ab-Initio Reconstruction, Non-Uniform Refinement). Take a look at our Data Management tutorial for more information on tools you can use in cryoSPARC to manage storage:

https://cryosparc.com/docs/tutorials/data-management

I hope this helps.