Reducing the number of files

I recently archived a large project after processing. It contained 18TB and 1.6 million files. When I discussed this with our sysadmins they were shocked by the number of files and links. Just the file list is >200MB, so file listing and tar operations are very slow. This caused some real issues with our archive system performance. This is perhaps an extreme case, but in my experience projects with hundreds of thousands of files are pretty common.

I wonder if CryoSparc should consider using techniques to reduce the number of files, such as storing images in HDF5 files. This is common in other fields, e.g. serial crystallography, where they have extremely large numbers of images.
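To make the idea concrete, packing many small image files into one HDF5 container could look roughly like the sketch below. This is only an illustration of the general technique, not anything CryoSPARC does today; the folder layout, file names, and dataset names are made up, and it assumes one 2D image per MRC file (using the `mrcfile` and `h5py` libraries).

```python
# Sketch: pack many small per-image MRC files into a single HDF5 container.
# Paths and dataset names are hypothetical; not a CryoSPARC feature.
import glob
import h5py
import mrcfile
import numpy as np

mrc_paths = sorted(glob.glob("P123/J45/extract/*.mrc"))  # hypothetical job folder

# Read one file to learn the image shape and dtype.
with mrcfile.open(mrc_paths[0], permissive=True) as mrc:
    sample = np.asarray(mrc.data)
n, h, w = len(mrc_paths), sample.shape[-2], sample.shape[-1]

with h5py.File("P123_J45_particles.h5", "w") as f:
    dset = f.create_dataset(
        "images", shape=(n, h, w), dtype=sample.dtype,
        chunks=(1, h, w), compression="gzip",
    )
    names = f.create_dataset("source_files", (n,), dtype=h5py.string_dtype())
    for i, path in enumerate(mrc_paths):
        with mrcfile.open(path, permissive=True) as mrc:
            dset[i] = np.asarray(mrc.data).reshape(h, w)  # assumes single-image MRC
        names[i] = path

print(f"packed {n} files into one HDF5 container")
```

One container per job (or per extraction pass) would collapse thousands of filesystem entries into a single file that tar and file listings handle easily.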


We have a similar issue: we run CS intensively :heart_eyes: on a SLURM cluster under a “common” CS account (each user has their own CS web account and can log in via the web), and the quota for the cryosparc account, after years of usage, currently reports above 17M files.

We of course try to delete old accounts and so on, but even so, it is becoming quite hard to track down all the projects. It would be great to have an automatic tool to track projects with a large number of files. Project size alone is not enough anymore :thinking:.
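For what it's worth, a rough stopgap in the same spirit, just a sketch rather than an existing CryoSPARC tool, would be to scan the project root and rank project folders by file count instead of size. The root path below is only an example.

```python
# Sketch: report file counts per project folder so directories with huge
# numbers of files stand out, not just large sizes. Not a CryoSPARC tool.
import os
import sys

root = sys.argv[1] if len(sys.argv) > 1 else "/data/cryosparc_projects"  # example path

counts = {}
for entry in os.scandir(root):
    if not entry.is_dir():
        continue
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(entry.path):
        total += len(filenames)
    counts[entry.name] = total

# Print projects with the most files first.
for name, total in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{total:>12,d}  {name}")
```

Run periodically (e.g. from cron), this at least flags which project folders are driving the file count before the quota report does.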

@sbliven We would like to learn more about the large project use case.

  • How long has the project been worked on?
  • Were there multiple contributors?
  • Have there been any interventions to manage the project’s size (such as: removal of “dead end” jobs, intermediate results)?
  • Do any job types stand out in their frequency of use?

I would like to thank @sbliven for posting this here, as we are colleagues and he is referring to a specific case I know firsthand. I can probably answer @wtempel's questions. Yes, intermediate results have indeed been deleted. We now have projects where it is easier to collect, say, 100k+ movies to get enough particles than to spend a lot of time optimizing sample concentration. For very delicate samples, it is sometimes really hard to reach high concentrations. So for these projects with a lot of movies, it has become common to end up with large P folders containing a lot of files.
These may still be a minority of cases in the CryoSPARC landscape, but given that cameras get faster every year, and that people are increasingly looking at samples that have remained difficult for structural biology, I would not exclude that large datasets will become more common. In this light, it could make sense to start thinking about a different way of producing files within CS. Thanks for your help!
