Batch clean up of multiple projects

Hi,

I am certain this is not a good thing to do, but it might nevertheless be the only feasible way of carrying out the task we have at hand.

As a data curator/facility manager, one sometimes faces the situation where the storage server is full of old CS projects that need to be cleaned up and archived. Frequently, many of the project users are no longer around to do so.
To deal with this, and since some of the projects are no longer connected to an instance, I thought the most practical approach would be to remove heavy files, such as motion corrected micrographs, by pointing the rm command at those files in specific directories: e.g., ‘rm /path/to/cs-projectname/J*/motioncorrected/*’ to remove any motion corrected micrographs. This could be applied to all the projects we want to archive if we place them all in a shared directory, which is very practical.
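Before running any rm, a dry run that only lists the matching files and adds up their sizes may be useful. The following is a minimal sketch, assuming every project sits directly under one shared directory and follows the J*/motioncorrected/ layout from the example above; the function name find_motioncorrected_files is hypothetical, and nothing is deleted.

```python
from pathlib import Path


def find_motioncorrected_files(shared_root):
    """List files matching <shared_root>/<project>/J*/motioncorrected/*.

    Returns (sorted list of paths, total size in bytes). Read-only:
    nothing is deleted. Symlinks are skipped so that only real files
    stored inside the job directories are counted.
    """
    root = Path(shared_root)
    files = [
        p
        for p in root.glob("*/J*/motioncorrected/*")
        if p.is_file() and not p.is_symlink()
    ]
    total_bytes = sum(p.stat().st_size for p in files)
    return sorted(files), total_bytes
```

Calling find_motioncorrected_files("/path/to/shared-dir") and printing the returned list and total gives a chance to review what the rm pattern would hit before anything is removed.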

I can already foresee possible difficulties, and the need to provide project recovery instructions for users who might come up with the idea of digging into this old data. For instance, upon attaching the project, the motion correction job would still be marked as complete, but any jobs depending on motion corrected micrographs would not be able to find those files. A simple workaround that I think should be feasible is to clear the motion correction job and rerun it prior to launching any new jobs.
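One way to support such recovery instructions could be to record, per job, which files were removed, so that whoever restores the project knows which jobs would need to be cleared and rerun. This is only a sketch: the function names and the JSON manifest format are hypothetical, not a CryoSPARC feature, and the manifest is written to a path chosen by the caller rather than assumed to belong inside the project directory.

```python
import json
from collections import defaultdict
from pathlib import Path


def build_removal_manifest(project_dir, removed_files):
    """Group removed file paths by job directory (e.g. "J12").

    Paths in the result are relative to project_dir, so the manifest
    stays meaningful if the project is later restored elsewhere.
    """
    project = Path(project_dir)
    by_job = defaultdict(list)
    for f in removed_files:
        rel = Path(f).relative_to(project)
        by_job[rel.parts[0]].append(rel.as_posix())
    return {job: sorted(paths) for job, paths in sorted(by_job.items())}


def write_removal_manifest(project_dir, removed_files, out_path):
    """Write the manifest as JSON to out_path (hypothetical format)."""
    manifest = build_removal_manifest(project_dir, removed_files)
    out = Path(out_path)
    out.write_text(json.dumps(manifest, indent=2))
    return out
```

The keys of the manifest are then the jobs whose outputs are incomplete in the archived copy.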

Most likely there will also be other problems, as a project recovered from our archive would probably end up in a completely different path in the file system than its original location… here I imagine the problems would be bigger.

I would like to hear about and discuss the implications and follow-up complications of removing data in this way, in case any of these projects ever has to be recovered from our storage archive/tape. I am also interested to hear whether anyone is already using cryosparcm cli or cryosparc-tools for such workflows.

Please do not hesitate to share some of your experience!

Thanks,

André

While a project directory is attached to and under the control of a CryoSPARC instance (v4.3 or later), a cleanup tool is available.
Removing files from project directories outside of CryoSPARC will likely lead to:

  • in case of attached projects, CryoSPARC malfunction
  • in case of unattached project directories, projects that can no longer be attached to and/or function with CryoSPARC

We recommend the implementation of user offboarding procedures that minimize the risk of accumulating large amounts of “orphaned” data in the future.

Hello @wtempel

Thanks a lot for the reply. The cleanup tool is fantastic, and we are very thankful that Structura has developed it and made it available so that all of us can clean up our projects.

However, most of the projects I am talking about date back to 2018-2022 and have not been associated with any CryoSPARC instance for a long time… is there a recommendation for such projects? Most of them are not even on storage that can be connected to a CryoSPARC instance. Attaching them would require first transferring them to storage attached to an instance, and that would be painfully slow. We really need to move forward with this task within a month, and my rough calculations show that we would not be able to go through more than 30% of the projects if we do it that way, which is problematic.
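For anyone wanting to redo this kind of rough calculation for their own site, here is a minimal sketch. The function name is hypothetical, and the inputs (total archive size, sustained transfer rate, deadline) have to be supplied by the reader; the values in the comment are placeholders, not figures from our facility.

```python
def fraction_transferable(total_tb, throughput_mb_per_s, days):
    """Fraction of total_tb that can be copied within `days`
    at a sustained throughput, using decimal units (1 TB = 1e6 MB)."""
    seconds = days * 24 * 3600
    transferable_tb = throughput_mb_per_s * seconds / 1e6
    return min(1.0, transferable_tb / total_tb)


# Placeholder inputs: 1000 TB of projects, 100 MB/s sustained, 30 days.
# 100 MB/s * 2,592,000 s = 259.2 TB, i.e. about 26% of the total.
```

This ignores the time CryoSPARC itself needs to attach and clean each project, so the real fraction would be lower.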

Offboarding procedures are now implemented for new projects/users.

Thank you,

André