Data curation - remove unused micrographs

We have collected many datasets that contain bad micrographs, or micrographs that do not contain any particles that went into the final refinement. We would love to be rid of these micrographs as soon as possible, but I cannot find a simple way to do this in the GUI. Can any of the users here advise? We cannot use cryosparc-tools because of the way CryoSPARC is configured in our HPC environment, and our user base is quite varied, with a large pool of novice users who prefer to use the GUI.

Best,
Pranav


@pranav Could you please specify whether you are referring to micrographs created by CryoSPARC and stored inside CryoSPARC project directories, or to raw data that have been imported into CryoSPARC and are represented by symbolic links inside project directories?

Hi,
I wasn't referring to any particular result per se. The wish is for a tool or a protocol that yields a list of movies (raw or CryoSPARC-processed) written to a text file, so that the HPC sysadmin can easily parse it and remove those files from the file system.
Best,
Pranav

Edited to deal with symlinks:

This will depend on the job type, but you can always look up the naming convention for the rejected movies/micrographs. For instance, live sessions currently output a file that follows this name template

P*_S*_rejected_live_exposures.cs

Once you have located the cs-file (these are data arrays dumped in numpy format), something like this will yield the list

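import os
import numpy as np

# Replace with the path to your cs-file
cspath = "<enter your cs-file path here>"
x = np.load(cspath)
# Paths are stored as bytes, so decode them before printing
print("\n".join(t.decode() for t in x["movie_blob/path"]))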

This will likely list relative symbolic links. You can deal with that upstream, or modify the last line to print out real paths

# Assumes the cs-file sits one directory below the project directory
project_path = os.path.dirname(os.path.dirname(cspath))
print("\n".join(os.path.realpath(os.path.join(project_path, t.decode())) for t in x["movie_blob/path"]))

There will be variations on this if you want to remove rejected micrographs and not just movies, but you can work it out from this example.
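
To get the text file asked for earlier in the thread, the snippet could be wrapped into a small function. This is only a sketch under a few assumptions: the function name `write_path_list` and its parameters are made up for illustration, the project directory is taken to be two levels above the cs-file unless you pass `project_path` yourself, and the default field name `movie_blob/path` is simply the one from the example above. If a different field is needed, the error message lists the fields that the cs-file actually contains.

```python
import os
import numpy as np


def write_path_list(cspath, outpath, field="movie_blob/path", project_path=None):
    """Write the resolved paths stored in a cs-file to a text file, one per line.

    Hypothetical helper, not part of CryoSPARC.
    """
    data = np.load(cspath)
    names = data.dtype.names or ()
    if field not in names:
        raise KeyError(f"{field!r} not found; available fields: {names}")
    if project_path is None:
        # Assumption: the cs-file lives one directory below the project directory.
        project_path = os.path.dirname(os.path.dirname(os.path.abspath(cspath)))
    paths = []
    for entry in data[field]:
        # Entries are usually bytes; fall back to str() for anything else.
        rel = entry.decode() if isinstance(entry, bytes) else str(entry)
        paths.append(os.path.realpath(os.path.join(project_path, rel)))
    with open(outpath, "w") as fh:
        fh.write("\n".join(paths) + ("\n" if paths else ""))
    return paths
```

Calling `write_path_list(cspath, "rejected_movies.txt")` would then leave a plain list that a sysadmin can review before anything is deleted.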

You can also use csparc2star.py to make micrograph star files, along the lines of @pozharski's suggestion. I'd encourage you to keep records of the removed files; for example, when you fill out Table 1 in a manuscript you should report the true size of the data.
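
One possible way to keep such a record is to save the path and size of every file before it is removed. The sketch below uses only the Python standard library; the function name `write_removal_record` and the CSV layout are assumptions for illustration, and `paths` can be any list of file paths, such as the one produced from a cs-file earlier in this thread.

```python
import csv
import os


def write_removal_record(paths, record_path):
    """Save path and size in bytes for each file that is about to be removed.

    Hypothetical helper; returns the total size of the files that were found.
    """
    total = 0
    with open(record_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "size_bytes"])
        for p in paths:
            # Missing files are still recorded, with an empty size column.
            size = os.path.getsize(p) if os.path.isfile(p) else ""
            writer.writerow([p, size])
            total += size or 0
    return total
```

The returned total, together with the number of rows, gives the counts needed later when reporting how many movies were collected versus used.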

Thanks @pranav. We made a note of your suggestion.