Cluster exposures by gridsquares

wjnicol · September 13, 2024, 6:57am

Hello,

I use EPU for data collection, which dumps the exposures in folders called Gridsquare_XXXXX.

For optimization reasons, I am processing the data square-wise and have been manually importing the data from every square individually. Is there a way, after doing a live export, to cluster the exposures by “parent folder” name in their path using the Exposure Group Utilities?

Thank you!

William

olibclarke · September 13, 2024, 1:58pm

Assuming you have imported them such that they still have the parent folder in the micrograph path, you should be able to do this with exposure group utilities (adjusting the slice index and number of characters appropriately). [Screenshot of the job parameters not preserved.]

wjnicol · September 19, 2024, 1:57am

Hey Oliver,

Thank you, good to know I can do it with the Exposure Group Utility. However, I tried multiple combinations and I don’t understand what they mean by “Start Slice Index” and “Numbers of characters to Consider”.

If the path for an exposure is /fsx/cryoem-raw-data/unicorn-data/12Sept2k24_sdm1k24aug22b/Images-Disc1/GridSquare_12279953/Data/FoilHole_12305835_Data_12281190_11_20240912_180511_EER.eer, and I want to segregate by the folder “GridSquare_XXXXXXXX”, how do I set those parameters?

Thank you for the help,

William

olibclarke · September 19, 2024, 3:29pm

The start slice index is where in the string to look for the pattern to cluster by (IIRC you can either start from the end or the beginning, but let’s assume the latter).

The number of characters to consider is the number of characters that are present in the variable substring you want to cluster on. So in your example, you want to catch just the 12279953 after GridSquare_.

There are 83 characters before this point, so I think the start slice index should be 83 (indexing starts from 0, if I recall), and the number of characters to consider should be 8. But just try it, have a look at what it is doing in the log, then adjust and rerun if needed.
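A quick way to check both values is to compute them from the path itself. This is only a sketch in plain Python, which counts from 0; whether the job's Start Slice Index counts the same way is an assumption to verify against the job log.

```python
# The example path from this thread.
path = (
    "/fsx/cryoem-raw-data/unicorn-data/12Sept2k24_sdm1k24aug22b/Images-Disc1/"
    "GridSquare_12279953/Data/"
    "FoilHole_12305835_Data_12281190_11_20240912_180511_EER.eer"
)

prefix = "GridSquare_"

# Zero-based position of the first character after "GridSquare_".
start = path.index(prefix) + len(prefix)

# Length of the numeric ID: everything up to the next "/".
n_chars = path.index("/", start) - start

print(start, n_chars, path[start:start + n_chars])
# prints: 83 8 12279953
```

Under zero-based counting the first digit of the ID sits at index 83; one position earlier would pick up the underscore.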

Cheers
Oli

kstachowski · September 19, 2024, 7:46pm

Hi all,

@wjnicol did you process/export these mics on a gridsquare by gridsquare basis in CS live such that you have X gridsquares and X export exposure jobs?

@olibclarke the full path should not be retained upon import as the symlinks are built and relative paths are established for further processing. Have you ever seen a partial or full path resolved as part of location/micrograph_path?

Best,
Kye

wjnicol · September 19, 2024, 8:14pm

Hello,

The CS Live session has the exposure group set to the parent folder containing all the GridSquare folders (EPU directory tree type). It is set to recursive and continuous. There are certain squares that have different ice quality, and I would like to separate them from the rest. Some grid squares were also acquired tilted, and I would like to separate those as well.

I then do “Export Exposures” to work on them in a workspace.

Trying this (screenshot of the job parameters not preserved) results in an error (screenshot not preserved).

Is the path of the movies lost when I do “Export”?

Would these be the right parameters in order to segregate by grid square?

Thank you,

wjnicol · September 19, 2024, 8:26pm

Never mind. The log confirms that the path is lost because the exposures were exported somewhere else.

I may need to reimport movies in a different way.

olibclarke · September 19, 2024, 9:31pm

No you’re absolutely right Kye, I didn’t think about that.

Maybe the best way to retain the info would be to rename (or symlink) the micrographs such that they have the grid square ID somewhere in the filename, so that exposure groups has something to work with? Could do it with a bash script fairly easily I think.
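For anyone who prefers Python to bash, here is a minimal sketch of that symlink idea. The function name link_with_grid_square and the output folder are made up for illustration, and it assumes the EPU layout from this thread (GridSquare_*/Data/ under one root). It is untested against a real CryoSPARC import, so whether the link name is what ends up in the micrograph path still needs checking.

```python
from pathlib import Path


def link_with_grid_square(epu_root: Path, out_dir: Path) -> list[Path]:
    """Symlink every file under GridSquare_*/Data/ into out_dir,
    prefixing each link name with its grid square folder name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for movie in sorted(epu_root.glob("GridSquare_*/Data/*")):
        if not movie.is_file():
            continue
        grid_square = movie.parent.parent.name  # e.g. "GridSquare_12279953"
        link = out_dir / f"{grid_square}_{movie.name}"
        if not link.exists():
            # Point the link at the absolute path of the original movie.
            link.symlink_to(movie.resolve())
        links.append(link)
    return links
```

Calling link_with_grid_square(Path("/path/to/Images-Disc1"), Path("/path/to/linked")) would produce links named like GridSquare_12279953_FoilHole_..., so the grid square ID sits at a fixed position at the start of every file name.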

kstachowski · September 20, 2024, 7:55pm

Hi All,

Thanks for your responses! There is currently no easy way to do this via the GUI, unless you import each set of movies by gridsquare (setting the exposure group ID parameter for each set in the import job), reprocess the mics through Patch CTF Estimation, and then reassign particles to those micrographs. This is definitely less than ideal, so @nfrasser made a cs-tools script that should accomplish what you are after, @wjnicol.

To use this script (located at the bottom of my message):

1. Ensure you have a working, version-matched cryosparc-tools environment.
2. Copy the script from below, save it as split_exposures_by_grid_square.py, and ensure it has the correct permissions to run.
3. Change the license, email, password, host, and base_port values in the script to match your instance.
4. Launch the script from the command line with python split_exposures_by_grid_square.py P1 W2 J3 accepted_exposures, where:
- P1 is the project containing the Live Export Exposures job
- W2 is the workspace where you would like the split exposure groups to be placed
- J3 is the Live Export Exposures job from your CS Live session
- accepted_exposures is the output group of the Live Export Exposures job (this does not change)

To use the script on the outputs of a Patch CTF Estimation job instead, pass the project and job IDs of that Patch CTF Estimation job and change the output group to exposures.

Additionally, we have noted a feature request for maintaining full path info in some format such that analyses like these can be performed within the GUI.

Please let me know if you have any issues.

Best,
Kye

# e.g., python split_exposures_by_grid_square.py P3 W4 J42 accepted_exposures
import sys
from pathlib import Path
from cryosparc.tools import CryoSPARC

# Parse arguments
assert len(sys.argv) == 5, f"Usage: python {sys.argv[0]} <PROJECT-ID> <WORKSPACE-ID> <MOVIES-JOB-ID> <MOVIES-OUTPUT-NAME>"
project_uid, workspace_uid, job_uid, input_name = sys.argv[1:]

# Connect to CryoSPARC
cs = CryoSPARC(  # SUBSTITUTE CRYOSPARC INSTANCE DETAILS HERE
    license="<LICENSE ID>",
    email="<EMAIL>",
    password="<PASSWORD>",
    host="<HOST NAME>",
    base_port=<PORT NUMBER>,
) 
assert cs.test_connection()

# Load entities
project = cs.find_project(project_uid)
workspace = project.find_workspace(workspace_uid)
job = project.find_job(job_uid)
movies = job.load_output(input_name, slots=["movie_blob"])

# Split up movies dataset by resolving symlinks
print(f"Splitting {len(movies)} exposures by grid square folder...")
project_dir = Path(project.dir())
grid_square_idxs: dict[str, list[int]] = {}
for i, link_path in enumerate(movies["movie_blob/path"]):
    link_path_abs = project_dir / str(link_path)
    movie_path_abs = link_path_abs.resolve()
    if movie_path_abs.parent.name != "Data":
        print(f"WARNING: Movie is not in a GridSquare/Data folder: {movie_path_abs}", file=sys.stderr)
        continue

    # Resolve original movie path and add to a grid square index group
    grid_square_dir_path = movie_path_abs.parent.parent
    if grid_square_dir_path.name not in grid_square_idxs:
        print(f"Found grid square folder {grid_square_dir_path.name}")
        grid_square_idxs[grid_square_dir_path.name] = []
    grid_square_idxs[grid_square_dir_path.name].append(i)

assert grid_square_idxs, "ERROR: Selected exposures output has no movies or no matching movies with the correct absolute path format"

# Create external job and add a group for each grid square
print(f"Saving {len(grid_square_idxs)} external result jobs...")
for grid_square, idxs in grid_square_idxs.items():
    saved_job_uid = workspace.save_external_result(
        movies.take(idxs),
        type="exposure",
        name=grid_square,
        slots=["movie_blob"],
        passthrough=(job_uid, input_name),
        title=f"Exposures for {grid_square}",
    )
    print(f"Saved {grid_square} to {saved_job_uid}")

wjnicol · September 20, 2024, 10:49pm

Hello,

I wasn’t expecting so much, this is awesome. Thanks!

The bane of my existence right now is that the filename created by EPU does not contain gridsquare info. Not that I could figure out, at least… so if the full path is lost, the info is lost.

For sample optimization when trying to find good ice conditions, it’s very useful to associate good classes → which particles → which images → which squares → ultimately which holes (that one is also hard to figure out with EPU/ATHENA because the names seem to be generated somewhat randomly and don’t follow any logic).

The way I am getting around it for now is to add every gridsquare folder as a separate exposure group. It’s manageable when I have fewer than 10, but beyond that it’s tedious and prone to confusion and mistakes.

Anyways, thank you for the script and looking forward to the implementation of the full path soon!

Best,

Will

kstachowski · September 22, 2024, 2:26pm

Hi @wjnicol,

I should have mentioned what this script does. It resolves the Live Export Exposures symlinks into full paths and then uses the gridsquare information from each resolved path to establish the groups you want. It then saves these exposure groups as outputs of individual external results jobs.
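To try the resolve-and-group step on its own, without connecting to a CryoSPARC instance, something like the following standard-library sketch could be used. The function name group_by_grid_square is hypothetical and not part of cryosparc-tools; it only mirrors the grouping logic and assumes the GridSquare_*/Data layout discussed in this thread.

```python
from pathlib import Path


def group_by_grid_square(link_paths: list[Path]) -> dict[str, list[int]]:
    """Map each grid square folder name to the indices of the links
    whose resolved target sits in <GridSquare_*>/Data/."""
    groups: dict[str, list[int]] = {}
    for i, link in enumerate(link_paths):
        target = Path(link).resolve()  # follow the symlink to the original movie
        if target.parent.name != "Data":
            continue  # not in the expected EPU layout, so skip it
        groups.setdefault(target.parent.parent.name, []).append(i)
    return groups
```

The returned index lists play the same role as the per-square index groups the script passes to movies.take() before saving each external result.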

Best,
Kye