Purpose of exposure group curation with regex in End-to-end automation workflows

samhaysom · October 23, 2025, 10:47am

I’ve been having a look at the new automated workflows just published by Structura on bioRxiv, and they look incredibly useful. However, I cannot work out from the paper or the accompanying website material what the Exposure Group Utilities job is for. I get that it splits the micrographs into groups based on the filename, but not why this is useful or which later parts of the workflow use those groups. My guess is that it allows later CTF refinements to use beam-shift groups, but to my knowledge that requires a different job type and the import of accompanying XML files.

Could someone from Structura elaborate as to what this step is supposed to do? I also notice that for the different EMPIAR datasets different regex patterns needed to be used. How can we work out what regex pattern we should use for our data?

Thanks again for these workflows!

samhaysom · October 23, 2025, 10:50am

One thought: is the exposure grouping meant to split datasets that include micrographs from multiple collections? If so, does this step need to be skipped if we only have one set of micrographs in our data?

kstachowski · October 23, 2025, 1:23pm

Hi @samhaysom

The purpose of the Exposure Group Utilities job is to assign optics groups for Global CTF refinement that is carried out later in the pipeline. For all datasets processed, we used regex because the EMPIAR datasets all had optics group information present in the filenames. SerialEM and EPU have options to append the image shift groups via a string of text in the filename, typically in some form of _12345_. If you are not collecting data with beam image-shift, or you are utilizing XML files, the workflow would need to be edited to account for these changes.
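To work out a pattern for your own data, one option is to test candidate regexes offline against a handful of filenames before putting one into the job. The sketch below is only an illustration: the filenames, the five-digit `_12345_`-style token, and the helper name `group_by_pattern` are all hypothetical, and whether the Exposure Group Utilities job expects a capture group or a full match should be checked against its parameter description.

```python
import re
from collections import defaultdict

# Hypothetical filenames. The five-digit token between underscores stands in
# for the image-shift group ID; real SerialEM/EPU names may differ.
filenames = [
    "grid1_sq03_00012_0001_fractions.tiff",
    "grid1_sq03_00012_0002_fractions.tiff",
    "grid1_sq03_00013_0001_fractions.tiff",
    "no_group_here.tiff",
]

def group_by_pattern(names, pattern):
    """Group names by the first capture group of `pattern`.

    Returns (groups, unmatched): a dict mapping each captured ID to its
    filenames, and a list of filenames the pattern did not match.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    unmatched = []
    for name in names:
        match = regex.search(name)
        if match:
            groups[match.group(1)].append(name)
        else:
            unmatched.append(name)
    return dict(groups), unmatched

# Assumed pattern: exactly five digits delimited by underscores.
groups, unmatched = group_by_pattern(filenames, r"_(\d{5})_")

for group_id, members in sorted(groups.items()):
    print(group_id, len(members))
print("unmatched:", unmatched)
```

If the number of groups roughly matches the number of image-shift positions used during collection and nothing lands in `unmatched`, the pattern is probably a reasonable starting point.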

Best,

Kye

samhaysom · October 24, 2025, 9:11am

Great, thanks for the clarification. It might be useful to others to add something more to the documentation on how to tailor the regex to a particular output format (unless it’s already there and I’m being dim).