XML files missing for some data — how to deal with exposure groups?

MetabolicNerd · August 31, 2025, 7:12pm

Hi CryoSPARC community! Long time lurker here. Excited to be hands-on involved in the community now that I have data to process.

I had microscope time on a Titan Krios a few days ago. During the run, I collected two datasets on two grids with my protein (.EER format). I finished my run on the first grid on the first day, and on the following day, I re-Atlas’ed, re-calibrated image shift, etc. for the second grid. Then, I collected. However, during the second collection, EPU failed to write .JPG/.MRC/.XML files. Those files exist for the first grid but not for the second. The .EER files, which are of course the most important, are fine for both grids.

However, as I’m learning more about the data processing pipeline, I’ve learned that jobs like global CTF refinement require exposure group information. In our lab, we typically extract this information from the .XML files, but I won’t have them for the second dataset this time around. I see that you can extract exposure groups from the .EER file names (e.g. FoilHole_11111111_Data_11111111_50_YYYYMMDD_111111_EER.eer, where “50” is the exposure group). However, I am concerned that I will be unable to “merge” the two datasets. Is there any way to make a 1-to-1 mapping of these two datasets’ exposure groups? Is “Exposure Group 50” of Dataset 1 comparable to “Exposure Group 50” of Dataset 2?

I plan to analyze the data in both RELION and CryoSPARC, and I noticed that in RELION there’s a way to k-means cluster exposure groups (I’m still a bit unfamiliar with this process, so I’m reading up on how it works). Maybe that would take care of the discrepancy I have with .XML file availability?

Thank you all for your help.

leetleyang · September 1, 2025, 8:27am

Hi,

Setting aside whether AFIS-group numbering is equivalent across sessions (EPU’s user guide is vague on this point), the hole- and beam-image shift patterns are likely different between the two sessions because each grid sits in a different orientation. You’d want to assign non-overlapping exposure groups to the two datasets regardless.

For the second dataset, it’s fine to assign exposure groups by string-splitting the pathname.
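
A minimal sketch of that string-splitting, in case it helps. It assumes the file-name pattern quoted in the first post (FoilHole_<id>_Data_<id>_<group>_<date>_<time>_EER.eer); the function names, the dataset index and the offset of 1000 are hypothetical choices for illustration, not anything CryoSPARC or EPU prescribes. The offset is just one simple way to keep the two sessions' group IDs from overlapping.

```python
import re
from pathlib import PurePosixPath

# Assumed EPU-style file name, taken from the example in this thread:
# FoilHole_<id>_Data_<id>_<group>_<YYYYMMDD>_<HHMMSS>_EER.eer
EER_NAME = re.compile(r"FoilHole_\d+_Data_\d+_(\d+)_\d{8}_\d{6}_EER\.eer")


def group_from_name(path):
    """Return the group number embedded in an EER file name."""
    name = PurePosixPath(path).name
    match = EER_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1))


def exposure_group_id(path, dataset_index, stride=1000):
    """Offset the group number per dataset so the sessions never collide.

    dataset_index and stride are hypothetical bookkeeping values:
    dataset 0 keeps groups 0..999, dataset 1 gets 1000..1999, and so on.
    """
    group = group_from_name(path)
    if group >= stride:
        raise ValueError(f"group {group} does not fit in stride {stride}")
    return dataset_index * stride + group


if __name__ == "__main__":
    example = "grid2/FoilHole_11111111_Data_11111111_50_20250829_111111_EER.eer"
    print(group_from_name(example))       # group number from the file name
    print(exposure_group_id(example, 1))  # same group, offset for dataset 1
```

The same parsing can be applied to the first dataset as well, so both sessions are labelled by one rule even though only one of them has .XML files.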

Cheers,
Yang

MetabolicNerd · September 1, 2025, 8:41am

I see, thank you for your input. Given that this is the case, at what point should I begin analyzing these two grids as separate datasets? Should I import them all together as a single dataset? Or should I analyze separately and “merge” at some point—and if so, what would that point be? Thank you!

leetleyang · September 1, 2025, 9:02am

Hi,

It’s relatively straightforward to separate the two datasets, and the exposure groups therein, post hoc. As a matter of convenience, I’d strongly consider processing the two particle image stacks together at the outset until a relatively homogeneous subset has been obtained, especially since imaging conditions are unlikely to be significantly different in this scenario.

Unless a reason to separate them emerges, e.g. a classification step or subset statistics indicating a clean split between the two, you may consider continuing to treat them as one dataset throughout.

One common reason to keep two datasets separate is if imaging conditions are likely to be different, e.g. collected on two different microscopes or, to a lesser degree, the same microscope but separated in time. In that case, it’s sometimes more prudent to ensure each half is internally consistent before attempting to merge them.

YMMV.

Cheers,
Yang