Exposure group splitting using regular expressions

Dear cryoSPARC developers,

Could you add regular expression support to the Exposure Group Utilities function? Right now this function seems to work only with data collected by EPU, whereas file names from other software, such as SerialEM, may be more customized. Please let me know if I have not described the problem clearly. Thanks!

Best,
Wei

Hi @wxh180,

Which data collection software are you using at the moment? We would love to know how different applications distinguish different “exposure groups”. This would allow us to make an easy-to-use interface for users that aren’t comfortable enough to create regular expression strings.

In SerialEM, one records image shift groups by including the navigator position in the output file name. Take for example the file da_20190507_ProtX_2-2_126_0006_noDW.mrc. Here, I used the base name “da_20190507_ProtX_2-2” (for reasons), and SerialEM has added the navigator position (126) and the image number at this position (6). MotionCor2 also added “noDW.”

We usually employ a 9-hole image shift strategy, so the shift groups are indicated by the image number being 0001 - 0009. Splitting these files into shift groups can be supported trivially by specifying a field delimiter (e.g. “_”) and field index (e.g. 5) to select the group number.
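As a rough illustration of the delimiter-plus-index idea (not cryoSPARC's implementation), the following Python sketch groups file names by one field. The function name `group_by_field` and the second and third file names are made up for the example; the field index is counted from zero, so index 5 selects "0006" in the name above.

```python
from collections import defaultdict

def group_by_field(filenames, delimiter="_", field_index=5):
    """Group file names by one delimiter-separated field (zero-based index).

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name in filenames:
        fields = name.split(delimiter)
        groups[fields[field_index]].append(name)
    return dict(groups)

files = [
    "da_20190507_ProtX_2-2_126_0006_noDW.mrc",  # from the post above
    "da_20190507_ProtX_2-2_127_0006_noDW.mrc",  # invented: another navigator position
    "da_20190507_ProtX_2-2_126_0001_noDW.mrc",  # invented: another hole
]
groups = group_by_field(files)
for shift_group, members in sorted(groups.items()):
    print(shift_group, len(members))
```

Because the navigator position is its own field, its length does not affect which field is picked.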

The currently available splitting strategy should also work, for example by setting the index position to back, the start slice to 9, and the number of characters to consider to 1, but it doesn’t work in practice. The output also doesn’t show how the string is being parsed; for example, you could bold the characters that will be used in the output display, to help the user get it right.


Hi @stephan,

Here I am referring to SerialEM. The file name is {sampleID}{navigationItemID}{expoGroup}.tif. The problem comes from the varying string length of navigationItemID. If there were a way to slice the last 9 characters of the file name and use them to split the exposure groups, that would work as well. I tried to reprocess some old data with the new functions.

For live, we have been using *_{expoGroup}.tif to separate them into different exposure groups already.
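For anyone who wants to check such a wildcard pattern offline, here is a small Python sketch using the standard `fnmatch` module. The group labels and file names are invented for the example, and this says nothing about how the live session itself evaluates the pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical exposure group labels and file names, for illustration only.
expo_groups = ["0001", "0002"]
files = [
    "sampleA_12_0001.tif",
    "sampleA_1234_0001.tif",
    "sampleA_12_0002.tif",
]

# Build one wildcard pattern per group, in the form *_{expoGroup}.tif.
assignments = {
    group: [name for name in files if fnmatchcase(name, f"*_{group}.tif")]
    for group in expo_groups
}
for group, members in assignments.items():
    print(group, members)
```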

Thanks,
Wei

@wxh180 To use the last 9 characters, it should be possible to set the index position to back, the slice index to 9, and the number of characters to consider to 9. I haven’t been able to make this work; perhaps there is a bug with “back.”
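For reference, this is what slicing from the back looks like in plain Python. It is only a sketch of the intended behavior, not the cryoSPARC code; `slice_from_back` and the ".tif" file names are hypothetical, and the position is assumed to be counted from 1 at the last character.

```python
def slice_from_back(name, start, length):
    """Return `length` characters beginning `start` characters before the end.

    Hypothetical helper. The end index is computed from len(name) rather
    than as a negative offset, because name[-9:-9 + 9] is name[-9:0],
    which is an empty string in Python.
    """
    begin = len(name) - start
    if begin < 0 or length < 0:
        raise ValueError("slice falls outside the file name")
    return name[begin:begin + length]

# Invented names with navigator IDs of different lengths.
print(slice_from_back("sampleA_12_0003.tif", 9, 9))
print(slice_from_back("sampleA_1234_0003.tif", 9, 9))
```

Both calls return the same trailing 9 characters, so the length of the navigator ID does not matter.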

@DanielAsarnow Yes, I tried that as well, and it doesn’t work for me either. There seems to be a bug in the code that computes the range of the string to slice.

Hi @wxh180, @DanielAsarnow,

Thank you for the suggestions! I’ll double check that the “back” option works properly, and add an option to instead specify a “split by” character and use a “field index” to capture the correct group. Also, I’ll see if I can create a mode that will let you test your “split-by tokens”.

Hi @wxh180, @DanielAsarnow,

I’ve added the regular expression option and the option to create exposure groups using a separator in v2.12.4. I’ve also fixed the bug in the back index position for splitting using character indexes. Please try it out and let me know what you think.
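As an illustration of how a regular expression can pick out the group label, here is a sketch using Python's `re` module on the SerialEM-style name quoted earlier in the thread. The pattern and the second file name are assumptions made for the example; the exact syntax the new option expects is not shown in this thread.

```python
import re

# Assumed pattern: capture the 4-digit image number just before "_noDW.mrc".
pattern = re.compile(r"_(\d{4})_noDW\.mrc$")

files = [
    "da_20190507_ProtX_2-2_126_0006_noDW.mrc",   # from the thread
    "da_20190507_ProtX_2-2_1270_0006_noDW.mrc",  # invented: longer navigator ID
]

labels = []
for name in files:
    match = pattern.search(name)
    # Fall back to None when a name does not follow the expected layout.
    labels.append(match.group(1) if match else None)
print(labels)
```

Anchoring the pattern at the end of the name keeps it independent of how long the earlier fields are.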

More information on how to use the new options here:


@stephan Tested. Works well now. Thanks!


@DanielAsarnow Do you have plans to update pyem to support conversion between cryoSPARC and Relion after the recent updates? Thanks for your handy tool!

Yes, I’ve been working on fluent Relion 3.1 star file handling. I did make some changes today that make it very easy to support all new fields. (And as a side effect, converting micrograph jobs like CTF estimation now works properly).

Early next year :wink: