Import ImageShift from .mdoc files (SerialEM)

ccgauvin94 · November 16, 2023, 3:44am

Short script using cryosparc-tools to import ImageShifts from micrograph-associated .mdoc files. To use, fill out the variables at the top, and run this in the directory with the .mdoc files. It should make a new job in the specified workspace with an exposure output that you can then put into the Exposure Group Utilities job, and use to cluster via image shift.

Example job:

Exposure group utilities follow-up job:

(hopefully your shift groups look better - we were having hole centering issues on this dataset)

from pathlib import Path
import re
from cryosparc.tools import CryoSPARC
from getpass import getpass

##### Instructions #####
# Place a copy of this file in each directory you have .mdoc
# files in, load the cryosparc-tools conda environment (if using)
# and run this script: python3 cs_mdoc_image_shift.py

##### Variables #####
expected_groups = 25
cs_license = 'asdf1348-replace-me-123lkj12'
cs_hostname = 'mycryosparcinstance.mysupercoolinstitution.com'
cs_port = 39000
cs_email = 'myemail@mydomain.com'
cs_project = 'P83'
cs_workspace = 'W1'
cs_job = 'J56'
cs_job_output = 'exposure'

##### (il)Logic #####
#Set directory and get list of mdocs
mdoc_directory = Path(".")
mdoc_list = list(mdoc_directory.glob('**/*.mdoc'))

#Initialize list that will hold dictionary with exposure information
exposures = []

#Iterate through list of mdoc files
for i in range(len(mdoc_list)):

    #Open individual mdoc file
    with open(mdoc_list[i]) as mdoc:

        #Read each line in mdoc file
        for line in mdoc:

            #if the line matches, store the name of the file and also the image shift values into a list
            if re.search("ImageShift",line):
                shift_data = {
                    "exposure": str(mdoc_list[i]).rsplit(sep='.',maxsplit = 1)[0],
                    "x_shift": line.rsplit()[2],
                    "y_shift": line.rsplit()[3],
                    }
                exposures.append(shift_data)

#Get CS password
print("Please enter CryoSPARC password: ")
cs_password = getpass()

#Connect to CS
cs = CryoSPARC(license=cs_license, email=cs_email, password=cs_password, host=cs_hostname, base_port=cs_port)
assert cs.test_connection()

#Find dataset and copy it to a new dataset
project = cs.find_project(cs_project)
job = cs.find_job(cs_project, cs_job)
cs_exposures = job.load_output(cs_job_output)
grouped_exposures = cs_exposures.copy()

#For every row in the dataset, search entire mdoc data and see if it fits a known movie and image shift. There's probably a much faster way to do this.
for row in grouped_exposures.rows():
    for i in range(len(exposures)):
        if exposures[i]['exposure'] == row['movie_blob/path'].rsplit(sep='/')[-1]:
            row['mscope_params/beam_shift'][0] = exposures[i]['x_shift']
            row['mscope_params/beam_shift'][1] = exposures[i]['y_shift']
            row['mscope_params/beam_shift_known'] = 1
            break

#Make a new external job with the updated values, and passthrough the remaining, non-updated values
project.save_external_result(
    workspace_uid=cs_workspace,
    dataset=grouped_exposures,
    type="exposure",
    name="exposure",
    slots=["mscope_params"],
    passthrough=(cs_job,cs_job_output),
    title="ImageShift Updated Exposures"
    )

github.com

ccgauvin94/cs_jiffies/blob/main/cs_mdoc_image_shift.py

from pathlib import Path
import re
import sys
from cryosparc.tools import CryoSPARC
from getpass import getpass
from dotenv import dotenv_values
import argparse
from typing import Iterable, Dict, List
import numpy as np



##### Instructions #####
# Place a copy of this file in each directory you have .mdoc
# files in, load the cryosparc-tools conda environment (if using)
# and run this script: python3 cs_mdoc_image_shift.py

##### Variables #####
CONFIG = dotenv_values(".env")
CS_JOB_OUTPUT_TYPE = 'all_exposures'

This file has been truncated. show original

wtempel · November 16, 2023, 4:06pm

@ccgauvin94 Thanks for sharing your script!

Dmitry · June 6, 2024, 8:38pm

@wtempel is it possible to incorporate this script to the future versions of CS as protocol?

Thank you!

Dmitry · June 7, 2024, 4:34pm

hello @ccgauvin94 and @wtempel

Could you please clarify what “J” job has to be specified?

I tried the Import micrographs J job // Patch CTF J job but I am always having the following error -

File “/home/cryosparc_user/.local/lib/python3.11/site-packages/cryosparc/job.p y”, line 444, in load_output
raise TypeError(f"Job {self.project_uid}-{self.uid} does not have any result s for output {name}")
TypeError: Job P15-J28 does not have any results for output all_exposures

Thank you.

Kind regards,
Dmitry

The full command I am running looks like that

cryosparc_user@cryoem1:/data/dmitry//Movies$ python cs_mdoc_image_shift.py -e dmitry.XXXXX@XXXXXXX -p P15 -w W1 -j J28 -d /data/dmitry/Roman/PSI_lhca1_lhca4/set1_160424/240413_PS1_SPA/Movies

DanielAsarnow · June 7, 2024, 9:26pm

@ccgauvin94 have you been able to test if the CTF refinement is better using clustered shifts vs. shift pattern indices? Do the nominal shift groups differ from the index based groups at all?

I don’t normally save MDOC files per movie because it slows down data collection but interested to know!

ccgauvin94 · June 13, 2024, 10:19pm

I’ve seen improvements in the 5/100ths of an Angstrom on some datasets - nothing super noteworthy.

ccgauvin94 · June 13, 2024, 10:20pm

Any job that outputs all_exposures

If you don’t have that available, just run a curate exposures job.

Dmitry · July 25, 2024, 3:50pm

hello @ccgauvin94 , I dont know - something does not work here -
I am receiving the following error -

(cryosparc-tools-env) cryosparc_user@cryoem1:/data/dmitry/XX/XXX/set1_160424/240413_PS1_SPA/Movies$ python cs_mdoc_image_shift_4.py -e dmitry.XXXXX@XXXX -p P15 -w W1 -j J27 -d .
Please enter CryoSPARC password:
Connection succeeded to CryoSPARC command_core at http://cryoem1.itqb.unl.pt:39002
Connection succeeded to CryoSPARC command_vis at http://cryoem1.itqb.unl.pt:39003
Connection succeeded to CryoSPARC command_rtp at http://cryoem1.itqb.unl.pt:39005
Traceback (most recent call last):
File “cs_mdoc_image_shift_4.py”, line 83, in
cs_project, cs_job, grouped_exposures = get_exposures_from_job(args.project, args.job)
File “cs_mdoc_image_shift_4.py”, line 58, in get_exposures_from_job
print(f"Available output types for job {project}-{job}: {cs_job.output_types}")
AttributeError: ‘Job’ object has no attribute ‘output_types’

I tried both scripts. (by the way - what is the difference between them?)

Any ideas?

Thank you.

Kind regards,
Dmitry

ccgauvin94 · July 25, 2024, 4:52pm

Hi Dmitry,

One script clusters locally, the other uses clustering in CryoSPARC. I prefer the latter.

print(f"Available output types for job {project}-{job}: {cs_job.output_types}")

AttributeError: ‘Job’ object has no attribute ‘output_types’
It looks like the job you selected has no outputs. Can you verify the job number has exposure outputs?

Dmitry · July 26, 2024, 3:26pm

Thank you @ccgauvin94,

I have revised the script and used the last one.
here is my command -

(cryosparc-tools-env) cryosparc_user@cryoem1:/data/dmitry/XXXX/XXXXXX/set1_160424/240413_XXXXX/Movies$ python u_cs_mdoc_image_shift.py -e dmitry.XXX@XXX.XXX.XXX -p P15 -w W1 -j J311 -d .

here is the image of J311
The error is the following -

(cryosparc-tools-env) cryosparc_user@cryoem1:/data/dmitry/XXXXX/XXXX/set1_160424/240413_XXXX/Movies$ python u_cs_mdoc_image_shift.py -e dmitry.XXXXX@XXX.XXX.XX -p P15 -w W1 -j J311 -d .
Please enter CryoSPARC password:
Connection succeeded to CryoSPARC command_core at http://cryoem1.XXX.XXX.XXX:39002
Connection succeeded to CryoSPARC command_vis at http://cryoem1.XXX.XXX.XX:39003
Connection succeeded to CryoSPARC command_rtp at http://cryoem1.XXX.XXX.XX:39005
Traceback (most recent call last):
File “u_cs_mdoc_image_shift.py”, line 79, in
cs_project, cs_job, grouped_exposures = get_exposures_from_job(args.project,args.job)
File “u_cs_mdoc_image_shift.py”, line 61, in get_exposures_from_job
cs_exposures = cs_job.load_output(CS_JOB_OUTPUT_TYPE)
File “/home/cryosparc_user/.conda/envs/cryosparc-tools-env/lib/python3.7/site-packages/cryosparc/job.py”, line 444, in load_output
raise TypeError(f"Job {self.project_uid}-{self.uid} does not have any results for output {name}")
TypeError: Job P15-J311 does not have any results for output all_exposures

stefan · July 30, 2025, 3:09am

Most likely your movie input job generates an imported_movies output rather than an all_exposures output.
Another issue I ran into with this script is that it does not account for some of the random (to me) numbers cryosparc appends before the filename when importing it. I think I was able to fix it:

from pathlib import Path
import re
import sys
from cryosparc.tools import CryoSPARC
from getpass import getpass
from dotenv import dotenv_values
import argparse
from typing import Iterable, Dict, List
import numpy as np



##### Instructions #####
# Place a copy of this file in each directory you have .mdoc
# files in, load the cryosparc-tools conda environment (if using)
# and run this script: python3 cs_mdoc_image_shift.py

##### Variables #####
CONFIG = dotenv_values(".env")
CS_JOB_OUTPUT_TYPE = 'imported_movies'


def parser(args:List) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument('--email','-e',type=str, help="Email or account used in cryosparc")
    parser.add_argument('--project', '-p',type=str, help='Cryosparc project. i.e. P83')
    parser.add_argument('--workspace','-w',type=str, help='Workspace number, i.e. W1')
    parser.add_argument('--job','-j',type=str,help='Input job number, i.e. J56')
    parser.add_argument('--mdoc_dir','-d',type=str, help='Directory where the exposures mdoc files are located')
    parser.add_argument('--recursive', '-r', action='store_true', default=False, help="Whether to look for the mdoc file recursively from the mdoc_dir.")
    args =  parser.parse_args(args)
    if any([args.email is None, args.project is None, args.job is None, args.workspace is None]):
        parser.print_help()
        sys.exit(1)
    return args


def recursively_find_mdoc(mdoc_dir:Path, filename:'str') -> Iterable:
    ##### (il)Logic #####
    #Set directory and get list of mdocs
    mdoc_directory = Path(mdoc_dir)
    if not mdoc_dir.exists():
        raise NotADirectoryError(f'Directory {mdoc_dir} does not exist.')
    
    mdoc_list = mdoc_directory.glob(f'**/{filename}')
    return next(mdoc_list)

#Iterate through list of mdoc files
def parse_mdoc(mdoc_file:Path) -> Dict:
    #Open individual mdoc file
    mdoc = mdoc_file.read_text()
    pattern = r"ImageShift = (-?\d+\.\d+)\s+(-?\d+\.\d+)"
    matches = re.findall(pattern, mdoc)
    if matches:
        return np.array(matches[0]).astype(float)
                

def get_exposures_from_job(project:str,job) -> np.ndarray:
    cs_project = cs.find_project(project)
    cs_job = cs.find_job(project,job)
    cs_exposures = cs_job.load_output(CS_JOB_OUTPUT_TYPE)
    return cs_project, cs_job, cs_exposures.copy()    


def progress(i,total,interval=2):
    completion = round(i/total*100)
    if completion % interval == 0:
        print(f'{completion}% completed. {i}/{total}', end='\r')

if __name__ == "__main__":
    
    args = parser(sys.argv[1:])
    cs_password = getpass("Please enter CryoSPARC password: ")
    #Connect to CS
    cs = CryoSPARC(license=CONFIG['cs_license'], email=args.email, password=cs_password, host=CONFIG['cs_hostname'], base_port=int(CONFIG['cs_port']))
    assert cs.test_connection(), "Connection to cryosparc failed"

    #Find dataset and copy it to a new dataset
    cs_project, cs_job, grouped_exposures = get_exposures_from_job(args.project,args.job)
    #For every row in the dataset, search entire mdoc data and see if it fits a known movie and image shift. There's probably a much faster way to do this.
    total = len(grouped_exposures)
    for ind,row in enumerate(grouped_exposures.rows()):
#        exposure = row['movie_blob/path'].rsplit(sep='/')[-1]
#        mdocfile = Path(args.mdoc_dir,exposure+'.mdoc')
#        if args.recursive:
#            mdocfile = recursively_find_mdoc(Path(args.mdoc_dir),filename=exposure+'.mdoc')  
###      new snippet inserted by SP
        original_exposure = row['movie_blob/path'].rsplit(sep='/')[-1]
        # Remove CryoSPARC's numeric prefix
        clean_exposure = re.sub(r'^\d+_', '', original_exposure)
        mdocfile = Path(args.mdoc_dir, clean_exposure + '.mdoc')
        if args.recursive:
            mdocfile = recursively_find_mdoc(Path(args.mdoc_dir), filename=clean_exposure + '.mdoc')
###
        mdoc = parse_mdoc(mdocfile)
        # if exposures[i]['exposure'] == row['movie_blob/path'].rsplit(sep='/')[-1]:
        row['mscope_params/beam_shift'] = mdoc
        row['mscope_params/beam_shift_known'] = 1
        progress(ind,total)

    #Make a new external job with the updated values, and passthrough the remaining, non-updated values
    cs_project.save_external_result(
        workspace_uid=args.workspace,
        dataset=grouped_exposures,
        type="exposure",
        name="exposure",
        slots=["mscope_params"],
        passthrough=(args.job,CS_JOB_OUTPUT_TYPE),
        title="ImageShift Updated Exposures"
        )

ccgauvin94 · July 30, 2025, 2:30pm

Hi Folks,

I’d recommend using this script to just convert them into XML files, and then import the XML files natively into CryoSPARC:

github.com

ccgauvin94/cs_jiffies/blob/main/mdoc_xml.py

#!/usr/bin/env python3
import os
import sys


# Help Messages
def help():
    print("Usage: python mdoc_xml.py <command>")
    print("\nCommands:")
    print(" --help, -h      -- Display help")
    print(" --dir,  -d      -- Specify directory")
    print("\nWill default to current directory unless specified.")


# Find MDOC files and add to list
def get_mdocs(directory):
    mdocs = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".mdoc"):

This file has been truncated. show original

It’s simpler, uses only system libraries, and is much faster. I have stopped using my old script in favor of this new one.

ccgauvin94 · July 30, 2025, 2:32pm

And yes, double check whether it should be accepted_exposures or exposures_accepted - kind of annoying that they are different.

stefan · July 30, 2025, 5:51pm

ccgauvin94:


# Help Messages
def help():
    print("Usage: python mdoc_xml.py <command>")
    print("\nCommands:")
    print(" --help, -h      -- Display help")
    print(" --dir,  -d      -- Specify director

Thank you @ccgauvin94 !