Particle order in refinement matters?

Hi all,

This is more of an observation than anything - basically I’ve noticed that for a large, “dirty” particle stack, aligning on the best particles first seems to make refinement more stable, with higher resolution and fewer overfitting artefacts, than if particles are read in random order.

I do this by first identifying a small, homogeneous particle stack that refines to decent resolution on its own, supplying it as the first particle input to a new refinement with the second input being a superset (e.g. the full particle stack before classification), and then disabling randomization of particles (ignore the absurdly fine sampling & large box - I was testing something unrelated):

Without randomization:




With randomization:




In this case, the “good” subset is 155k particles and goes to 3.1 Å with a cFAR of 0.7; the 800k stack is the raw extracted stack, with no classification (straight after blob picking). This is from this dataset (https://www.biorxiv.org/content/10.1101/2024.05.28.596281v1) - ~55 kDa ordered mass.

If this genuinely gives better alignments for the “bad” particles as well as the good ones, it may facilitate subsequent analysis by 3D classification, particle subtraction, 3D-VA and CryoDRGN, which depend on the quality of the input poses. Would be keen to hear thoughts/discussion/further testing.
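
For concreteness, here is a minimal sketch (cryosparc-tools, placeholder project/job UIDs) of what the resulting particle order amounts to - it only checks which rows of the full stack belong to the good subset and reports the “good first, then the rest” order, rather than setting up the refinement itself:

import json
from pathlib import Path

import numpy as np
from cryosparc.tools import CryoSPARC

# Connect the same way as the plotting script further down the thread (~/instance_info.json)
cs = CryoSPARC(**json.load(Path("~/instance_info.json").expanduser().open()))
proj = cs.find_project("P1")                            # placeholder project UID

good = proj.find_job("J100").load_output("particles")   # e.g. the 155k "good" subset
full = proj.find_job("J10").load_output("particles")    # e.g. the 800k raw stack

# Which rows of the full stack belong to the good subset (matched by particle uid)
in_good = np.isin(full["uid"], good["uid"])
n_good = int(in_good.sum())

# Effective read order when the good subset is the first particle input, the
# superset is the second, and randomization is disabled: good rows first, then the rest
order = np.concatenate([np.flatnonzero(in_good), np.flatnonzero(~in_good)])

print(f"{n_good:,} of {len(order):,} particles are in the good subset; "
      f"they occupy the first {n_good:,} positions of the non-randomized order")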

Cheers
Oli

4 Likes

Makes sense, and aligns with my own observations. It’s very dependent on the quality of the stack. I’ll see if I can sort some examples out - when it happens I usually just reset the refinement and/or do some more cleaning, so I haven’t got any examples on hand.

1 Like

I guess it makes sense, as the initial iterations don’t use all of the particles. In that case, visiting all the particles each iteration (as RELION does), using a larger initial batch, or taking a shortcut to intermediate resolution with a higher-resolution reference would weaken the effect.

1 Like

Yes - I am testing with batching disabled, so it uses the full set from the beginning - will report.

My gut feeling is that using a smaller, cleaner batch for the initial alignments might still be better - resulting in a clean ~4 Å map before then attempting to align the full set - but we’ll see!

EDIT: Switching off batches gives a worse refinement and map than the one in which the good particles are processed first:

While investigating this I wrote (with the assistance of ChatGPT) a cryosparc-tools script to plot the alignments3D/shift values (mostly to get an idea of whether using them for recentering would be helpful).

Posting it here in case it is useful for anyone else (or to future-me!).

Script:

#!/usr/bin/env python3
"""
plot_shifts_external.py  –  raw-vs-converted shift histograms (constant psize)

Positional UIDs (any order, all required)
    Pxx   project
    Wxx   workspace
    Jxx   refinement job

Optional flag
    --bins  INT   Histogram bin count (default 50)
"""
import argparse, json
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from cryosparc.tools import CryoSPARC


# ───────── helpers ─────────
def connect():
    cfg = json.load(Path("~/instance_info.json").expanduser().open())
    cs = CryoSPARC(**cfg)
    assert cs.test_connection()
    return cs


def classify(tokens):
    out = {"P": None, "W": None, "J": None}
    for tok in tokens:
        tag = tok[0].upper()
        if tag not in out:
            raise RuntimeError(f"Bad UID '{tok}' (must start P/W/J)")
        if out[tag]:
            raise RuntimeError(f"Duplicate {tag} UID ('{out[tag]}' & '{tok}')")
        out[tag] = tok
    missing = [k for k, v in out.items() if v is None]
    if missing:
        raise RuntimeError("Missing UID(s): " + ", ".join(missing))
    return out["P"], out["W"], out["J"]


def pixel_size(parts, job):
    cols = [
        "alignments3D/psize_A",
        "microscope_parameters/psize_A",
        "movies/psize_A",
        "image_original_pixelsize_A",
        "psize_A",
    ]
    for col in cols:
        if col in parts.fields():
            return float(parts[col][0]), f"particle column '{col}'"
    return float(job.load_output("volume")["map/psize_A"][0]), "refinement volume 'map/psize_A'"


def shift_px(parts):
    if "alignments3D/shift_px" in parts.fields():
        vec = parts["alignments3D/shift_px"]
    elif "alignments3D/shift" in parts.fields():
        vec = parts["alignments3D/shift"]     # stored in px
    else:
        raise RuntimeError("No shift vector found.")
    return np.linalg.norm(vec, axis=1)


def plot_hist(data, unit, title, bins):
    fig = plt.figure()
    plt.hist(data, bins=bins)
    plt.xlabel(f"Shift magnitude ({unit})")
    plt.ylabel("Number of particles")
    plt.title(title)
    plt.tight_layout()
    return fig


# ───────── CLI ─────────
cli = argparse.ArgumentParser(description=__doc__,
                              formatter_class=argparse.RawDescriptionHelpFormatter)
cli.add_argument("uids", nargs="+", help="Three UIDs (P… W… J…) in any order")
cli.add_argument("--bins", type=int, default=50, help="Histogram bins")
args = cli.parse_args()

PUID, WUID, JUID = classify(args.uids)

# ───────── CryoSPARC data ─────────
cs     = connect()
proj   = cs.find_project(PUID)
job    = proj.find_job(JUID)
parts  = job.load_output("particles")

shift_px_vec       = shift_px(parts)
apix, psrc         = pixel_size(parts, job)          # constant by assumption
shift_A_vec        = shift_px_vec * apix

# ───────── External job ─────────
ext = proj.create_external_job(workspace_uid=WUID,
                               title=f"Shift histograms for {JUID}")
ext.add_input(type="particle", name="input_particles")
ext.connect("input_particles", JUID, "particles")

with ext.run():
    # concise pixel-size info
    ext.log(f"Pixel size used: {apix:.4f} Å/px  (source: {psrc})")

    # raw pixels
    ext.log_plot(
        plot_hist(shift_px_vec, "px",
                  f"{PUID}/{JUID} – Shift magnitude (raw pixels)", args.bins),
        text=(f"Raw pixel shifts; N={len(shift_px_vec):,}, "
              f"mean={np.mean(shift_px_vec):.2f} px, "
              f"σ={np.std(shift_px_vec):.2f} px"),
        formats=["png", "pdf"]
    )

    # converted Å
    ext.log_plot(
        plot_hist(shift_A_vec, "Å",
                  f"{PUID}/{JUID} – Shift magnitude (Å)", args.bins),
        text=(f"Converted shifts (Å); N={len(shift_A_vec):,}, "
              f"mean={np.mean(shift_A_vec):.2f} Å, "
              f"σ={np.std(shift_A_vec):.2f} Å"),
        formats=["png", "pdf"]
    )

    ext.log("Generated by plot_shifts_external.py (raw + Å)")

print(f"External job created: {ext.uid}")

Note: instance_info.json needs to be in the home directory; format:
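
Something like this should work - the keys just need to match the CryoSPARC(**cfg) call in connect(), with the license, credentials and port adjusted for your instance:

{
    "license": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "host": "localhost",
    "base_port": 39000,
    "email": "user@example.com",
    "password": "password"
}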

Output:

1 Like

Hi @olibclarke, thanks for sharing this finding - this is an interesting observation!

We wondered if you could try running the refinement of the “dirty” particles (with randomisation) but using the volume from the “good” subset with an initial lowpass filter of something like ~6 Å, to see if this achieves a similar result to the one you got by putting the “good” particles first and turning off randomisation?
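
In case it is useful to queue that comparison programmatically, here is a rough cryosparc-tools sketch - the job type string ("nonuniform_refine_new"), the input/output names and the "refine_res_init" key for the initial lowpass are assumptions that may differ between CryoSPARC versions, and all UIDs are placeholders:

import json
from pathlib import Path
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(**json.load(Path("~/instance_info.json").expanduser().open()))
proj = cs.find_project("P1")                       # placeholder project UID

job = proj.create_job(
    "W1",                                          # placeholder workspace UID
    "nonuniform_refine_new",                       # assumed NU-refinement job type string
    connections={
        "particles": ("J10", "particles"),         # the full "dirty" stack, randomisation left on
        "volume": ("J100", "volume"),              # the map refined from the "good" subset
    },
    params={"refine_res_init": 6},                 # assumed key for the initial lowpass (Å)
    title="Dirty stack vs good-subset reference, ~6 A initial lowpass",
)
job.queue()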

1 Like

Thanks Hannah - I’ll try! All these were with a 15Å initial lowpass.

1 Like

Hi @hbridges1,

Using a higher-resolution initial lowpass (without batching) is still worse than aligning the good particles first:


The refinement is still running, but you can clearly see overfitting streaks that are not present when good particles are aligned to the reference first, and more anisotropy in the map.

EDIT: after completion, same story:


Thanks @olibclarke for reporting back! This is intriguing - we will try out your idea and see if we can replicate the same phenomenon on other datasets!

1 Like

Seeing this utility, I would also like the option to disable randomization during heterogeneous classification, so that a quality subset of particles can “seed” the refinement (along with the reference bias) and establish the Euler distribution when it is also applied to many millions of particles. Which reminds me that a “streaming heterogeneous refinement” would be extremely useful as well: Streaming heterogeneous refinement (feature request)
The references that work well for 1 million particles generally work just as well for 10 million, or for 10 batches of 1 million. It would be great to link the streaming 2D particles (maybe even just those which are selected) into a continuing heterogeneous refinement so I don’t have to export and run particle sets over and over (or wait until collection is complete to start het refine of multiple millions).