3D classification - output particles from intermediate iteration in failed job

Dear all,
I have a 3D classification job that, after a few days, completes its final iteration but then terminates abnormally: it is labelled as failed and generates no outputs.

[CPU: 102.73 GB] Finished iteration 7273 in 47171.310s.
[CPU: 11.1 MB] ====== Job process terminated abnormally.

Is there any way to output the particles for the different classes from the last iteration?

I have tried using the option “Mark as complete” but the only thing that I get is:
[CPU: 63.8 MB] Finalizing Job…

[CPU: 63.9 MB] Passing through outputs for output group particles_all_classes from input group particles

[CPU: 63.9 MB] This job outputted results [‘alignments_class3D_0’, ‘alignments_class3D_1’, ‘alignments_class3D_2’, ‘alignments_class3D_3’, ‘alignments_class3D_4’, ‘alignments_class3D_5’, ‘alignments_class3D_6’, ‘alignments_class3D_7’, ‘alignments_class3D_8’, ‘alignments_class3D_9’, ‘alignments_class3D_10’, ‘alignments_class3D_11’, ‘alignments_class3D_12’, ‘alignments_class3D_13’, ‘alignments_class3D_14’]
[CPU: 63.9 MB] Loaded output dset with 0 items

with no outputs in the end.

Any advice/tips would be very much appreciated.
Thanks a lot for your help!

Welcome to the forum @rcastellsg.

We would like to see additional lines from this log. I will send you a direct message with details.


Great, thanks a lot!

Regarding this, I was also wondering if it is possible to “continue” a 3D classification cryosparc job from a specific iteration after it has failed.

Following. This happens too often, and I’m fairly certain the answers to your questions are No and No. Continuation from checkpoints is such a nice part of RELION, whether for recovering from failed jobs or for harnessing the benefit of changing parameters halfway through a job, but it is entirely absent in cryoSPARC. Its value is akin to 3D Variability Analysis with the 3DVA Display job: it’s so great to quickly generate different display types of the results without redoing the costly analysis. That’s about the only cryoSPARC example I can think of, though.


@rcastellsg @CryoEM1 We agree that resumption from checkpoints can be useful, and this facility is on our radar. You may appreciate that its implementation across numerous job types is a complex undertaking.


Yes, of course. Very exciting to hear it’s under consideration.


Thanks a lot for the replies @CryoEM1 and @wtempel !

Only now did I notice how much RAM is being used. This prompts some additional questions:

  1. What is the total amount of RAM on that computer (free -g) and
  2. the GPU model (nvidia-smi)?
  3. Were there other concurrent workloads?
  4. What is the particle box size?
  5. How many particles were there?
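
The first two can be checked from a terminal on the workstation; the nvidia-smi query flags below are one common way to pull just the model and memory (guarded in case the NVIDIA driver is not on the PATH):

```shell
# Total system RAM; -g reports in gibibytes
free -g

# GPU model and total memory (requires the NVIDIA driver)
command -v nvidia-smi >/dev/null && \
  nvidia-smi --query-gpu=name,memory.total --format=csv || \
  echo "nvidia-smi not found"
```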

Hi Wolfram,

Regarding your questions:

1. What is the total amount of RAM on that computer (free -g)? → 128 GB
2. The GPU model (nvidia-smi)? → TITAN RTX with 24 GB of memory
3. Were there other concurrent workloads? → No other concurrent workloads
4. What is the particle box size? → 288 and 96
5. How many particles were there? → ~21,800,000
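
As a rough back-of-envelope (my own arithmetic, not a figure reported by cryoSPARC), even the smaller box implies a very large particle stack:

```python
# Rough size of the raw particle stack at the smaller box size,
# assuming float32 pixels and one image per particle.
n_particles = 21_800_000
box = 96
bytes_per_pixel = 4  # float32

total_gb = n_particles * box * box * bytes_per_pixel / 1e9
print(f"~{total_gb:.0f} GB of raw particle images")  # ~804 GB
```

At box 288 the same arithmetic gives roughly nine times that, so any stage that touches all particles at once is under heavy memory pressure.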

Thanks!

22 million! NOW we’re talking.


@rcastellsg we talked this over internally! With such a large dataset, the job is most likely running out of memory during the final output stages (the largest dataset we’ve used for one classification job, or any job for that matter, is around 2 million particles). A couple of suggestions:

  • Perhaps you can use the Particle sets tool job to select a random subset of 1-2 million particles, run the 3D class job on that subset and see if you can identify salient heterogeneity.
  • If you see heterogeneity in that random subset, then perhaps you can chunk your dataset into (contiguous) subsets of 1-2 million particles each, classify each and then combine particle sets by manual inspection.
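
The splitting described above can be sketched on particle indices like this (a conceptual illustration using the counts from this thread, not the Particle Sets Tool’s actual implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n_total = 21_800_000   # full particle count from this thread
subset = 1_500_000     # target subset size, within the suggested 1-2 M range

# Random subset: conceptually what "split randomly" into a smaller set does
idx = rng.choice(n_total, size=subset, replace=False)

# Contiguous chunks of ~1.5 M particles each, for per-chunk classification
n_chunks = int(np.ceil(n_total / subset))
chunks = [np.arange(i * subset, min((i + 1) * subset, n_total))
          for i in range(n_chunks)]
```

In cryoSPARC itself you would do both operations with the Particle Sets Tool rather than in Python; the sketch just makes the bookkeeping explicit.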

For the 3D class job itself, you can also try:

  • reducing O-EM epochs to 2-3 (you likely won’t need as many runs through the dataset with so many particles)
  • turning on Output data after every full iter so that you can inspect outputs after every full batch EM in case something goes wrong with the final job output
  • adjusting the learning rate and full EM iterations so that the average class ESS is close to 1 at the end of the run – you can try using the settings in this post.
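
For reference, a common definition of per-particle class ESS is the inverse sum of squared class posteriors; I’m assuming this matches what the 3D class job reports, so treat it as illustrative:

```python
import numpy as np

def average_class_ess(posteriors):
    """Average effective sample size of per-particle class posteriors.

    posteriors: array of shape (n_particles, n_classes), rows summing to 1.
    ESS = 1 means every particle is confidently assigned to a single class;
    ESS = n_classes means posteriors are uniform (no class separation).
    """
    ess = 1.0 / np.sum(posteriors ** 2, axis=1)
    return float(ess.mean())

# One-hot (confident) assignments over 4 classes -> average ESS of 1
sharp = np.eye(4)[np.random.default_rng(0).integers(0, 4, size=1000)]
# Uniform posteriors over 4 classes -> average ESS of 4
flat = np.full((1000, 4), 0.25)
```

So an average class ESS near 1 at the end of the run indicates the classification has converged to confident per-particle assignments.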

Valentin


@vperetroukhin thanks a lot for looking into this and for the additional advice/information!

For the O-EM learning rate init, if we change it from 0.1 to 0.75, how does that affect the job?