3D classification - output particles from intermediate iteration in failed job

Dear all,
I have a 3D classification job that, after a few days, completes its final iteration but then terminates abnormally: it is labelled as failed and generates no outputs.

[CPU: 102.73 GB] Finished iteration 7273 in 47171.310s.
[CPU: 11.1 MB] ====== Job process terminated abnormally.

Is there any way to output the particles for the different classes from the last iteration?

I have tried using the option “Mark as complete” but the only thing that I get is:
[CPU: 63.8 MB] Finalizing Job…

[CPU: 63.9 MB] Passing through outputs for output group particles_all_classes from input group particles

[CPU: 63.9 MB] This job outputted results [‘alignments_class3D_0’, ‘alignments_class3D_1’, ‘alignments_class3D_2’, ‘alignments_class3D_3’, ‘alignments_class3D_4’, ‘alignments_class3D_5’, ‘alignments_class3D_6’, ‘alignments_class3D_7’, ‘alignments_class3D_8’, ‘alignments_class3D_9’, ‘alignments_class3D_10’, ‘alignments_class3D_11’, ‘alignments_class3D_12’, ‘alignments_class3D_13’, ‘alignments_class3D_14’]
[CPU: 63.9 MB] Loaded output dset with 0 items

with no outputs in the end.

Any advice/tips would be very much appreciated.
Thanks a lot for your help!

Welcome to the forum @rcastellsg.

We would like to see additional lines from this log. I will send you a direct message with details.


Great, thanks a lot!

Regarding this, I was also wondering if it is possible to “continue” a 3D classification cryosparc job from a specific iteration after it has failed.

Following. This happens too often, and I’m fairly certain the answers to your questions are No and No. Continuation from checkpoints is such a nice part of RELION, whether for recovering from failed jobs or for harnessing the benefit of changing parameters halfway through a job, but it is entirely absent in cryoSPARC. Its value is akin to 3D Variability Analysis with the 3DVA Display job: it’s so great to quickly generate different display types of the results without redoing the costly analysis. That’s about the only cryoSPARC example I can think of, though.


@rcastellsg @CryoEM1 We agree that resumption from checkpoints can be useful, and this facility is on our radar. You may appreciate that its implementation across numerous job types is a complex undertaking.


Yes, of course. Very exciting to hear it’s under consideration.


Thanks a lot for the replies @CryoEM1 and @wtempel !

Only now did I notice how much RAM is being used. This prompts some additional questions:

  1. What is the total amount of RAM on that computer (free -g) and
  2. the GPU model (nvidia-smi)?
  3. Were there other concurrent workloads?
  4. What is the particle box size?
  5. How many particles were there?
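
The first two can be checked from a terminal on the workstation; the nvidia-smi query flags below are one common way to pull just the model and memory (guarded in case the NVIDIA driver is not on the PATH):

```shell
# Total system RAM; -g reports in gibibytes
free -g

# GPU model and total memory (requires the NVIDIA driver)
command -v nvidia-smi >/dev/null && \
  nvidia-smi --query-gpu=name,memory.total --format=csv || \
  echo "nvidia-smi not found"
```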

Hi Wolfram,

Regarding your questions:

1. What is the total amount of RAM on that computer (free -g)? → 128 GB
2. The GPU model (nvidia-smi)? → TITAN RTX with 24 GB of memory
3. Were there other concurrent workloads? → No other concurrent workloads
4. What is the particle box size? → 288 and 96
5. How many particles were there? → ~21,800,000
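
As a rough back-of-envelope (my own arithmetic, not a figure reported by cryoSPARC), even the smaller box implies a very large particle stack:

```python
# Rough size of the raw particle stack at the smaller box size,
# assuming float32 pixels and one image per particle.
n_particles = 21_800_000
box = 96
bytes_per_pixel = 4  # float32

total_gb = n_particles * box * box * bytes_per_pixel / 1e9
print(f"~{total_gb:.0f} GB of raw particle images")  # ~804 GB
```

At box 288 the same arithmetic gives roughly nine times that, so any stage that touches all particles at once is under heavy memory pressure.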

Thanks!

22 million! NOW we’re talking.


@rcastellsg we talked this over internally! With such a large dataset, the job is most likely running out of memory during the final output stages (the largest dataset we’ve used for one classification job, or any job for that matter, is around 2 million particles). A couple of suggestions:

  • Perhaps you can use the Particle sets tool job to select a random subset of 1-2 million particles, run the 3D class job on that subset and see if you can identify salient heterogeneity.
  • If you see heterogeneity in that random subset, then perhaps you can chunk your dataset into (contiguous) subsets of 1-2 million particles each, classify each and then combine particle sets by manual inspection.
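
The splitting described above can be sketched on particle indices like this (a conceptual illustration using the counts from this thread, not the Particle Sets Tool’s actual implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n_total = 21_800_000   # full particle count from this thread
subset = 1_500_000     # target subset size, within the suggested 1-2 M range

# Random subset: conceptually what "split randomly" into a smaller set does
idx = rng.choice(n_total, size=subset, replace=False)

# Contiguous chunks of ~1.5 M particles each, for per-chunk classification
n_chunks = int(np.ceil(n_total / subset))
chunks = [np.arange(i * subset, min((i + 1) * subset, n_total))
          for i in range(n_chunks)]
```

In cryoSPARC itself you would do both operations with the Particle Sets Tool rather than in Python; the sketch just makes the bookkeeping explicit.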

For the 3D class job itself, you can also try:

  • reducing O-EM epochs to 2-3 (you likely won’t need as many runs through the dataset with so many particles)
  • turning on Output data after every full iter so that you can inspect outputs after every full batch EM in case something goes wrong with the final job output
  • adjusting the learning rate and full EM iterations so that the average class ESS is close to 1 at the end of the run – you can try using the settings in this post.
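
For reference, a common definition of per-particle class ESS is the inverse sum of squared class posteriors; I’m assuming this matches what the 3D class job reports, so treat it as illustrative:

```python
import numpy as np

def average_class_ess(posteriors):
    """Average effective sample size of per-particle class posteriors.

    posteriors: array of shape (n_particles, n_classes), rows summing to 1.
    ESS = 1 means every particle is confidently assigned to a single class;
    ESS = n_classes means posteriors are uniform (no class separation).
    """
    ess = 1.0 / np.sum(posteriors ** 2, axis=1)
    return float(ess.mean())

# One-hot (confident) assignments over 4 classes -> average ESS of 1
sharp = np.eye(4)[np.random.default_rng(0).integers(0, 4, size=1000)]
# Uniform posteriors over 4 classes -> average ESS of 4
flat = np.full((1000, 4), 0.25)
```

So an average class ESS near 1 at the end of the run indicates the classification has converged to confident per-particle assignments.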

Valentin


@vperetroukhin thanks a lot for looking into this and for the additional advice/information!

For the O-EM learning rate init, if we change it from 0.1 to 0.75, how does that affect the job?