Reference based motion correction error ====== Job process terminated abnormally

Dmitry · January 18, 2024, 7:30am

Dear colleagues,

Regardless of the settings, Reference Based Motion Correction stops every time at the same step with the following error:

[CPU: 91.7 MB Avail: 506.62 GB]
====== Job process terminated abnormally.

Is there any solution to fix that?

Thank you.

Kind regards,
Dmitry

rbs_sci · January 18, 2024, 7:38am

Do you have 512GB RAM? That free memory makes me think it ran out of system RAM. Check dmesg to see whether the kernel freaked out and the OOM reaper kicked in…
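A minimal sketch of that check, assuming the kernel log has already been captured as text (for example by redirecting the output of `sudo dmesg` to a file). The function name `find_oom_lines` and the list of phrases are illustrative assumptions, not part of CryoSPARC, and the phrase list may not be exhaustive:

```python
import re

# Phrases the Linux kernel commonly logs when the OOM killer acts.
# This list is an assumption for illustration and may miss some variants.
OOM_PATTERN = re.compile(
    r"out of memory|oom-killer|oom_reaper|killed process", re.IGNORECASE
)

def find_oom_lines(log_text):
    """Return the lines of a kernel log that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]

# Made-up sample text, only to show the call; real input would be dmesg output.
sample = (
    "[1.0] usb 1-1: new high-speed USB device\n"
    "[2.0] python invoked oom-killer: gfp_mask=0x0\n"
    "[3.0] Out of memory: Killed process 123 (python)\n"
)
for hit in find_oom_lines(sample):
    print(hit)
```

An empty result only means none of these phrases appeared in the captured text; it does not rule out memory pressure by itself.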

Dmitry · January 18, 2024, 11:04am

Hello @rbs_sci ,

Thank you for your answer.

Yes, that is the RAM I have. Isn’t that enough? What is your opinion?

And what can I do now with this protocol to make it run?

I thought that having the memory restriction in the c would work well. But that is not the case.

What I mean is that these settings at the bottom of the

Thank you.

Kind regards,
Dmitry

wtempel · January 18, 2024, 2:43pm

@Dmitry Do you have additional error messages in the job log (under Metadata|Log)?

Dmitry · January 18, 2024, 3:39pm

hello @wtempel ,

Not really. I keep getting the following continuous message:

Kind regards,
Dmitry

wtempel · January 18, 2024, 4:19pm

@Dmitry What are the lines at the very end of the job log?
Please can you also check for indicators that the system has run out of memory, like:

What is the output of the command

cryosparcm icli # enter the CryoSPARC interactive cli
puid, juid = 'P5', 'J64' # substitute actual project and job uids
cli.get_job(puid, juid, 'params_spec', 'heartbeat_at', 'killed_at', 'failed_at')
exit()

[edited to clarify the need for icli]

Dmitry · January 18, 2024, 4:43pm

hello @wtempel ,

Sorry for my confusion.

I cannot open the messages in the job log (under Metadata | Log). It crashes every time I click on Log, giving the following window:

As for the other commands, they seem to provide no further information in my case.

dmitry@cryoem1:~$ sudo dmesg | grep -i OOM

dmitry@cryoem1:~$ sudo journalctl | grep -i OOM

Dmitry · January 18, 2024, 4:53pm

I have also opened the log file externally (~250 MB), and here are the last lines:
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
========= main process now complete at 2024-01-17 19:52:23.319551.
========= monitor process now complete at 2024-01-17 19:52:24.701361.

Dmitry · January 18, 2024, 5:36pm

About the command: is it complete?

I tried it as shown below and got a syntax error:

puid, juid = 'P2', 'J139'
cli.get_job(puid, juid, 'params_spec', 'heartbeat_at', 'killed_at', 'failed_at')

Kind regards,
Dmitry

wtempel · January 18, 2024, 7:18pm

It was not; I apologize. The commands were intended to be run from within the interactive icli, which you can access with the command
cryosparcm icli

Dmitry · January 18, 2024, 9:50pm

hello @wtempel ,

here is the result.

In [2]: puid, juid = 'P2', 'J139'
   ...: cli.get_job(puid, juid, 'params_spec', 'heartbeat_at', 'killed_at', 'failed_at')
   ...:
Out[2]:
{'_id': '65a852521e3a4e5ed243eb02',
 'failed_at': 'Thu, 18 Jan 2024 05:42:24 GMT',
 'heartbeat_at': 'Thu, 18 Jan 2024 05:42:23 GMT',
 'killed_at': None,
 'params_spec': {'compute_num_gpus': {'value': 4},
                 'hyparam_search_thoroughness': {'value': 'Extensive'}},
 'project_uid': 'P2',
 'uid': 'J139'}
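For reading the timestamps in this output, here is a small sketch that computes the gap between the last heartbeat and the failure time. It uses only the Python standard library; the helper name `seconds_between` is an illustrative assumption, and the dictionary simply copies the two values shown above:

```python
from email.utils import parsedate_to_datetime

def seconds_between(earlier, later):
    """Seconds elapsed between two timestamps like 'Thu, 18 Jan 2024 05:42:23 GMT'."""
    delta = parsedate_to_datetime(later) - parsedate_to_datetime(earlier)
    return delta.total_seconds()

# Values copied from the get_job output in this thread.
job = {
    "heartbeat_at": "Thu, 18 Jan 2024 05:42:23 GMT",
    "failed_at": "Thu, 18 Jan 2024 05:42:24 GMT",
}
gap = seconds_between(job["heartbeat_at"], job["failed_at"])
print(gap)
```

Here the job was marked as failed one second after its last recorded heartbeat.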

Kind regards,
Dmitry

rbs_sci · January 19, 2024, 12:25am

The first thing I’d test is running on a single GPU, with the oversubscription threshold set above the GPU’s VRAM (e.g. on a 16GB card, leave it at 20; on a 24GB card, set it to 30, etc.) and the in-memory cache set to 300.

RBMC system RAM usage oscillates a bit based on particles on each micrograph, so it’s possible one micrograph needs >80GB of system RAM… EER data can be particularly heavy if upsampling 2 (I’ve not tried 16K in CryoSPARC…)
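To give a feel for why upsampled EER data gets heavy, here is a rough arithmetic sketch. It assumes square frames of 4096 pixels per side at upsampling 1 (8192 at upsampling 2, 16384 at upsampling 4), 4 bytes per pixel, and a hypothetical 40 fractions per movie. How CryoSPARC actually buffers frames is not described in this thread, so this only shows how raw frame size scales:

```python
def frame_stack_gib(side_px, n_frames, bytes_per_px=4):
    """Size in GiB of n_frames square frames with side_px pixels per side."""
    return side_px * side_px * bytes_per_px * n_frames / 2**30

# (upsampling factor, frame side in pixels); 40 fractions is a made-up example.
for upsampling, side in [(1, 4096), (2, 8192), (4, 16384)]:
    size = frame_stack_gib(side, n_frames=40)
    print(f"upsampling {upsampling}: {side} px, {size:.1f} GiB per movie")
```

Each doubling of the upsampling factor quadruples the memory needed for the same number of fractions.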

Dmitry · January 19, 2024, 11:08am

hello @rbs_sci ,

Thank you for your answer.

Set accordingly:

I will update on the results.

Kind regards,
Dmitry

P.S.
Usually this protocol dies at the 19th checkpoint, after becoming too slow. I believe we will have results today.

rbs_sci · January 19, 2024, 12:02pm

I hope so! Good luck!

Dmitry · January 22, 2024, 3:22pm

hello @wtempel and @rbs_sci,

I am still having the same error. Please see the summary below:

Issue summary:

1. Reference Based Motion Correction always stops at checkpoint 19.
2. With 1 GPU, 30 VRAM / 300 RAM, I got an error at checkpoint 21 about duplicated particles (just as described here: Crash of Reference Based Motion Correction - Motion Correction - CryoSPARC Discuss).
3. After fixing that error by running Remove Duplicate Particles, the stop error from 1 returns.

So currently the error remains the same.

I am now testing whether the refinement could have caused the issue.

But apart from that I have no idea for now.

Any tips?

Thank you.

Kind regards,
Dmitry

rbs_sci · January 22, 2024, 10:38pm

Hi @Dmitry, that seems very odd…! Sorry, it really sounded like it was running out of memory…

Let me just check that step 3 was run with the same conditions as step 2 (1GPU 30 VRAM, 300 RAM)? It seems odd to me that increasing the hardware parameters changes the crash (and makes it later) but upon fixing the duplicate issue it crashes back where it used to once again…

Does dmesg have anything which might shed further light on the situation?

This is crashing during the hyperparameter search…? Can you try fewer particles (say, 5,000)? Even try 1,000 just to see if it can complete successfully (although don’t use that parameter set if it does work)…

Part of the issue is the TIFFRead spam in the log, which makes it really hard to track anything else (RELION has the same issue by default when motion correcting/polishing EER because libtiff doesn’t understand the EER headers)… if you cat [/path/to/logfile] | grep -v TIFFReadDirectory > ~/tempLog.txt it will filter out all the pointless TIFFRead warnings and make it a little easier to see what’s going on (if it has any information)…
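The same filtering can be sketched in Python, which may be handy when the log is large (here ~250 MB) because it streams line by line. The function names `filter_log` and `filter_log_file` are illustrative, and the paths in the usage note are placeholders:

```python
def filter_log(lines, noise="TIFFReadDirectory"):
    """Yield only the lines that do not contain the noise marker."""
    for line in lines:
        if noise not in line:
            yield line

def filter_log_file(src_path, dst_path, noise="TIFFReadDirectory"):
    """Copy src_path to dst_path, skipping noisy lines; return the number kept."""
    kept = 0
    # errors="replace" keeps going if the log contains undecodable bytes.
    with open(src_path, "r", errors="replace") as src, open(dst_path, "w") as dst:
        for line in filter_log(src, noise):
            dst.write(line)
            kept += 1
    return kept
```

Usage would look like `filter_log_file("/path/to/logfile", "tempLog.txt")`, with the first argument replaced by the real job log path.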

Dmitry · January 23, 2024, 1:33pm

hello @rbs_sci,

I will give it a try and report back.

Many thanks.

Regards,
Dmitry

Dmitry · January 27, 2024, 5:20pm

Hello @rbs_sci and @wtempel

I tried to re-run Reference Based Motion Correction using just a few images as both the training and the run input.

The results are always the same – the protocol stops at checkpoint #19 regardless of the settings.

The first attempt

The second attempt

The third attempt

– @wtempel and @hsnyder, do you have a test dataset on which I can run Reference Based Motion Correction to see what is wrong?
– Can it be some issue with the CS installation?
– Can wrong initial input parameters cause such an issue? What about the total dose?

Additionally, when I try to download the report from each failed job (Reference Based Motion Correction),
I get the following error

But in the end, the report is downloaded anyway.
Let me know if I can send the reports to you.

Kind regards,
Dmitry

wtempel · January 29, 2024, 12:09am

For testing, you may run Reference Based Motion Correction with outputs from Extensive Validation with the T20s subset:

“Movies” from the validation’s Patch Motion Correction
particles and volume from the validation’s Homogeneous Refinement

Is your CryoSPARC master computer also acting as a GPU worker for the aforementioned motion correction jobs? Was a job running when you attempted to download the report?

Dmitry · January 30, 2024, 6:55am

hello @wtempel ,

you said: “Is your CryoSPARC master computer also acting as a GPU worker for the aforementioned motion correction jobs? Was a job running when you attempted to download the report?”

I am not sure. How can I check that?
I tried to reproduce the download error, both right after starting Reference Based Motion Correction and when CS and the system were not running any protocol. Currently the report downloads without an issue.

About the test: I will run it and report back.

Thank you.

Kind regards,
Dmitry