Local motion correction job that does not finish

Hi everyone!

Has anyone ever encountered a problem with local motion correction where the job seems to output all the extracted particle files, but never outputs the metadata file and is never marked as finished?

I have a local motion correction job that has been running for over 70 hours, and in the last 12 hours it has not output any new files (and I am quite sure it has been through all the micrographs). If I kill the job, I know I won’t get the metadata file needed for the follow-up jobs… how can I rescue this situation?

The CryoSPARC version is v4.1.2. If I had the opportunity, I would upgrade to the latest version, but that isn’t an option at the moment.

Thank you,
André

@AndreGraca Would you like to

  1. email us the job report for this Local Motion Correction job.
  2. confirm that on the worker, two processes corresponding to this job exist. For example,
    ps axww | grep 'project P'
    

Thank you for the answer!

Job report emailed.

Yes, two processes corresponding to this job exist:

ps axww | grep 'project P79'
6720 ? S 0:00 bash /home/ldbuser/Software/cryosparc/cryosparc_worker/bin/cryosparcw run --project P79 --job J256 --master_hostname wks-hajen --master_command_core_port 39002
6745 ? Sl 6:27 python -c import cryosparc_compute.run as run; run.run() --project P79 --job J256 --master_hostname wks-hajen --master_command_core_port 39002
6746 ? Sl 6:16 python -c import cryosparc_compute.run as run; run.run() --project P79 --job J256 --master_hostname wks-hajen --master_command_core_port 39002
54283 pts/18 S+ 0:00 grep --color=auto project P79

Could it be that the master computer is “struggling”?
What are the outputs of these commands, run on the CryoSPARC master host:

free -g
ps -eo pid,ppid,start,vsz,rsz,pmem,pcpu,cmd | grep cryosparc


Here you go!

free -g
              total        used        free      shared  buff/cache   available
Mem:          376          12           4           5         359         355
Swap:            49           0          49

ps -eo pid,ppid,start,vsz,rsz,pmem,pcpu,cmd | grep cryosparc
 6720 71296   Nov 05  27240  4784  0.0  0.0 bash /home/ldbuser/Software/cryosparc/cryosparc_worker/bin/cryosparcw run --project P79 --job J256 --master_hostname wks-hajen --master_command_core_port 39002
 6745  6720   Nov 05 13594480 212488  0.0 0.1 python -c import cryosparc_compute.run as run; run.run() --project P79 --job J256 --master_hostname wks-hajen --master_command_core_port 39002
 6746  6745   Nov 05 2923140 1690576  0.4 0.1 python -c import cryosparc_compute.run as run; run.run() --project P79 --job J256 --master_hostname wks-hajen --master_command_core_port 39002
48297 53973 21:45:15  15756   936  0.0  0.0 grep --color=auto cryosparc
71055  2718   Oct 28  52312 22220  0.0  0.1 python /home/ldbuser/Software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /home/ldbuser/Software/cryosparc/cryosparc_master/supervisord.conf
71173 71055   Oct 28 5384708 3666852  0.9 5.5 mongod --auth --dbpath /home/ldbuser/Software/cryosparc/cryosparc_database --port 39001 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4 --bind_ip_all
71296 71055   Oct 28 1852976 962848  0.2 3.4 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
71341 71055   Oct 28 1595580 324444  0.0 0.1 python -c import cryosparc_command.command_vis as serv; serv.start(port=39003)
71392 71055   Oct 28 952376 260732  0.0 2.1 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39005)
71470 71055   Oct 28 1606172 587376  0.1 1.6 /home/ldbuser/Software/cryosparc/cryosparc_master/cryosparc_app/api/nodejs/bin/node ./bundle/main.js

We are not sure what is going on. A lot of memory is allocated to buff/cache. You could try

sudo sh -c "sync; echo 3 >/proc/sys/vm/drop_caches"

and see if this allows the job to progress.
If this does not help, could you please post a screenshot of the last few lines of the job’s Event Log.

I wish I could do that and much more on this workstation, but sadly I do not have sudo privileges on it =(

To post a screenshot of the last few lines of the job’s Event Log, I would have to scroll down for a few minutes on a local motion correction job, is that right?

I contacted the admin to get privileges to run the command you suggested, @wtempel, but before he could attend to my request the job failed, and I think that was because we ran out of storage on our shared network file storage unit… I guess the extracted particles cannot be used in follow-up jobs without the metadata file?.. :confused:

Yes please. This information may help us in answering your earlier question.

[Edited for clarity]

After 20 minutes of scrolling I got there. I made the timestamps visible. It would be great if the screenshot revealed something, but I think it is quite uneventful.

In your last reply you quoted me twice, but it seems you missed contextualizing the last quote.

We wanted to see if the end of the event log would suggest a way of making the existing output of the failed job usable. I am waiting to hear back from our team.


Hi @AndreGraca, you can use a script like this to recover a failed local motion correction job:
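The overall shape is roughly the following (a minimal sketch, not the full script: it assumes the cryosparc-tools Python API with CryoSPARC, find_project, load_output, and save_external_result; all UIDs and credentials are placeholders; and the step that matches the picks to the particle stacks the failed job already wrote is only indicated in comments):

# recover_local_motion.py: minimal sketch of the recovery approach
from cryosparc.tools import CryoSPARC

# Connect to the CryoSPARC master (substitute your own values)
cs = CryoSPARC(
    license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    host="localhost",          # CryoSPARC master hostname
    base_port=39000,           # master base port
    email="user@example.com",
    password="password",
)

project_uid = "P79"            # project containing the failed job
failed_job_uid = "J256"        # the stalled Local Motion Correction job
parent_picker_job = "J1"       # job that provided the input particle picks (placeholder)
workspace_uid = "W1"           # workspace to place the recovered output in

project = cs.find_project(project_uid)
failed_job = project.find_job(failed_job_uid)
print("Recovering particles written under", failed_job.dir())

# Start from the particles that went into local motion correction
particles = project.find_job(parent_picker_job).load_output("particles")

# ... the full script then walks the particle stack files the failed job
# already wrote into its job directory and fills in the corresponding blob
# fields (blob/path, blob/idx, ...), using zero_shift_frame to define the
# reference frame for the trajectories ...

# Save the repaired dataset as the output of a new External job; the job is
# created in the completed state, so downstream jobs can use its particles
project.save_external_result(
    workspace_uid,
    particles,
    type="particle",
    name="particles",
    title="Recovered local motion particles",
)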

When installing cryosparc-tools into a Python environment, use the version that corresponds to the CryoSPARC version. For example, if running CryoSPARC v4.1.2, install a tools version >=4.1.0,<4.2.0:

pip install cryosparc-tools~=4.1.0

There are also some substitutions in the script that you have to make, specific to your CryoSPARC installation/project.

Once you have everything set up, you can run it from the command-line like this:

python3 recover_local_motion.py

Note that I’ve only done some rudimentary testing for this; there may be something else wrong with the job that I have not accounted for, which could prevent the script from working. You may modify it as required.


Very grateful for your help @nfrasser!

I have been struggling a bit to put this into practice, but I have overcome parts of it.

I have a few questions, though:

  1. First I assumed that parent_picker_job was the immediate parent job of the local motion correction job, but is it actually the upstream particle picking job? Just so you are aware, I am not local-motion-correcting particles coming straight from a picker job (but I think you would have guessed that, as people normally clean up the particle stack well before going to local motion correction). In fact, between the particle picking job and the local motion correction job I also curated exposures and rejected a number of them.
  2. I do not understand “zero_shift_frame = frame_start + (frame_end - frame_start) / 2”. What is it supposed to be, and should I change anything there?
  3. After running the script, what should I expect/do? Will the motion correction job be marked as completed? Do I have to restart CryoSPARC to see the differences? I already ran the script successfully, supplying the particle picking job as “parent_picker_job” and leaving “zero_shift_frame” as is, but I did not see any changes (in the CryoSPARC GUI or in the directory of the local motion job) after the script completed its task. Currently the workstation is running another long job, so I don’t want to try restarting CryoSPARC now.

I may or may not get my answers once I manage to restart CryoSPARC; however, I think other users who run into this problem would like answers to these questions as well.

Thank you for all the help!
André

Hi André, glad to hear you’re getting some results with this. To answer your questions:

  1. Your first assumption was correct: parent_picker_job is the job that provides the input particle picks to local motion correction. This may be a picking job, or it may be a filtered subset of the original picks from a classification job.
  2. zero_shift_frame is the frame that local motion correction considers the “central” frame, i.e., the frame from which particle offsets across frames are calculated. For a movie with n frames, this value defaults to n/2 (see the short worked example after this list).
  3. After running the script, you should see a completed “External” job in the selected workspace with the motion-corrected particles as outputs. There is no need to restart CryoSPARC. If you do not see this, please post any errors you see in the output of the Python script.
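For instance, with hypothetical values for a 40-frame movie where all frames are used (frame_start and frame_end here mirror the script’s variables):

# Illustration only: the default "central" frame for a 40-frame movie
frame_start, frame_end = 0, 40
zero_shift_frame = int(frame_start + (frame_end - frame_start) / 2)  # cast to a whole frame index
print(zero_shift_frame)  # 20: particle shifts are measured relative to frame 20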

Hope that helps!
