Filament Tracer - Job process terminated abnormally

Hi all,

I have been trying to set up Filament Tracer for a project, and so far on the full dataset I always end up with the following message:

[CPU: 598.2 MB] Done in 116.16s
[CPU: 91.5 MB] ====== Job process terminated abnormally.

So before starting, I played with 100 micrographs to set up the parameters, and everything went well. I then set up a full analysis on the whole dataset of around 2000 micrographs; unfortunately, the jobs crash randomly…

In order to determine whether there were any corrupt images, I submitted the micrographs in increments of 100 and tried to remove the images that might have created the issue… unfortunately, it did not help…
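As a quick sanity check on the files themselves, outside of cryoSPARC, something along these lines could also be used to look for unreadable micrographs (just a rough sketch, assuming the mrcfile Python package is available; the glob pattern is a placeholder and would need to match the real project layout):

```python
# Rough sketch: check that every motion-corrected micrograph can actually be
# read, to rule out corrupt files. Assumes the mrcfile package is installed;
# the glob pattern is a placeholder for the real directory layout.
import glob
import mrcfile

bad = []
for path in sorted(glob.glob("S2/motioncorrected/*_patch_aligned_doseweighted.mrc")):
    try:
        with mrcfile.open(path, permissive=True) as mrc:
            _ = mrc.data.mean()  # force the data block to be read
    except Exception as exc:
        bad.append(path)
        print(f"Unreadable: {path} ({exc})")

print(f"{len(bad)} problematic file(s) found")
```

In any case, the increment tests all ended the same way: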

test 1- 355 mics

[CPU: 937.3 MB] Completed 354 of 355 : S2/motioncorrected/FoilHole_4458234_Data_4457001_4457003_20220205_234501_EER_patch_aligned_doseweighted.mrc
Picked 176 particles in 0.51s (260.80s total)
[CPU: 90.9 MB] ====== Job process terminated abnormally.

test 2- 353 mics

[CPU: 937.3 MB] Completed 352 of 353 : S2/motioncorrected/FoilHole_4458234_Data_4456995_4456997_20220205_234438_EER_patch_aligned_doseweighted.mrc
Picked 119 particles in 0.29s (114.67s total)
[CPU: 91.0 MB] ====== Job process terminated abnormally.

test 3- 350 mics

[CPU: 941.6 MB] Completed 349 of 350 : S2/motioncorrected/FoilHole_4458233_Data_4457019_4457021_20220205_234230_EER_patch_aligned_doseweighted.mrc
Picked 166 particles in 0.29s (115.89s total)

[CPU: 598.2 MB] Done in 116.16s
[CPU: 91.5 MB] ====== Job process terminated abnormally.

Of course, it is a bit hard to determine where the error comes from, since there is no error message…

Lastly, trying one more time with more micrographs, I ended up with this error message:

[CPU: 938.4 MB] Completed 359 of 360 : S2/motioncorrected/FoilHole_4458234_Data_4457016_4457018_20220205_234601_EER_patch_aligned_doseweighted.mrc
Picked 102 particles in 0.55s (263.81s total)
[CPU: 595.0 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 67, in cryosparc_compute.jobs.template_picker_gpu.run.run
  File "/home/dcigesrv5/cryoSPARC/cryosparc_worker/cryosparc_compute/dataset.py", line 465, in append_many
    newdata[field][startidx:startidx+num] = dset.data[field][mask]
KeyError: 'ctf/phase_shift_rad'
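
To see whether the exposures going into the picker actually carry this field, the exported dataset can be inspected directly; as far as I understand, .cs files are plain numpy structured arrays, so a quick check could look like this (the file path below is only a placeholder for the real exported exposures file):

```python
# Quick check of which fields the exported exposures dataset contains.
# Assumes .cs files can be loaded as numpy structured arrays with np.load;
# the path below is a placeholder, not the real project path.
import numpy as np

exposures = np.load("P13/exports/exposures_selected.cs")  # placeholder path
print(exposures.dtype.names)
print("ctf/phase_shift_rad present:", "ctf/phase_shift_rad" in exposures.dtype.names)
```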

I wonder if all those errors / abnormal terminations are due to micrographs that are empty or contain no filaments…

If you could help me with this issue, that would be great.

Just for information, we are running cryoSPARC 3.3.1.

Best

Bertrand


Hi @bertrand,

Thanks for reporting this! The Job process terminated abnormally error is sometimes seen when the job uses more RAM than it was allocated and gets killed, although at a CPU RAM of 900MB it seems unlikely that that is the case. Did you happen to try the same micrograph/template dataset with the standard template picker? As well, did you run CTF estimation (e.g. patch CTF estimation, CTFFIND, or otherwise) prior to running the tracer?
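
If it were an out-of-memory kill, the kernel log on the worker node would normally record it; a quick way to check could be something like the sketch below (assuming dmesg output is readable by your user on that machine):

```python
# Quick check for kernel out-of-memory kills on the worker node around the
# time the job died. Assumes `dmesg` can be run by the current user.
import subprocess

kernel_log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "Out of memory" in line or "oom-killer" in line:
        print(line)
```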

Also, could you check the log files of any of the failed jobs with the “job process terminated abnormally” error, and DM these to me? I have not seen the last error previously, but I will follow up over DM to help investigate further.

Best,
Michael

Hi Michael,

So here is a quick summary of what was done prior to running the Filament Tracer jobs:

  • all the micrographs collected on the microscope are monitored via a cryoSPARC Live session, where patch alignment and CTF estimation are performed on the fly.

  • standard blob picking is also performed, but it is unfortunately not really reliable for filament picking within a cryoSPARC Live session. On a side note, are you planning to integrate the Filament Tracer into cryoSPARC Live? That would be great!

  • based on the CTF estimation, the defocus range, and the ice thickness, the selected exposures are exported from the cryoSPARC Live session to the workspace for further processing.

From a data collection point of view, the density of filaments per micrograph is far from optimal, with some micrographs being definitely empty.

So, to go back to the errors, let's proceed in order:

For error 1 (jobs terminating abnormally), here is the job.log output:

================= CRYOSPARCW ======= 2022-02-07 09:04:16.799443 =========
Project P13 Job J106
Master dcigesrv5 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 3988080
MAIN PID 3988080
template_picker_gpu.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
free(): invalid next size (fast)
========= main process now complete.
========= monitor process now complete.

For error 2:

[CPU: 595.0 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 67, in cryosparc_compute.jobs.template_picker_gpu.run.run
  File "/home/dcigesrv5/cryoSPARC/cryosparc_worker/cryosparc_compute/dataset.py", line 465, in append_many
    newdata[field][startidx:startidx+num] = dset.data[field][mask]
KeyError: 'ctf/phase_shift_rad'

And the job.log is the following:


Running job on hostname %s dcigesrv5

Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'dcigesrv5', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [36, 37, 38, 39], 'GPU': [6], 'RAM': [12]}, 'target': {'cache_path': '/mnt/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 1, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 2, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 3, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 4, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 5, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 6, 'mem': 47641198592, 'name': 'NVIDIA A40'}, {'id': 7, 'mem': 47641198592, 'name': 'NVIDIA A40'}], 'hostname': 'dcigesrv5', 'lane': 'default', 'monitor_port': None, 'name': 'dcigesrv5', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'dcigesrv5@dcigesrv5', 'title': 'Worker node dcigesrv5', 'type': 'node', 'worker_bin_path': '/home/dcigesrv5/cryoSPARC/cryosparc_worker/bin/cryosparcw'}}

min: -0.000079 max: 0.000082
min: -0.000069 max: 0.000073
min: -0.000062 max: 0.000065
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
min: -112.958197 max: 112.770471
min: -332.227082 max: 331.476180
min: -123.093775 max: 122.963240
min: -352.273185 max: 351.751046
min: -135.881165 max: 135.700775
min: -147.156035 max: 146.999513
min: -150.380116 max: 150.254642
min: -137.038977 max: 136.853342
min: -128.948427 max: 128.818480
min: -126.053513 max: 125.876472
min: -123.258000 max: 123.171726
min: -123.991157 max: 123.754822
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

Just to note: both jobs were submitted with the same number and the same pool of micrographs, and we did not use a phase plate for the data collection.
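
Since no phase plate was used, the phase shift should be zero everywhere anyway; so if the exported exposures are simply missing that column, I wonder whether padding it with zeros on a copy of the .cs file would be an acceptable workaround. Purely an untested sketch, with a placeholder path:

```python
# Untested sketch: pad a missing 'ctf/phase_shift_rad' column with zeros in a
# copy of the exported exposures .cs file (no phase plate was used, so zero
# should be the correct value). The path is a placeholder; this would only be
# tried on a copy, never on the original project files.
import numpy as np
from numpy.lib import recfunctions as rfn

path = "P13/exports/exposures_selected.cs"  # placeholder
data = np.load(path)

if "ctf/phase_shift_rad" not in data.dtype.names:
    patched = rfn.append_fields(
        data, "ctf/phase_shift_rad",
        np.zeros(len(data), dtype=np.float32),
        usemask=False,
    )
    with open(path + ".patched", "wb") as f:
        np.save(f, patched)
```

Of course, I would rather understand why the field went missing in the export in the first place.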

Best

Bertrand