I’m having a similar issue with Patch Motion Correction, which fails with the error message “Job is unresponsive - no heartbeat received in 180 seconds.”
Here are the outputs of the following commands:
cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20
[{'cache_path': '/mnt/Scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': '10.250.102.2', 'lane': 'Amphitrite', 'monitor_port': None, 'name': '10.250.102.2', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparcuser@10.250.102.2', 'title': 'Worker node 10.250.102.2', 'type': 'node', 'worker_bin_path': '/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
[Tue, 24 Sep 2024 00:44:30 GMT] License is valid.
[Tue, 24 Sep 2024 00:44:30 GMT] Launching job on lane Amphitrite target 10.250.102.2 ...
[Tue, 24 Sep 2024 00:44:30 GMT] Running job on remote worker node hostname 10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Job J203 Started
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Master running v4.5.3, worker running v4.5.3
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Working in directory: /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J203
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Running on lane Amphitrite
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Resources allocated:
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Worker: 10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] CPU : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] GPU : [0, 1, 2, 3]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] RAM : [0, 1, 2, 3, 4, 5, 6, 7]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] SSD : False
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] --------------------------------------------------------------
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Importing job module for job type patch_motion_correction_multi...
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job ready to run
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] ***************************************************************
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will process this many movies: 971
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will output denoiser training data for this many movies: 200
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Random seed: 690708223
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] parent process is 57387
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57453
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57456
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57454
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57455
[Tue, 24 Sep 2024 00:44:39 GMT] [CPU RAM used: 294 MB] -- 2.0: processing 4 of 971: J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
loading /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
Loading raw movie data from J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc ...
Done in 15.53s
Processing ...
Done in 17.77s
Completed rigid and patch motion with (Z:5,Y:8,X:8) knots
Writing non-dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned.mrc ...
Done in 0.18s
Writing 120x120 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@1x.png ...
Done in 0.01s
Writing 240x240 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@2x.png ...
Done in 0.01s
Writing dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned_doseweighted.mrc ...
Done in 0.15s
Traceback (most recent call last):
File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
========= sending heartbeat at 2024-09-24 11:55:54.559373
========= sending heartbeat at 2024-09-24 11:56:04.579781
========= sending heartbeat at 2024-09-24 11:56:14.600115
========= sending heartbeat at 2024-09-24 11:56:24.620202
========= sending heartbeat at 2024-09-24 11:56:34.640542
========= sending heartbeat at 2024-09-24 11:56:44.662026
========= sending heartbeat at 2024-09-24 11:56:54.682136
========= sending heartbeat at 2024-09-24 11:57:04.701806
========= sending heartbeat at 2024-09-24 11:57:14.720815
========= sending heartbeat at 2024-09-24 11:57:24.741521
========= sending heartbeat at 2024-09-24 11:57:34.761958
========= sending heartbeat at 2024-09-24 11:57:44.782397
========= sending heartbeat at 2024-09-24 11:57:54.803461
========= sending heartbeat at 2024-09-24 11:58:04.824165
========= sending heartbeat at 2024-09-24 11:58:14.844469
========= sending heartbeat at 2024-09-24 11:58:24.864399
========= sending heartbeat at 2024-09-24 11:58:34.885077
========= sending heartbeat at 2024-09-24 11:58:44.906332
========= sending heartbeat at 2024-09-24 11:58:54.926720
/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 57387 Terminated python -c "import cryosparc_compute.run as run; run.run()" "$@"
There were no “Out of memory” errors when the job failed (checked with sudo journalctl | grep -i oom).
When we run the job again, it sometimes fails (at a different time) and sometimes finishes.
Any idea on how we could troubleshoot what’s causing the heartbeat to be lost?
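In case it helps narrow things down, here is a minimal sketch for checking whether the worker-side heartbeats in the job log ever stall. It only assumes the "sending heartbeat at <timestamp>" line format shown in the joblog output above; the function name heartbeat_gaps and the 30-second threshold are my own choices, not anything from CryoSPARC. In the tail pasted above, the heartbeats are about 10 seconds apart right up to the point where the process is terminated, so the script is meant for scanning the full log rather than just the last 20 lines.

```python
import re
import sys
from datetime import datetime

# Matches the heartbeat lines as they appear in the joblog output above.
HEARTBEAT_RE = re.compile(
    r"sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def heartbeat_gaps(lines, max_gap_s=30.0):
    """Return (previous, current, gap_seconds) for each pair of consecutive
    heartbeats that are more than max_gap_s seconds apart.

    max_gap_s is an arbitrary threshold chosen for illustration.
    """
    stamps = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )

    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > max_gap_s:
            gaps.append((prev, cur, delta))
    return gaps


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as log_file:
        for prev, cur, delta in heartbeat_gaps(log_file):
            print(f"{delta:.1f}s gap between {prev} and {cur}")
```

To use it, I would save the job log to a file first (for example by redirecting the cryosparcm joblog $projectid $jobid output to joblog.txt) and pass that path as the only argument. If it reports no gaps, the worker kept sending heartbeats and the problem is more likely on the path between worker and master than in the job process itself, though that is only my reading of the log.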