'No heartbeat' Error

Hello,

I am using CryoSPARC v4.2.0 and I get the error:

Job is unresponsive - no heartbeat received in 180 seconds

Small jobs, such as importing movies, run fine, but Patch Motion Correction (PMC) fails after anywhere between 100 and 500 micrographs. Only once did PMC get through 1,000 micrographs. This makes it impossible to process a large dataset, so I am at a dead end. Does anyone have any advice or experience with this issue? Could it be caused by the way CryoSPARC is set up?

I have put the event log below:

[CPU:  367.9 MB  Avail: 249.92 GB]
--------------------------------------------------------------

[CPU:  367.9 MB  Avail: 249.92 GB]
Processed 500 of 3157 movies in 5109.30s 

[CPU:   1.95 GB  Avail: 251.21 GB]
-- 0.0: processing 503 of 3157: J73/imported/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions.mrc
        Loading raw movie data from J73/imported/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions.mrc ...
        Done in 6.26s
        Processing ...
        Done in 2.75s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.04s
        Writing background estimate to J79/motioncorrected/011536600233019576460_FoilHole_22123076_Data_22074515_22074517_20230225_213212_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   3.23 GB  Avail: 249.95 GB]
-- 0.0: processing 504 of 3157: J73/imported/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions.mrc
        Loading raw movie data from J73/imported/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions.mrc ...
        Done in 6.23s
        Processing ...
        Done in 2.74s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.03s
        Writing background estimate to J79/motioncorrected/016050821538335556846_FoilHole_22123077_Data_22074515_22074517_20230225_213043_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   1.92 GB  Avail: 251.25 GB]
-- 0.0: processing 505 of 3157: J73/imported/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions.mrc
        Loading raw movie data from J73/imported/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions.mrc ...
        Done in 6.39s
        Processing ...
        Done in 2.68s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.03s
        Writing background estimate to J79/motioncorrected/002847857095426660098_FoilHole_22123080_Data_22074515_22074517_20230225_215711_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   3.23 GB  Avail: 249.95 GB]
-- 0.0: processing 506 of 3157: J73/imported/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions.mrc
        Loading raw movie data from J73/imported/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions.mrc ...
        Done in 6.88s
        Processing ...
        Done in 2.66s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.03s
        Writing background estimate to J79/motioncorrected/004062294440860625095_FoilHole_22123081_Data_22074515_22074517_20230225_215145_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   1.92 GB  Avail: 251.26 GB]
-- 0.0: processing 507 of 3157: J73/imported/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions.mrc
        Loading raw movie data from J73/imported/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions.mrc ...
        Done in 6.54s
        Processing ...
        Done in 2.62s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.03s
        Writing background estimate to J79/motioncorrected/002500836666533175262_FoilHole_22123082_Data_22074515_22074517_20230225_215159_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   3.23 GB  Avail: 249.95 GB]
-- 0.0: processing 508 of 3157: J73/imported/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions.mrc
        Loading raw movie data from J73/imported/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions.mrc ...
        Done in 6.54s
        Processing ...
        Done in 2.62s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.03s
        Writing background estimate to J79/motioncorrected/011861353430844636542_FoilHole_22123083_Data_22074515_22074517_20230225_215216_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   1.95 GB  Avail: 251.20 GB]
-- 0.0: processing 509 of 3157: J73/imported/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions.mrc
        Loading raw movie data from J73/imported/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions.mrc ...
        Done in 6.59s
        Processing ...
        Done in 2.58s
        Completed rigid and patch motion with (Z:4,Y:6,X:6) knots
        Writing non-dose-weighted result to J79/motioncorrected/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions_patch_aligned.mrc ...
        Done in 0.03s
        Writing 120x120 micrograph thumbnail to J79/thumbnails/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions_thumb_@1x.png ...
        Done in 0.00s
        Writing 240x240 micrograph thumbnail to J79/thumbnails/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J79/motioncorrected/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.03s
        Writing background estimate to J79/motioncorrected/007513752052181863553_FoilHole_22123084_Data_22074515_22074517_20230225_215228_Fractions_background.mrc ...
        Done in 0.00s
        Writing motion estimates...
        Done in 0.00s

[CPU:   3.23 GB  Avail: 249.93 GB]
-- 0.0: processing 510 of 3157: J73/imported/002167456673077019549_FoilHole_22123085_Data_22074515_22074517_20230225_215241_Fractions.mrc
        loading /opt/cryosparc/cryosparc_worker/cryosparc_compute/CS-rrp2-glacios/CS-rrp2-glacios-ii/J73/imported/002167456673077019549_FoilHole_22123085_Data_22074515_22074517_20230225_215241_Fractions.mrc
        Loading raw movie data from J73/imported/002167456673077019549_FoilHole_22123085_Data_22074515_22074517_20230225_215241_Fractions.mrc ...
        Done in 6.61s
        Processing ...

Job is unresponsive - no heartbeat received in 180 seconds.

Many thanks in advance!
Chloe
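
For anyone triaging an event log like the one above, here is a minimal Python sketch that pulls the memory headers and the progress line out of pasted event-log text. It assumes the `[CPU: ... Avail: ...]` header and the `Processed N of M movies in Ts` line look exactly as they do in this post. `summarize_event_log` and its dictionary keys are hypothetical helper names, not part of CryoSPARC.

```python
import re

# Patterns assumed from the event-log excerpt in this post.
HEADER = re.compile(
    r"\[CPU:\s*([\d.]+)\s*(MB|GB)\s+Avail:\s*([\d.]+)\s*(MB|GB)\]"
)
PROGRESS = re.compile(r"Processed (\d+) of (\d+) movies in ([\d.]+)s")


def to_gb(value, unit):
    """Convert a logged memory figure to GB."""
    return float(value) / 1024 if unit == "MB" else float(value)


def summarize_event_log(text):
    """Return peak job memory, lowest available memory, and throughput."""
    headers = list(HEADER.finditer(text))
    cpu = [to_gb(m.group(1), m.group(2)) for m in headers]
    avail = [to_gb(m.group(3), m.group(4)) for m in headers]
    summary = {
        "peak_cpu_gb": max(cpu) if cpu else None,
        "min_avail_gb": min(avail) if avail else None,
    }
    progress = PROGRESS.search(text)
    if progress:
        done = int(progress.group(1))
        total = int(progress.group(2))
        seconds = float(progress.group(3))
        per_movie = seconds / done
        summary["seconds_per_movie"] = per_movie
        summary["est_remaining_hours"] = (total - done) * per_movie / 3600
    return summary


if __name__ == "__main__":
    sample = (
        "[CPU:  367.9 MB  Avail: 249.92 GB]\n"
        "Processed 500 of 3157 movies in 5109.30s\n"
        "[CPU:   3.23 GB  Avail: 249.95 GB]\n"
    )
    print(summarize_event_log(sample))
```

For the excerpt above, 5109.30 s for 500 movies works out to roughly 10.2 s per movie, and the logged available memory stays close to 250 GB throughout.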

Welcome to the forum @cms219.
Please can you inspect the job log (Metadata|Log tab) for additional hints?

Hi wtempel, thanks for the welcome.

This is the start of the metadata log:
{
  "id": "6407686c4adb21c74b7bd49b",
  "children": [],
  "cloned_from": "J2",
  "completed_at": null,
  "created_at": "2023-03-07T16:38:04.120Z",
  "created_by_user_id": "63861a37779893a0e6f31afb",
  "deleted": false,
  "description": "Enter a description.",
  "failed_at": "2023-03-07T17:40:01.030Z",
  "interactive": false,
  "interactive_hostname": "rcgpu03.rc-harwell.ac.uk",
  "interactive_port": null,
  "job_type": "patch_motion_correction_multi",
  "killed_at": null,
  "last_exported": "2023-03-07T16:38:04.183Z",
  "launched_at": "2023-03-07T16:38:11.867Z",
  "output_group_images": {"micrographs": "6407689c8aa50ef4a1a9f372"},
  "output_result_groups": [
    {
      "uid": "J76-G0",
      "type": "exposure",
      "name": "micrographs",
      "title": "Micrographs",
      "description": "",
      "contains": [
        {
          "uid": "J76-R0",
          "type": "exposure.micrograph_blob",
          "group_name": "micrographs",
          "name": "micrograph_blob_non_dw",
          "passthrough": false
        },
        {
          "uid": "J76-R1",
          "type": "exposure.thumbnail_blob",
          "group_name": "micrographs",
          "name": "micrograph_thumbnail_blob_1x",
          "passthrough": false
        },
        {
          "uid": "J76-R2",
          "type": "exposure.thumbnail_blob",
          "group_name": "micrographs",
          "name": "micrograph_thumbnail_blob_2x",
          "passthrough": false
        },
        {
          "uid": "J76-R3",
          "type": "exposure.micrograph_blob",
          "group_name": "micrographs",
          "name": "micrograph_blob",
          "passthrough": false
        },
        {
          "uid": "J76-R4",
          "type": "exposure.stat_blob",
          "group_name": "micrographs",
          "name": "background_blob",
          "passthrough": false
        },
        {
          "uid": "J76-R5",
          "type": "exposure.motion",
          "group_name": "micrographs",
          "name": "rigid_motion",
          "passthrough": false
        },
        {
          "uid": "J76-R6",
          "type": "exposure.motion",
          "group_name": "micrographs",
          "name": "spline_motion",
          "passthrough": false
        },
        {
          "uid": "J76-R7",
          "type": "exposure.movie_blob",
          "group_name": "micrographs",
          "name": "movie_blob",
          "passthrough": true
        },
        {
          "uid": "J76-R8",
          "type": "exposure.mscope_params",
          "group_name": "micrographs",
          "name": "mscope_params",
          "passthrough": true
        }
      ],
      "passthrough": "movies",
      "num_items": 320,
      "summary": {},
      "summary_stats": [
        {
          "motion_total_pix_hist": [
            1,
            0,

I am not sure what to look for that could be causing the error. Thanks in advance for your help!

@cms219 It looks like you posted the contents of the Data sub-tab under Metadata. What does the Log sub-tab show?

My apologies! I have pasted the log sub-tab output below:

================= CRYOSPARCW ======= 2023-03-08 18:56:06.592706 =========
Project P6 Job J79
Master rcgpu03.rc-harwell.ac.uk Port 39002

========= monitor process now starting main process at 2023-03-08 18:56:06.592762
MAINPROCESS PID 3306
MAIN PID 3306
motioncorrection.run_patch cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process


Running job on hostname %s rcgpu03.rc-harwell.ac.uk
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'rcgpu03.rc-harwell.ac.uk', 'lane': 'default', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0, 1]}, 'target': {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11552227328, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'rcgpu03.rc-harwell.ac.uk', 'lane': 'default', 'monitor_port': None, 'name': 'rcgpu03.rc-harwell.ac.uk', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc@rcgpu03.rc-harwell.ac.uk', 'title': 'Worker node rcgpu03.rc-harwell.ac.uk', 'type': 'node', 'worker_bin_path': '/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}}
========= sending heartbeat at 2023-03-08 18:56:28.678946
gpufft: creating new cufft plan (plan id 0 pid 3356)
gpu_id 0
ndims 2
dims 576 576 0
inembed 576 578 0
istride 1
idist 332928
onembed 576 289 0
ostride 1
odist 166464
batch 81
type R2C
wkspc automatic
Python traceback:

gpufft: creating new cufft plan (plan id 1 pid 3356)
gpu_id 0
ndims 2
dims 4096 4096 0
inembed 4096 4098 0
istride 1
idist 16785408
onembed 4096 2049 0
ostride 1
odist 8392704
batch 1
type R2C
wkspc manual
Python traceback:

gpufft: creating new cufft plan (plan id 2 pid 3356)
gpu_id 0
ndims 2
dims 8192 8192 0
inembed 8192 4097 0
istride 1
idist 33562624
onembed 8192 8194 0
ostride 1
odist 67125248
batch 1
type C2R
wkspc manual
Python traceback:

gpufft: creating new cufft plan (plan id 3 pid 3356)
gpu_id 0
ndims 2
dims 8192 8192 0
inembed 8192 8194 0
istride 1
idist 67125248
onembed 8192 4097 0
ostride 1
odist 33562624
batch 1
type R2C
wkspc manual
Python traceback:

gpufft: creating new cufft plan (plan id 4 pid 3356)
gpu_id 0
ndims 2
dims 4096 4096 0
inembed 4096 2049 0
istride 1
idist 8392704
onembed 4096 4098 0
ostride 1
odist 16785408
batch 1
type C2R
wkspc manual
Python traceback:

========= sending heartbeat at 2023-03-08 18:56:38.695628
========= sending heartbeat at 2023-03-08 18:56:48.709868
========= sending heartbeat at 2023-03-08 18:56:58.724717
========= sending heartbeat at 2023-03-08 18:57:08.740761
========= sending heartbeat at 2023-03-08 18:57:18.752378
========= sending heartbeat at 2023-03-08 18:57:28.769173
========= sending heartbeat at 2023-03-08 18:57:38.776455
========= sending heartbeat at 2023-03-08 18:57:48.785724
========= sending heartbeat at 2023-03-08 18:57:58.798619
========= sending heartbeat at 2023-03-08 18:58:08.812622
========= sending heartbeat at 2023-03-08 18:58:18.829477
========= sending heartbeat at 2023-03-08 18:58:28.843690
========= sending heartbeat at 2023-03-08 18:58:38.860819
========= sending heartbeat at 2023-03-08 18:58:48.877424
========= sending heartbeat at 2023-03-08 18:58:58.894351
========= sending heartbeat at 2023-03-08 18:59:08.910505
========= sending heartbeat at 2023-03-08 18:59:18.925774
========= sending heartbeat at 2023-03-08 18:59:28.939771
========= sending heartbeat at 2023-03-08 18:59:38.953845
========= sending heartbeat at 2023-03-08 18:59:48.968102
========= sending heartbeat at 2023-03-08 18:59:58.982718
========= sending heartbeat at 2023-03-08 19:00:08.999806
========= sending heartbeat at 2023-03-08 19:00:19.016343
========= sending heartbeat at 2023-03-08 19:00:29.033006
========= sending heartbeat at 2023-03-08 19:00:39.049232
========= sending heartbeat at 2023-03-08 19:00:49.065986
========= sending heartbeat at 2023-03-08 19:00:59.080091
========= sending heartbeat at 2023-03-08 19:01:09.094416
========= sending heartbeat at 2023-03-08 19:01:19.108370
========= sending heartbeat at 2023-03-08 19:01:29.122666
========= sending heartbeat at 2023-03-08 19:01:39.137022
========= sending heartbeat at 2023-03-08 19:01:49.153698
========= sending heartbeat at 2023-03-08 19:01:59.170510
========= sending heartbeat at 2023-03-08 19:02:09.187100
========= sending heartbeat at 2023-03-08 19:02:19.203605
========= sending heartbeat at 2023-03-08 19:02:29.219915
========= sending heartbeat at 2023-03-08 19:02:39.233738
========= sending heartbeat at 2023-03-08 19:02:49.245669
========= sending heartbeat at 2023-03-08 19:02:59.260032
========= sending heartbeat at 2023-03-08 19:03:09.272785
========= sending heartbeat at 2023-03-08 19:03:19.289151
========= sending heartbeat at 2023-03-08 19:03:29.306033
========= sending heartbeat at 2023-03-08 19:03:39.323146
========= sending heartbeat at 2023-03-08 19:03:49.340338
========= sending heartbeat at 2023-03-08 19:03:59.356650
========= sending heartbeat at 2023-03-08 19:04:09.372732
========= sending heartbeat at 2023-03-08 19:04:19.387387
========= sending heartbeat at 2023-03-08 19:04:29.401506
========= sending heartbeat at 2023-03-08 19:04:39.409727
========= sending heartbeat at 2023-03-08 19:04:49.424001
========= sending heartbeat at 2023-03-08 19:04:59.440976
========= sending heartbeat at 2023-03-08 19:05:09.457702
========= sending heartbeat at 2023-03-08 19:05:19.474555
========= sending heartbeat at 2023-03-08 19:05:29.491682
========= sending heartbeat at 2023-03-08 19:05:39.508828
========= sending heartbeat at 2023-03-08 19:05:49.525891
========= sending heartbeat at 2023-03-08 19:05:59.542479
========= sending heartbeat at 2023-03-08 19:06:09.556245
========= sending heartbeat at 2023-03-08 19:06:19.570074
========= sending heartbeat at 2023-03-08 19:06:29.584297
========= sending heartbeat at 2023-03-08 19:06:39.598453
========= sending heartbeat at 2023-03-08 19:06:49.615265
========= sending heartbeat at 2023-03-08 19:06:59.631255
========= sending heartbeat at 2023-03-08 19:07:09.648167
========= sending heartbeat at 2023-03-08 19:07:19.664879
========= sending heartbeat at 2023-03-08 19:07:29.681926
========= sending heartbeat at 2023-03-08 19:07:39.697988
========= sending heartbeat at 2023-03-08 19:07:49.712239
========= sending heartbeat at 2023-03-08 19:07:59.726642
========= sending heartbeat at 2023-03-08 19:08:09.741080
========= sending heartbeat at 2023-03-08 19:08:19.757636
========= sending heartbeat at 2023-03-08 19:08:29.774056
========= sending heartbeat at 2023-03-08 19:08:39.790420
========= sending heartbeat at 2023-03-08 19:08:49.807039
========= sending heartbeat at 2023-03-08 19:08:59.823919
========= sending heartbeat at 2023-03-08 19:09:09.838084
========= sending heartbeat at 2023-03-08 19:09:19.852453
========= sending heartbeat at 2023-03-08 19:09:29.866441
========= sending heartbeat at 2023-03-08 19:09:39.880436
========= sending heartbeat at 2023-03-08 19:09:49.895242
========= sending heartbeat at 2023-03-08 19:09:59.911845
========= sending heartbeat at 2023-03-08 19:10:09.928757
========= sending heartbeat at 2023-03-08 19:10:19.945006
========= sending heartbeat at 2023-03-08 19:10:29.961409
========= sending heartbeat at 2023-03-08 19:10:39.977404
========= sending heartbeat at 2023-03-08 19:10:49.992305
========= sending heartbeat at 2023-03-08 19:11:00.006778
========= sending heartbeat at 2023-03-08 19:11:10.020938
========= sending heartbeat at 2023-03-08 19:11:20.035463
========= sending heartbeat at 2023-03-08 19:11:30.051454
========= sending heartbeat at 2023-03-08 19:11:40.066126
========= sending heartbeat at 2023-03-08 19:11:50.082385
========= sending heartbeat at 2023-03-08 19:12:00.099548
========= sending heartbeat at 2023-03-08 19:12:10.115914
========= sending heartbeat at 2023-03-08 19:12:20.131956
========= sending heartbeat at 2023-03-08 19:12:30.144355
========= sending heartbeat at 2023-03-08 19:12:40.158120
========= sending heartbeat at 2023-03-08 19:12:50.172504
========= sending heartbeat at 2023-03-08 19:13:00.187261
========= sending heartbeat at 2023-03-08 19:13:10.203823
========= sending heartbeat at 2023-03-08 19:13:20.220690
========= sending heartbeat at 2023-03-08 19:13:30.237001
========= sending heartbeat at 2023-03-08 19:13:40.253514
========= sending heartbeat at 2023-03-08 19:13:50.269939
========= sending heartbeat at 2023-03-08 19:14:00.286478
========= sending heartbeat at 2023-03-08 19:14:10.303471
========= sending heartbeat at 2023-03-08 19:14:20.319581
========= sending heartbeat at 2023-03-08 19:14:30.333679
========= sending heartbeat at 2023-03-08 19:14:40.348158
========= sending heartbeat at 2023-03-08 19:14:50.365209
========= sending heartbeat at 2023-03-08 19:15:00.381781
========= sending heartbeat at 2023-03-08 19:15:10.398784
========= sending heartbeat at 2023-03-08 19:15:20.415553
========= sending heartbeat at 2023-03-08 19:15:30.432790
========= sending heartbeat at 2023-03-08 19:15:40.449524
========= sending heartbeat at 2023-03-08 19:15:50.465682
========= sending heartbeat at 2023-03-08 19:16:00.479952
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
min: -641776.562500 max: 572624.281250
min: -191241.972656 max: 174484.324219
min: -654414.875000 max: 583121.593750
min: -209562.205078 max: 190873.623047
min: -788177.820312 max: 722426.429688
min: -245954.465820 max: 227463.674805
min: -615234.976562 max: 547811.867188
min: -197324.496094 max: 180175.089844
min: -618457.687500 max: 548741.625000
min: -193157.965820 max: 175343.588867
min: -694351.273438 max: 626337.695312
min: -240365.065430 max: 221422.028320
min: -732289.812500 max: 663732.906250
min: -244909.000977 max: 227243.186523
min: -647900.277344 max: 581469.628906
min: -208593.427734 max: 191225.900391
min: -600519.632812 max: 531352.429688
min: -188919.782227 max: 170908.608398
min: -644096.019531 max: 573264.136719
min: -197018.568359 max: 178414.275391
min: -594455.402344 max: 523190.472656
min: -185863.918945 max: 167279.440430
min: -710557.148438 max: 641540.695312
min: -238922.260742 max: 220941.083008
min: -642745.007812 max: 567995.273438
min: -197964.875000 max: 179538.367188
min: -604379.781250 max: 534836.843750
min: -188782.970703 max: 170995.662109
min: -602277.976562 max: 530385.335938
min: -183329.297852 max: 166042.030273
min: -877510.441406 max: 810801.246094
min: -296653.422852 max: 279374.139648
min: -691158.812500 max: 624485.187500
min: -241116.757812 max: 223823.992188
min: -658808.132812 max: 591257.273438
min: -225517.277344 max: 208008.597656
min: -650919.929688 max: 579780.382812
min: -220457.255859 max: 202110.369141
min: -619327.703125 max: 551188.109375
min: -193456.656738 max: 177199.577637
min: -623789.496094 max: 555675.597656
min: -201806.374023 max: 184004.993164
min: -633371.722656 max: 561942.839844
min: -208044.336914 max: 189348.038086
min: -633503.628906 max: 563267.339844
min: -206359.802734 max: 187869.259766
min: -800275.058594 max: 730604.503906
min: -278035.362305 max: 259721.340820
min: -829482.820312 max: 761279.367188
min: -292316.752930 max: 273544.606445
min: -656598.242188 max: 587752.164062
min: -212927.205078 max: 193715.060547
min: -624082.667969 max: 554785.519531
min: -199751.759766 max: 181493.537109
min: -610800.992188 max: 543005.101562
min: -194674.453125 max: 177101.671875
min: -613409.476562 max: 540820.554688
min: -194321.422852 max: 175331.967773
min: -615033.648438 max: 546916.882812
min: -201736.604492 max: 184037.731445
min: -640568.328125 max: 569868.578125
min: -207051.189453 max: 188358.701172
min: -625595.234375 max: 554791.796875
min: -201640.612305 max: 182941.106445
min: -646084.371094 max: 577195.066406
min: -198838.219727 max: 182334.327148
min: -586696.910156 max: 517076.871094
min: -178513.762695 max: 161877.745117
min: -589615.113281 max: 517892.449219
min: -177467.962891 max: 160244.732422
min: -590533.683594 max: 519207.003906
min: -183256.637695 max: 163327.807617
min: -595871.535156 max: 527275.683594
min: -184291.535156 max: 166668.480469
min: -721034.140625 max: 649710.390625
min: -241076.748047 max: 222505.001953
min: -682404.527344 max: 614553.785156
min: -233137.676758 max: 215562.151367
min: -651899.925781 max: 584155.855469
min: -220102.049805 max: 202569.872070
min: -660355.156250 max: 590639.687500
min: -215954.384766 max: 197853.193359
min: -643619.789062 max: 571286.742188
min: -207184.756836 max: 188433.540039
min: -609655.984375 max: 536945.890625
min: -195692.598633 max: 175548.518555
min: -658922.718750 max: 586521.468750
min: -205930.160156 max: 186951.128906
min: -632227.660156 max: 561848.183594
min: -204136.345703 max: 185714.099609
min: -680657.261719 max: 609861.925781
min: -215989.880859 max: 198845.759766
min: -639165.312500 max: 568030.843750
min: -203198.720703 max: 184429.427734
min: -650890.476562 max: 579535.992188
min: -211834.450195 max: 193101.377930
min: -651992.429688 max: 583071.007812
min: -218551.438477 max: 199375.749023
min: -643422.773438 max: 573168.351562
min: -205725.643555 max: 187247.918945
min: -658660.750000 max: 587620.062500
min: -209436.891602 max: 192237.327148
min: -594165.968750 max: 527443.843750
min: -184631.174316 max: 168693.872559
min: -654687.140625 max: 583292.765625
min: -212360.744141 max: 193733.412109
min: -591014.355469 max: 519697.113281
min: -183051.890625 max: 164457.328125
min: -656327.722656 max: 585491.339844
min: -215706.336914 max: 197173.772461
min: -717042.171875 max: 648884.359375
min: -246591.175781 max: 228693.761719
min: -604435.265625 max: 533107.734375
min: -192218.518555 max: 172287.622070
min: -599538.773438 max: 529235.132812
min: -188994.397461 max: 170663.040039
min: -588863.656250 max: 520319.687500
min: -188191.375977 max: 170124.147461
min: -609330.265625 max: 537054.484375
min: -187800.292969 max: 168843.644531
min: -871621.720703 max: 806844.154297
min: -272147.219727 max: 256533.733398
min: -768467.855469 max: 696111.613281
min: -227256.238281 max: 206999.808594
min: -585813.488281 max: 511853.324219
min: -180256.891602 max: 161300.889648
min: -599738.617188 max: 529693.226562
min: -183937.105469 max: 167042.714844
min: -674117.347656 max: 603129.933594
min: -230473.288086 max: 211915.196289
min: -639669.996094 max: 572176.160156
min: -207243.318359 max: 191134.228516
min: -659690.824219 max: 591421.769531
min: -209159.789551 max: 192888.210449
min: -605236.218750 max: 533021.531250
min: -191000.091797 max: 172226.666016
min: -638787.675781 max: 566029.292969
min: -209576.843750 max: 190601.828125
min: -629897.464844 max: 558857.410156
min: -200559.301758 max: 183469.573242
min: -626644.554688 max: 558014.476562
min: -202361.601562 max: 184359.867188
min: -621149.203125 max: 550997.765625
min: -197159.115234 max: 178718.876953
min: -636733.437500 max: 564679.750000
min: -201549.292969 max: 183067.957031
min: -670740.171875 max: 601271.546875
min: -219457.823242 max: 201282.661133
min: -661715.660156 max: 591472.933594
min: -220172.929688 max: 201763.851562
min: -649322.480469 max: 578624.769531
min: -214269.990234 max: 195814.072266
min: -638406.523438 max: 567051.882812
min: -202715.228516 max: 185551.943359
min: -616948.519531 max: 548919.355469
min: -196714.735352 max: 178862.530273
min: -600300.121094 max: 530495.597656
min: -192245.525391 max: 173935.880859
min: -640836.281250 max: 570007.218750
min: -210457.972656 max: 192060.417969
min: -688396.718750 max: 622601.875000
min: -241469.333984 max: 224400.900391
min: -599570.773438 max: 530241.039062
min: -187777.479492 max: 169610.692383
min: -611358.257812 max: 540263.367188
min: -190811.192383 max: 172191.034180
min: -609290.406250 max: 538669.062500
min: -195495.863281 max: 177115.972656
min: -720595.542969 max: 648163.738281
min: -247448.144531 max: 228453.292969
min: -612632.730469 max: 542370.550781
min: -198695.968750 max: 180584.585938
min: -733211.253906 max: 661400.058594
min: -233285.159180 max: 214639.122070
min: -620854.796875 max: 551948.796875
min: -201052.118164 max: 182925.686523
min: -615556.304688 max: 546796.882812
min: -201483.201172 max: 183326.658203
min: -807248.910156 max: 742998.371094
min: -287865.810547 max: 271048.205078
min: -672942.152344 max: 602634.597656
min: -231031.649414 max: 211339.584961
min: -704926.210938 max: 635474.164062
min: -245330.805664 max: 227037.897461
min: -707459.355469 max: 639285.738281
min: -249374.923828 max: 231458.701172
min: -641738.183594 max: 569024.503906
min: -207078.015625 max: 188068.953125
min: -635354.101562 max: 565260.429688
min: -211490.873047 max: 193038.314453
min: -606450.828125 max: 537357.203125
min: -191224.144531 max: 173073.027344
min: -662341.734375 max: 594133.171875
min: -213075.834961 max: 196737.290039
min: -646476.441406 max: 576389.183594
min: -212125.317383 max: 192408.698242
min: -654753.437500 max: 583794.406250
min: -209814.413086 max: 192455.680664
min: -637119.316406 max: 568667.871094
min: -186515.682617 max: 170039.333008
min: -602719.753906 max: 532063.714844
min: -187373.257812 max: 170416.984375
min: -601941.515625 max: 531657.984375
min: -188170.198242 max: 170119.848633
min: -665594.066406 max: 595613.433594
min: -226106.278320 max: 207607.565430
min: -637250.425781 max: 566735.730469
min: -209941.474609 max: 191295.306641

Due to the word limit I could not include the whole log, but it essentially alternates between a run of ‘sending heartbeat’ lines and a run of ‘min’ lines.
Thanks!

Thanks @cms219. Please can you also post the final lines of that log, including the latest occurrences of lines that mention “heartbeat”?

Hi @wtempel,

Please see below:

========= sending heartbeat at 2023-03-08 20:08:56.350635
========= sending heartbeat at 2023-03-08 20:09:06.366888
========= sending heartbeat at 2023-03-08 20:09:16.380812
========= sending heartbeat at 2023-03-08 20:09:26.394885
========= sending heartbeat at 2023-03-08 20:09:36.409161
========= sending heartbeat at 2023-03-08 20:09:46.425856
========= sending heartbeat at 2023-03-08 20:09:56.442456
========= sending heartbeat at 2023-03-08 20:10:06.459345
========= sending heartbeat at 2023-03-08 20:10:16.475182
========= sending heartbeat at 2023-03-08 20:10:26.491448
========= sending heartbeat at 2023-03-08 20:10:36.508137
========= sending heartbeat at 2023-03-08 20:10:46.524336
========= sending heartbeat at 2023-03-08 20:10:56.540910
========= sending heartbeat at 2023-03-08 20:11:06.557010
========= sending heartbeat at 2023-03-08 20:11:16.573356
========= sending heartbeat at 2023-03-08 20:11:26.589831
========= sending heartbeat at 2023-03-08 20:11:36.606808
========= sending heartbeat at 2023-03-08 20:11:46.623533
========= sending heartbeat at 2023-03-08 20:11:56.637927
========= sending heartbeat at 2023-03-08 20:12:06.651695
========= sending heartbeat at 2023-03-08 20:12:16.665803
========= sending heartbeat at 2023-03-08 20:12:26.680004
========= sending heartbeat at 2023-03-08 20:12:36.696323
========= sending heartbeat at 2023-03-08 20:12:46.705785
========= sending heartbeat at 2023-03-08 20:12:56.722187
========= sending heartbeat at 2023-03-08 20:13:06.738761
========= sending heartbeat at 2023-03-08 20:13:16.755677
========= sending heartbeat at 2023-03-08 20:13:26.771139
========= sending heartbeat at 2023-03-08 20:13:36.788461
========= sending heartbeat at 2023-03-08 20:13:46.804818
========= sending heartbeat at 2023-03-08 20:13:56.821816
========= sending heartbeat at 2023-03-08 20:14:06.838683
========= sending heartbeat at 2023-03-08 20:14:16.855677
========= sending heartbeat at 2023-03-08 20:14:26.871883
========= sending heartbeat at 2023-03-08 20:14:36.886186
========= sending heartbeat at 2023-03-08 20:14:46.900246
========= sending heartbeat at 2023-03-08 20:14:56.914798
========= sending heartbeat at 2023-03-08 20:15:06.929042
========= sending heartbeat at 2023-03-08 20:15:16.945542
========= sending heartbeat at 2023-03-08 20:15:26.961950
========= sending heartbeat at 2023-03-08 20:15:36.978811
========= sending heartbeat at 2023-03-08 20:15:46.995804
========= sending heartbeat at 2023-03-08 20:15:57.012236
========= sending heartbeat at 2023-03-08 20:16:07.028823
========= sending heartbeat at 2023-03-08 20:16:17.045824
========= sending heartbeat at 2023-03-08 20:16:27.062265
========= sending heartbeat at 2023-03-08 20:16:37.079052
========= sending heartbeat at 2023-03-08 20:16:47.095463
========= sending heartbeat at 2023-03-08 20:16:57.111741
========= sending heartbeat at 2023-03-08 20:17:07.126049
========= sending heartbeat at 2023-03-08 20:17:17.139791
========= sending heartbeat at 2023-03-08 20:17:27.153717
========= sending heartbeat at 2023-03-08 20:17:37.169939
========= sending heartbeat at 2023-03-08 20:17:47.186476
========= sending heartbeat at 2023-03-08 20:17:57.203211
========= sending heartbeat at 2023-03-08 20:18:07.219523
========= sending heartbeat at 2023-03-08 20:18:17.236387
========= sending heartbeat at 2023-03-08 20:18:27.253339
========= sending heartbeat at 2023-03-08 20:18:37.269898
========= sending heartbeat at 2023-03-08 20:18:47.286002
========= sending heartbeat at 2023-03-08 20:18:57.302962
========= sending heartbeat at 2023-03-08 20:19:07.319332
========= sending heartbeat at 2023-03-08 20:19:17.333551
========= sending heartbeat at 2023-03-08 20:19:27.347517
========= sending heartbeat at 2023-03-08 20:19:37.361690
========= sending heartbeat at 2023-03-08 20:19:47.376914
========= sending heartbeat at 2023-03-08 20:19:57.394350
========= sending heartbeat at 2023-03-08 20:20:07.411033
========= sending heartbeat at 2023-03-08 20:20:17.418587
========= sending heartbeat at 2023-03-08 20:20:27.433773
========= sending heartbeat at 2023-03-08 20:20:37.441334
========= sending heartbeat at 2023-03-08 20:20:47.457317
========= sending heartbeat at 2023-03-08 20:20:57.473880
========= sending heartbeat at 2023-03-08 20:21:07.490179
========= sending heartbeat at 2023-03-08 20:21:17.506729
========= sending heartbeat at 2023-03-08 20:21:27.522913
========= sending heartbeat at 2023-03-08 20:21:37.540165
========= sending heartbeat at 2023-03-08 20:21:47.553908
========= sending heartbeat at 2023-03-08 20:21:57.567816
========= sending heartbeat at 2023-03-08 20:22:07.582160
========= sending heartbeat at 2023-03-08 20:22:17.596776
========= sending heartbeat at 2023-03-08 20:22:27.612388
========= sending heartbeat at 2023-03-08 20:22:37.628973
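
For anyone wanting to check a long log like this for missed heartbeats without reading every line, here is a minimal Python sketch. It assumes only the line format shown above; the function name `heartbeat_gaps` and the 30-second threshold are illustrative choices and not part of CryoSPARC. The heartbeats above arrive roughly every 10 seconds, so any interval well above that would stand out.

```python
import re
from datetime import datetime

# Matches the timestamp in lines such as:
# ========= sending heartbeat at 2023-03-08 20:22:37.628973
HEARTBEAT = re.compile(
    r"sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)

def heartbeat_gaps(log_text, threshold_s=30.0):
    """Return (timestamps, gaps) for the heartbeat lines found in log_text.

    gaps lists (previous, current, seconds) for every pair of consecutive
    heartbeats that are more than threshold_s apart. The threshold is an
    assumed value chosen for illustration.
    """
    times = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        for m in HEARTBEAT.finditer(log_text)
    ]
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = (cur - prev).total_seconds()
        if delta > threshold_s:
            gaps.append((prev, cur, delta))
    return times, gaps

# The last three heartbeat lines from the log above.
sample = """\
========= sending heartbeat at 2023-03-08 20:22:17.596776
========= sending heartbeat at 2023-03-08 20:22:27.612388
========= sending heartbeat at 2023-03-08 20:22:37.628973
"""
times, gaps = heartbeat_gaps(sample)
print(f"{len(times)} heartbeats, last at {times[-1]}, {len(gaps)} gap(s) over threshold")
```

To run it on a full log, read the job log file into a string and pass that string in place of `sample`.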

@cms219 We would like to investigate the possibility that, at the time of the failure, an error message is emitted that does not get captured by the CryoSPARC logs. For this purpose, please:

  • clone the job that previously failed and queue the clone.
  • kill the job immediately once it starts running. Note the job’s id.
  • logged on as the Linux user that runs CryoSPARC, create a screen or tmux session on the rcgpu03 computer. Running the command below inside either screen or tmux reduces the risk that terminal output of a long-running job would be lost due to a network disruption.
  • in a directory with write access, run (substitute the job id you noted above for JXY)
    /opt/cryosparc/cryosparc_worker/bin/cryosparcw run --project P6 --job JXY --master_hostname rcgpu03.rc-harwell.ac.uk --master_command_core_port 39002 > P6_JXY_test.out 2> P6_JXY_test.err
    
  • note down any messages printed to the terminal when the job fails. (A successful job should not print anything to the terminal; it should only write to P6_JXY_test.out or, possibly, to P6_JXY_test.err.)
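
To avoid a typo when substituting the job id, the command from the steps above can be assembled with a short Python sketch like the one below. It only prints the command line and does not run anything. The helper name `build_test_command` and the job id `J80` are hypothetical and used purely for illustration; the paths, hostname, and port are copied from the command above.

```python
import shlex

def build_test_command(job_id, project="P6"):
    """Return the shell command line from the steps above for one job id."""
    args = [
        "/opt/cryosparc/cryosparc_worker/bin/cryosparcw", "run",
        "--project", project,
        "--job", job_id,
        "--master_hostname", "rcgpu03.rc-harwell.ac.uk",
        "--master_command_core_port", "39002",
    ]
    # stdout and stderr go to separate files named after project and job.
    prefix = f"{project}_{job_id}_test"
    return f"{shlex.join(args)} > {prefix}.out 2> {prefix}.err"

# "J80" is a placeholder; use the id of the cloned job you noted earlier.
print(build_test_command("J80"))
```

The printed line can then be pasted into the screen or tmux session.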

Please let us know your observations.