No Heartbeat after 180 Seconds v4.7.0

Hello everyone, I am using CryoSPARC v4.7.0 on a local workstation running Rocky Linux 9.2. I’m encountering an issue where jobs time out after 180 seconds during classification and refinement.

Aside from the timeout, the interface lags severely, even when working directly on the local machine. I have Transparent Huge Pages (THP) set to never, and I have also tried setting it to madvise.
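For reference, this is roughly how I check and change the THP setting through sysfs (the echo does not persist across reboots):

cat /sys/kernel/mm/transparent_hugepage/enabled   # active value is shown in brackets
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled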

I also monitor GPU memory usage with nvidia-smi, and a job never uses more than about 25% of a GPU.
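For monitoring I use something along these lines (the query fields are just the ones I find convenient):

nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5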

I followed this thread but was unable to find a solution, despite the symptoms appearing to be similar.

Here is the tail end of the event log from the run:

[CPU:   5.55 GB  Avail: 434.22 GB]

Clearing previous iteration...
[CPU:   5.55 GB  Avail: 434.22 GB]

Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_particles.cs
[CPU:   5.55 GB  Avail: 434.22 GB]

Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_class_averages.cs
[CPU:   5.55 GB  Avail: 434.22 GB]

Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_class_averages.mrc
[CPU:   5.55 GB  Avail: 434.22 GB]

Removed output results for P3 J695
[CPU:   5.56 GB  Avail: 434.22 GB]

Start of Iteration 8
[CPU:   5.56 GB  Avail: 434.22 GB]

-- DEV 3 THR 0 NUM 3500 TOTAL 18.896414 ELAPSED 168.12674 --
[CPU:   5.57 GB  Avail: 434.25 GB]

Finished engine iteration 8 in 171.522s
[CPU:   5.57 GB  Avail: 434.25 GB]

  -- Effective number of classes per image: min 48.65 | 25-pct 249.64 | median 265.54 | 75-pct 277.14 | max 298.63 
[CPU:   5.57 GB  Avail: 434.25 GB]

  -- Probability of best class per image: min 0.00 | 25-pct 0.01 | median 0.01 | 75-pct 0.01 | max 0.10 

**** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****

Job is unresponsive - no heartbeat received in 180 seconds.

Here is the job.log for a 2D Classification job that failed:

========= sending heartbeat at 2025-04-28 12:37:40.059044
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(735): warning #177-D: variable "zint" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(35): warning #68-D: integer conversion resulted in a change of sign

kernel(44): warning #68-D: integer conversion resulted in a change of sign

kernel(17): warning #177-D: variable "N_I" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(735): warning #177-D: variable "zint" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(735): warning #177-D: variable "zint" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(735): warning #177-D: variable "zint" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(735): warning #177-D: variable "zint" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(100): warning #177-D: variable "zint" was declared but never referenced

kernel(325): warning #177-D: variable "zint" was declared but never referenced


  warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(868): warning #177-D: variable "pos" was declared but never referenced


  warnings.warn(msg)
========= sending heartbeat at 2025-04-28 12:37:50.077168
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:40:00.391794
========= sending heartbeat at 2025-04-28 12:40:10.400795
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:40:20.419553
========= sending heartbeat at 2025-04-28 12:40:30.437451
========= sending heartbeat at 2025-04-28 12:40:40.455302
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:40:50.473320
========= sending heartbeat at 2025-04-28 12:41:00.491223
========= sending heartbeat at 2025-04-28 12:41:10.509190
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:41:20.527091
========= sending heartbeat at 2025-04-28 12:41:30.545132
========= sending heartbeat at 2025-04-28 12:41:40.563796
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:41:50.580791
========= sending heartbeat at 2025-04-28 12:42:00.598797
<string>:1: RuntimeWarning: invalid value encountered in true_divide
/home/cryosparc/cryosparc-install/cryosparc_worker/cryosparc_compute/plotutil.py:2014: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  fig = plt.figure(figsize=figsize)
========= sending heartbeat at 2025-04-28 12:44:07.376818
========= sending heartbeat at 2025-04-28 12:44:17.394968
========= sending heartbeat at 2025-04-28 12:44:27.408821
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:44:37.426401
========= sending heartbeat at 2025-04-28 12:44:47.444354
========= sending heartbeat at 2025-04-28 12:44:57.462185
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:45:07.480835
========= sending heartbeat at 2025-04-28 12:45:17.499240
========= sending heartbeat at 2025-04-28 12:45:27.517051
========= sending heartbeat at 2025-04-28 12:45:37.528061
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:45:47.546079
========= sending heartbeat at 2025-04-28 12:45:57.563980
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:46:07.581781
========= sending heartbeat at 2025-04-28 12:46:17.594770
========= sending heartbeat at 2025-04-28 12:46:27.612483
========= sending heartbeat at 2025-04-28 12:46:37.626798
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:46:47.645372
========= sending heartbeat at 2025-04-28 12:46:57.662787
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:47:07.680905
========= sending heartbeat at 2025-04-28 12:47:17.698876
========= sending heartbeat at 2025-04-28 12:47:27.716790
========= sending heartbeat at 2025-04-28 12:47:37.736141
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:47:47.754060
========= sending heartbeat at 2025-04-28 12:47:57.772489
========= sending heartbeat at 2025-04-28 12:48:07.790505
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:50:16.095664
========= sending heartbeat at 2025-04-28 12:50:26.113785
========= sending heartbeat at 2025-04-28 12:50:36.131939
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:50:46.149782
========= sending heartbeat at 2025-04-28 12:50:56.167789
========= sending heartbeat at 2025-04-28 12:51:06.186788
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:51:16.203804
========= sending heartbeat at 2025-04-28 12:51:26.221985
========= sending heartbeat at 2025-04-28 12:51:36.230784
========= sending heartbeat at 2025-04-28 12:51:46.247780
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:51:56.265719
Terminated

Here is information on the machine I’m using:

cryosparcm cli "get_job('P3', 'J695', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
{'_id': '680ebbba0228303d4f08027d', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '429.29GB', 'cpu_model': 'AMD Ryzen Threadripper PRO 5975WX 32-Cores', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:01:00'}, {'id': 1, 'mem': 21156200448, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:02:00'}, {'id': 2, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:41:00'}, {'id': 3, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:42:00'}], 'ofd_hard_limit': 524288, 'ofd_soft_limit': 1024, 'physical_cores': 32, 'platform_architecture': 'x86_64', 'platform_node': 'taken-out', 'platform_release': '5.14.0-284.25.1.el9_2.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023', 'total_memory': '440.02GB', 'used_memory': '6.71GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 300}, 'class2D_max_res': {'value': 15}, 'class2D_num_full_iter_batch': {'value': 50}, 'class2D_num_full_iter_batchsize_per_class': {'value': 200}, 'class2D_sigma_init_factor': {'value': 4}, 'class2D_window_inner_A': {'value': 320}, 'compute_num_gpus': {'value': 4}}, 'project_uid': 'P3', 'started_at': 'Mon, 28 Apr 2025 16:37:19 GMT', 'status': 'failed', 'uid': 'J695', 'version': 'v4.7.0'}

Could you please post the outputs of these commands, run in a fresh shell:

eval $(cryosparcm env) # no output expected
host $CRYOSPARC_MASTER_HOSTNAME
cryosparcm eventlog P3 J695 | tail -n 10

@wtempel

Output for:

eval $(cryosparcm env) # no output expected
host $CRYOSPARC_MASTER_HOSTNAME
cryosparcm eventlog P3 J695 | tail -n 10

mario.twomey.jhmi.edu has address 10.99.5.35
[Mon, 28 Apr 2025 16:51:25 GMT] [CPU RAM used: 5552 MB] Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_class_averages.cs
[Mon, 28 Apr 2025 16:51:25 GMT] [CPU RAM used: 5552 MB] Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_class_averages.mrc
[Mon, 28 Apr 2025 16:51:25 GMT] [CPU RAM used: 5552 MB] Removed output results for P3 J695
[Mon, 28 Apr 2025 16:51:25 GMT] [CPU RAM used: 5557 MB] Start of Iteration 8
[Mon, 28 Apr 2025 16:51:25 GMT] [CPU RAM used: 5557 MB] -- DEV 3 THR 0 NUM 3500 TOTAL 18.896414 ELAPSED 168.12674 --
[Mon, 28 Apr 2025 16:54:16 GMT] [CPU RAM used: 5574 MB] Finished engine iteration 8 in 171.522s
[Mon, 28 Apr 2025 16:54:17 GMT] [CPU RAM used: 5575 MB]   -- Effective number of classes per image: min 48.65 | 25-pct 249.64 | median 265.54 | 75-pct 277.14 | max 298.63
[Mon, 28 Apr 2025 16:54:17 GMT] [CPU RAM used: 5575 MB]   -- Probability of best class per image: min 0.00 | 25-pct 0.01 | median 0.01 | 75-pct 0.01 | max 0.10
[Mon, 28 Apr 2025 16:56:59 GMT]  **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
[Mon, 28 Apr 2025 16:57:57 GMT]  Job is unresponsive - no heartbeat received in 180 seconds.

Thanks @LucasDillard for posting the additional information.
May I also ask

  1. Do "no heartbeat received in 180 seconds" errors tend to occur when the system (system RAM and/or CPUs) is under heavy load?
  2. What are the outputs of the commands
    free -h
    sudo journalctl | grep -i oom
    cryosparcm cli "get_scheduler_targets()"
    df -Th /path/to/cache # substitute cache_path value from previous command
    
  3. Do you have an example single-GPU job, such as a refinement job, that failed with "no heartbeat received in 180 seconds", for which you could post the following information:
    csprojectid="P99" # substitute actual project ID
    csjobid="J999" # substitute actual job ID
    cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
    cryosparcm joblog $csprojectid $csjobid | tail -n 20
    cryosparcm eventlog $csprojectid $csjobid | tail -n 20
    
  1. No. While monitoring GPU usage with nvidia-smi, I saw no significant load on the machine; GPU utilization never exceeded 25%. The issue also occurs in small classification jobs with fewer than 100k particles.

  2. Here are the outputs:

free -h

               total        used        free      shared  buff/cache   available
Mem:           440Gi        10Gi        42Gi       533Mi       390Gi       429Gi
Swap:            9Gi       155Mi       9.8Gi

cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/scr/cryosparc-scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500'}, {'id': 1, 'mem': 21156200448, 'name': 'NVIDIA RTX A4500'}, {'id': 2, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500'}, {'id': 3, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500'}], 'hostname': 'mario.twomey.jhmi.edu', 'lane': 'default', 'monitor_port': None, 'name': 'mario.twomey.jhmi.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'cryosparc@mario.twomey.jhmi.edu', 'title': 'Worker node mario.twomey.jhmi.edu', 'type': 'node', 'worker_bin_path': '/home/cryosparc/cryosparc-install/cryosparc_worker/bin/cryosparcw'}]

df -Th /scr/cryosparc-scratch

Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p1 ext4  7.0T   79G  6.6T   2% /scr

There was no output for sudo journalctl | grep -i oom

  3. I have not been able to run a refinement job because I have not yet curated particles through 2D classification due to this issue. I will run a job now and post the output here as soon as it is available.

  4. To add another symptom: when creating small jobs (such as selecting 2D classes), I receive the error "Unable to create job: NetworkError when attempting to fetch resource." The issue seems related to this, but I’m not sure why there would be a network issue on a single workstation. Perhaps the update disrupted a network configuration that I have not yet been able to locate?

GPU load may not be the only reason for the system to be “overloaded”: RAM or CPU usage, as well as the bandwidth of the connection to project directory storage, might also cause bottlenecks.

Do you also observe these errors when connecting to the CryoSPARC UI via an http://localhost:<portnumber> URL (either by running the browser directly on mario or by using SSH port forwarding)?
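For example, with SSH port forwarding from another machine (this assumes the default CryoSPARC base port of 39000; substitute your instance's actual port):

ssh -N -L 39000:localhost:39000 cryosparc@mario.twomey.jhmi.edu
# then open http://localhost:39000 in the local browser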

What is the output of the following command?

cryosparcm call env | grep -i proxy

If you are comfortable with the implications of adding mario.twomey.jhmi.edu to the end of the line inside /etc/hosts that begins with 127.0.0.1 (and mario.twomey.jhmi.edu is not mentioned elsewhere inside /etc/hosts), you may try that and see whether it makes a difference.
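For illustration only, the modified line might then look something like this (the exact hostname aliases on your system may differ):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 mario.twomey.jhmi.edu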