Hello Everyone, I am using v.4.7 on a local workstation Running rocky linux v 9.2. I’m encountering an issue where jobs are timing out after 180 seconds during classification and refinement.
Aside from the timeout, the interface lags severely --even when working directly from the local machine. I have TransparentHugepages set to never, and I have tried setting them to madvise.
I also monitor GPU memory usage with nvidia-smi
and a job never takes more than about 25% of the GPU.
I followed this thread, and was unable to find a solution, despite the symptoms appearing to be silmilar
Here is the tail end of the event log from the run:
[CPU: 5.55 GB Avail: 434.22 GB]
Clearing previous iteration...
[CPU: 5.55 GB Avail: 434.22 GB]
Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_particles.cs
[CPU: 5.55 GB Avail: 434.22 GB]
Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_class_averages.cs
[CPU: 5.55 GB Avail: 434.22 GB]
Deleting last_output_file_path_abs: /data/cryosparc-database/CS-cob/J695/J695_006_class_averages.mrc
[CPU: 5.55 GB Avail: 434.22 GB]
Removed output results for P3 J695
[CPU: 5.56 GB Avail: 434.22 GB]
Start of Iteration 8
[CPU: 5.56 GB Avail: 434.22 GB]
-- DEV 3 THR 0 NUM 3500 TOTAL 18.896414 ELAPSED 168.12674 --
[CPU: 5.57 GB Avail: 434.25 GB]
Finished engine iteration 8 in 171.522s
[CPU: 5.57 GB Avail: 434.25 GB]
-- Effective number of classes per image: min 48.65 | 25-pct 249.64 | median 265.54 | 75-pct 277.14 | max 298.63
[CPU: 5.57 GB Avail: 434.25 GB]
-- Probability of best class per image: min 0.00 | 25-pct 0.01 | median 0.01 | 75-pct 0.01 | max 0.10
**** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
Job is unresponsive - no heartbeat received in 180 seconds.
HERE IS THE job.log for a 2D classification that failed
========= sending heartbeat at 2025-04-28 12:37:40.059044
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(735): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(35): warning #68-D: integer conversion resulted in a change of sign
kernel(44): warning #68-D: integer conversion resulted in a change of sign
kernel(17): warning #177-D: variable "N_I" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(735): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(735): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(735): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(735): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(100): warning #177-D: variable "zint" was declared but never referenced
kernel(325): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/home/cryosparc/cryosparc-install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(868): warning #177-D: variable "pos" was declared but never referenced
warnings.warn(msg)
========= sending heartbeat at 2025-04-28 12:37:50.077168
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:40:00.391794
========= sending heartbeat at 2025-04-28 12:40:10.400795
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:40:20.419553
========= sending heartbeat at 2025-04-28 12:40:30.437451
========= sending heartbeat at 2025-04-28 12:40:40.455302
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:40:50.473320
========= sending heartbeat at 2025-04-28 12:41:00.491223
========= sending heartbeat at 2025-04-28 12:41:10.509190
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:41:20.527091
========= sending heartbeat at 2025-04-28 12:41:30.545132
========= sending heartbeat at 2025-04-28 12:41:40.563796
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:41:50.580791
========= sending heartbeat at 2025-04-28 12:42:00.598797
<string>:1: RuntimeWarning: invalid value encountered in true_divide
/home/cryosparc/cryosparc-install/cryosparc_worker/cryosparc_compute/plotutil.py:2014: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
fig = plt.figure(figsize=figsize)
========= sending heartbeat at 2025-04-28 12:44:07.376818
========= sending heartbeat at 2025-04-28 12:44:17.394968
========= sending heartbeat at 2025-04-28 12:44:27.408821
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:44:37.426401
========= sending heartbeat at 2025-04-28 12:44:47.444354
========= sending heartbeat at 2025-04-28 12:44:57.462185
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:45:07.480835
========= sending heartbeat at 2025-04-28 12:45:17.499240
========= sending heartbeat at 2025-04-28 12:45:27.517051
========= sending heartbeat at 2025-04-28 12:45:37.528061
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:45:47.546079
========= sending heartbeat at 2025-04-28 12:45:57.563980
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:46:07.581781
========= sending heartbeat at 2025-04-28 12:46:17.594770
========= sending heartbeat at 2025-04-28 12:46:27.612483
========= sending heartbeat at 2025-04-28 12:46:37.626798
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:46:47.645372
========= sending heartbeat at 2025-04-28 12:46:57.662787
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:47:07.680905
========= sending heartbeat at 2025-04-28 12:47:17.698876
========= sending heartbeat at 2025-04-28 12:47:27.716790
========= sending heartbeat at 2025-04-28 12:47:37.736141
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:47:47.754060
========= sending heartbeat at 2025-04-28 12:47:57.772489
========= sending heartbeat at 2025-04-28 12:48:07.790505
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:50:16.095664
========= sending heartbeat at 2025-04-28 12:50:26.113785
========= sending heartbeat at 2025-04-28 12:50:36.131939
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:50:46.149782
========= sending heartbeat at 2025-04-28 12:50:56.167789
========= sending heartbeat at 2025-04-28 12:51:06.186788
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-04-28 12:51:16.203804
========= sending heartbeat at 2025-04-28 12:51:26.221985
========= sending heartbeat at 2025-04-28 12:51:36.230784
========= sending heartbeat at 2025-04-28 12:51:46.247780
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-04-28 12:51:56.265719
Terminated
Here is information on the machine I’m using
cryosparcm cli "get_job('P3', 'J695', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
{'_id': '680ebbba0228303d4f08027d', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '429.29GB', 'cpu_model': 'AMD Ryzen Threadripper PRO 5975WX 32-Cores', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:01:00'}, {'id': 1, 'mem': 21156200448, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:02:00'}, {'id': 2, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:41:00'}, {'id': 3, 'mem': 21158428672, 'name': 'NVIDIA RTX A4500', 'pcie': '0000:42:00'}], 'ofd_hard_limit': 524288, 'ofd_soft_limit': 1024, 'physical_cores': 32, 'platform_architecture': 'x86_64', 'platform_node': 'taken-out', 'platform_release': '5.14.0-284.25.1.el9_2.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023', 'total_memory': '440.02GB', 'used_memory': '6.71GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 300}, 'class2D_max_res': {'value': 15}, 'class2D_num_full_iter_batch': {'value': 50}, 'class2D_num_full_iter_batchsize_per_class': {'value': 200}, 'class2D_sigma_init_factor': {'value': 4}, 'class2D_window_inner_A': {'value': 320}, 'compute_num_gpus': {'value': 4}}, 'project_uid': 'P3', 'started_at': 'Mon, 28 Apr 2025 16:37:19 GMT', 'status': 'failed', 'uid': 'J695', 'version': 'v4.7.0'}