Stalled RBMC jobs?

Hi,

I have previously run RBMC jobs without problems, but now they seem to be stalling out, and I can’t figure out why. I am running RBMC jobs on subsets of a larger dataset for which I previously ran RBMC successfully. They all run fine to the stage of estimating motion hyperparameters.

One job then stalls at the dose-weight estimation stage (getting partway through and then printing nothing but “sending heartbeat” messages), and the other does the same at the motion-correct-particles step, getting 90% of the way through and then stalling out.

I have tried killing and restarting the jobs and so far it seems reproducible. Thoughts? Here is part of the joblog for the job that stalls at the motion-correct-particles step:

refmotion worker 0 (NVIDIA GeForce RTX 3090)
BFGS iterations:      55
scale (alpha):        0.092566
noise model (sigma2): 57.930584
     TIME (s)  SECTION
  0.000080141  sanity
  9.548886196  read movie
  0.022416273  get gain, defects
  0.026118669  read bg
  0.019918833  read rigid
  0.892744111  prep_movie
  0.562088867  extract from frames
  0.000181282  extract from refs
  0.000000190  adj
  0.000000120  bfactor
  0.029171596  rigid motion correct
  0.000290954  get noise, scale
  0.025338349  optimize trajectory
  0.067452653  shift_sum patches
  0.028979984  ifft
  0.000201022  unpad
  0.000075901  fill out dataset
  0.001446938  write output files
 11.225392080  --- TOTAL ---

/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cfdd3a0> (size 3). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cf7e4c0> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cf7e460> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909ca97610> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cbbe580> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cfdda90> (size 3). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cf7e460> (size 3). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f90b50aaca0> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909ca97730> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f90b506ae20> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cbbe3d0> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cfdd430> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f90ccdf49d0> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f90b48a1bb0> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909cfdda60> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909d5c63d0> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f909e8d3640> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f90b4b16a00> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead. 
  self._target(*self._args, **self._kwargs)
========= sending heartbeat at 2023-11-30 02:08:20.709730
========= sending heartbeat at 2023-11-30 02:08:30.727057
========= sending heartbeat at 2023-11-30 02:08:40.743080
========= sending heartbeat at 2023-11-30 02:08:50.759071
========= sending heartbeat at 2023-11-30 02:09:00.775034
========= sending heartbeat at 2023-11-30 02:09:10.790033
========= sending heartbeat at 2023-11-30 02:09:20.806038
========= sending heartbeat at 2023-11-30 02:09:30.822404

refmotion worker 2 (NVIDIA GeForce RTX 3090)
BFGS iterations:      148
scale (alpha):        0.081300
noise model (sigma2): 58.623901
     TIME (s)  SECTION
  0.000079441  sanity
 10.905865442  read movie
  0.018212542  get gain, defects
  0.023676769  read bg
  0.000432995  read rigid
  0.799730807  prep_movie
  1.709923751  extract from frames
  0.011128406  extract from refs
  0.000042221  adj
  0.000000030  bfactor
  3.044188999  rigid motion correct
  0.056932615  get noise, scale
 69.468694558  optimize trajectory
  0.875048076  shift_sum patches
  0.009892401  ifft
  0.012079587  unpad
  0.000074771  fill out dataset
  0.024842503  write output files
 86.960845912  --- TOTAL ---

========= sending heartbeat at 2023-11-30 02:09:40.828734
========= sending heartbeat at 2023-11-30 02:09:50.854572
========= sending heartbeat at 2023-11-30 02:10:00.864866
========= sending heartbeat at 2023-11-30 02:10:10.882144
========= sending heartbeat at 2023-11-30 02:10:20.899354
========= sending heartbeat at 2023-11-30 02:10:30.910133
========= sending heartbeat at 2023-11-30 02:10:40.955937
========= sending heartbeat at 2023-11-30 02:10:50.972382
========= sending heartbeat at 2023-11-30 02:11:00.988478
========= sending heartbeat at 2023-11-30 02:11:10.996304
========= sending heartbeat at 2023-11-30 02:11:21.006176
========= sending heartbeat at 2023-11-30 02:11:31.022453
========= sending heartbeat at 2023-11-30 02:11:41.030690
========= sending heartbeat at 2023-11-30 02:11:51.046836
========= sending heartbeat at 2023-11-30 02:12:01.056730
========= sending heartbeat at 2023-11-30 02:12:11.067822
========= sending heartbeat at 2023-11-30 02:12:21.078090
========= sending heartbeat at 2023-11-30 02:12:31.094232
========= sending heartbeat at 2023-11-30 02:12:41.110630
========= sending heartbeat at 2023-11-30 02:12:51.127007
========= sending heartbeat at 2023-11-30 02:13:01.143423
========= sending heartbeat at 2023-11-30 02:13:11.159257
========= sending heartbeat at 2023-11-30 02:13:21.178199
========= sending heartbeat at 2023-11-30 02:13:31.194548
========= sending heartbeat at 2023-11-30 02:13:41.203154
========= sending heartbeat at 2023-11-30 02:13:51.219666
========= sending heartbeat at 2023-11-30 02:14:01.235715
========= sending heartbeat at 2023-11-30 02:14:11.245820
========= sending heartbeat at 2023-11-30 02:14:21.264506
========= sending heartbeat at 2023-11-30 02:14:31.280803
========= sending heartbeat at 2023-11-30 02:14:41.297289
========= sending heartbeat at 2023-11-30 02:14:51.313997
========= sending heartbeat at 2023-11-30 02:15:01.330050
========= sending heartbeat at 2023-11-30 02:15:11.346634
========= sending heartbeat at 2023-11-30 02:15:21.364130
========= sending heartbeat at 2023-11-30 02:15:31.372194
========= sending heartbeat at 2023-11-30 02:15:41.379104
========= sending heartbeat at 2023-11-30 02:15:51.395700
========= sending heartbeat at 2023-11-30 02:16:01.412183
========= sending heartbeat at 2023-11-30 02:16:11.428304
========= sending heartbeat at 2023-11-30 02:16:21.445758
========= sending heartbeat at 2023-11-30 02:16:31.463648
========= sending heartbeat at 2023-11-30 02:16:41.475053
========= sending heartbeat at 2023-11-30 02:16:51.481863
========= sending heartbeat at 2023-11-30 02:17:01.498655
========= sending heartbeat at 2023-11-30 02:17:11.515850
========= sending heartbeat at 2023-11-30 02:17:21.532620
========= sending heartbeat at 2023-11-30 02:17:31.549033
========= sending heartbeat at 2023-11-30 02:17:41.619394
========= sending heartbeat at 2023-11-30 02:17:51.627667
========= sending heartbeat at 2023-11-30 02:18:01.643631
========= sending heartbeat at 2023-11-30 02:18:11.661498
========= sending heartbeat at 2023-11-30 02:18:21.679504
========= sending heartbeat at 2023-11-30 02:18:31.695779
========= sending heartbeat at 2023-11-30 02:18:41.712237
========= sending heartbeat at 2023-11-30 02:18:51.728461
========= sending heartbeat at 2023-11-30 02:19:01.745011
========= sending heartbeat at 2023-11-30 02:19:11.763211
========= sending heartbeat at 2023-11-30 02:19:21.778779
========= sending heartbeat at 2023-11-30 02:19:31.795715
========= sending heartbeat at 2023-11-30 02:19:41.812123
========= sending heartbeat at 2023-11-30 02:19:51.828688
========= sending heartbeat at 2023-11-30 02:20:01.844835
========= sending heartbeat at 2023-11-30 02:20:11.863007
========= sending heartbeat at 2023-11-30 02:20:21.871119
========= sending heartbeat at 2023-11-30 02:20:31.878390
========= sending heartbeat at 2023-11-30 02:20:41.894805
========= sending heartbeat at 2023-11-30 02:20:51.905530
========= sending heartbeat at 2023-11-30 02:21:01.922219
========= sending heartbeat at 2023-11-30 02:21:12.015110
========= sending heartbeat at 2023-11-30 02:21:22.032972
========= sending heartbeat at 2023-11-30 02:21:32.049534
========= sending heartbeat at 2023-11-30 02:21:42.065438
========= sending heartbeat at 2023-11-30 02:21:52.081849
========= sending heartbeat at 2023-11-30 02:22:02.097618
========= sending heartbeat at 2023-11-30 02:22:12.117476
========= sending heartbeat at 2023-11-30 02:22:22.134201
========= sending heartbeat at 2023-11-30 02:22:32.151864
========= sending heartbeat at 2023-11-30 02:22:42.168232
========= sending heartbeat at 2023-11-30 02:22:52.184874
========= sending heartbeat at 2023-11-30 02:23:02.192862
========= sending heartbeat at 2023-11-30 02:23:12.200698
========= sending heartbeat at 2023-11-30 02:23:22.211809
========= sending heartbeat at 2023-11-30 02:23:32.223875
========= sending heartbeat at 2023-11-30 02:23:42.231270
========= sending heartbeat at 2023-11-30 02:23:52.247779
========= sending heartbeat at 2023-11-30 02:24:02.263889
========= sending heartbeat at 2023-11-30 02:24:12.280722
========= sending heartbeat at 2023-11-30 02:24:22.297260
========= sending heartbeat at 2023-11-30 02:24:32.308644
========= sending heartbeat at 2023-11-30 02:24:42.324481
========= sending heartbeat at 2023-11-30 02:24:52.340723
========= sending heartbeat at 2023-11-30 02:25:02.357111
========= sending heartbeat at 2023-11-30 02:25:12.364100
========= sending heartbeat at 2023-11-30 02:25:22.376419
========= sending heartbeat at 2023-11-30 02:25:32.385214
========= sending heartbeat at 2023-11-30 02:25:42.401021
========= sending heartbeat at 2023-11-30 02:25:52.413298
========= sending heartbeat at 2023-11-30 02:26:02.425600
========= sending heartbeat at 2023-11-30 02:26:12.437911
========= sending heartbeat at 2023-11-30 02:26:22.452014
========= sending heartbeat at 2023-11-30 02:26:32.469572
========= sending heartbeat at 2023-11-30 02:26:42.484982
========= sending heartbeat at 2023-11-30 02:26:52.501202
========= sending heartbeat at 2023-11-30 02:27:02.510597
========= sending heartbeat at 2023-11-30 02:27:12.518179
========= sending heartbeat at 2023-11-30 02:27:22.533824
========= sending heartbeat at 2023-11-30 02:27:32.549864
========= sending heartbeat at 2023-11-30 02:27:42.566476
========= sending heartbeat at 2023-11-30 02:27:52.583094
========= sending heartbeat at 2023-11-30 02:28:02.600788
========= sending heartbeat at 2023-11-30 02:28:12.618841
========= sending heartbeat at 2023-11-30 02:28:22.635472
========= sending heartbeat at 2023-11-30 02:28:32.644542
========= sending heartbeat at 2023-11-30 02:28:42.660543
========= sending heartbeat at 2023-11-30 02:28:52.676920
========= sending heartbeat at 2023-11-30 02:29:02.693204
========= sending heartbeat at 2023-11-30 02:29:12.710159
========= sending heartbeat at 2023-11-30 02:29:22.719698
========= sending heartbeat at 2023-11-30 02:29:32.726635
========= sending heartbeat at 2023-11-30 02:29:42.743262
========= sending heartbeat at 2023-11-30 02:29:52.759361
========= sending heartbeat at 2023-11-30 02:30:02.769073
========= sending heartbeat at 2023-11-30 02:30:12.786427
========= sending heartbeat at 2023-11-30 02:30:22.795118
========= sending heartbeat at 2023-11-30 02:30:32.810124
========= sending heartbeat at 2023-11-30 02:30:42.826373
========= sending heartbeat at 2023-11-30 02:30:52.842840
========= sending heartbeat at 2023-11-30 02:31:02.850101
========= sending heartbeat at 2023-11-30 02:31:12.867265
========= sending heartbeat at 2023-11-30 02:31:22.883070
========= sending heartbeat at 2023-11-30 02:31:32.892516
========= sending heartbeat at 2023-11-30 02:31:42.908552
========= sending heartbeat at 2023-11-30 02:31:52.924722
========= sending heartbeat at 2023-11-30 02:32:02.940708
========= sending heartbeat at 2023-11-30 02:32:12.957600
========= sending heartbeat at 2023-11-30 02:32:22.974048
========= sending heartbeat at 2023-11-30 02:32:32.989806
========= sending heartbeat at 2023-11-30 02:32:43.004049
========= sending heartbeat at 2023-11-30 02:32:53.019889
========= sending heartbeat at 2023-11-30 02:33:03.035913
========= sending heartbeat at 2023-11-30 02:33:13.053145
========= sending heartbeat at 2023-11-30 02:33:23.061011
========= sending heartbeat at 2023-11-30 02:33:33.077720
========= sending heartbeat at 2023-11-30 02:33:43.093810
========= sending heartbeat at 2023-11-30 02:33:53.110168
========= sending heartbeat at 2023-11-30 02:34:03.119291
========= sending heartbeat at 2023-11-30 02:34:13.136220
========= sending heartbeat at 2023-11-30 02:34:23.144144
========= sending heartbeat at 2023-11-30 02:34:33.161149
========= sending heartbeat at 2023-11-30 02:34:43.169490
========= sending heartbeat at 2023-11-30 02:34:53.186135
========= sending heartbeat at 2023-11-30 02:35:03.227365
========= sending heartbeat at 2023-11-30 02:35:13.242282
========= sending heartbeat at 2023-11-30 02:35:23.257962
========= sending heartbeat at 2023-11-30 02:35:33.274338
========= sending heartbeat at 2023-11-30 02:35:43.291171
========= sending heartbeat at 2023-11-30 02:35:53.307337
========= sending heartbeat at 2023-11-30 02:36:03.322874
========= sending heartbeat at 2023-11-30 02:36:13.339889
========= sending heartbeat at 2023-11-30 02:36:23.355875
========= sending heartbeat at 2023-11-30 02:36:33.372014
========= sending heartbeat at 2023-11-30 02:36:43.387926
========= sending heartbeat at 2023-11-30 02:36:53.404640
========= sending heartbeat at 2023-11-30 02:37:03.420912
========= sending heartbeat at 2023-11-30 02:37:13.488806
========= sending heartbeat at 2023-11-30 02:37:23.505033
========= sending heartbeat at 2023-11-30 02:37:33.521045
========= sending heartbeat at 2023-11-30 02:37:43.538700
========= sending heartbeat at 2023-11-30 02:37:53.555159
========= sending heartbeat at 2023-11-30 02:38:03.562826
========= sending heartbeat at 2023-11-30 02:38:13.572811
========= sending heartbeat at 2023-11-30 02:38:23.589276
========= sending heartbeat at 2023-11-30 02:38:33.605466
========= sending heartbeat at 2023-11-30 02:38:43.621732
========= sending heartbeat at 2023-11-30 02:38:53.638729
========= sending heartbeat at 2023-11-30 02:39:03.654393
========= sending heartbeat at 2023-11-30 02:39:13.664271
========= sending heartbeat at 2023-11-30 02:39:23.676765
========= sending heartbeat at 2023-11-30 02:39:33.692646
========= sending heartbeat at 2023-11-30 02:39:43.701849
========= sending heartbeat at 2023-11-30 02:39:53.718049
========= sending heartbeat at 2023-11-30 02:40:03.733888
========= sending heartbeat at 2023-11-30 02:40:13.752182
========= sending heartbeat at 2023-11-30 02:40:23.769323
========= sending heartbeat at 2023-11-30 02:40:33.785764
========= sending heartbeat at 2023-11-30 02:40:43.801994
========= sending heartbeat at 2023-11-30 02:40:53.817922
========= sending heartbeat at 2023-11-30 02:41:03.834314
========= sending heartbeat at 2023-11-30 02:41:13.852451
========= sending heartbeat at 2023-11-30 02:41:23.870078
========= sending heartbeat at 2023-11-30 02:41:33.886588
========= sending heartbeat at 2023-11-30 02:41:43.902030
========= sending heartbeat at 2023-11-30 02:41:53.918540
========= sending heartbeat at 2023-11-30 02:42:03.937762
========= sending heartbeat at 2023-11-30 02:42:13.955371
========= sending heartbeat at 2023-11-30 02:42:23.971753
========= sending heartbeat at 2023-11-30 02:42:33.982264
========= sending heartbeat at 2023-11-30 02:42:43.993904
========= sending heartbeat at 2023-11-30 02:42:54.001125
========= sending heartbeat at 2023-11-30 02:43:04.017046
========= sending heartbeat at 2023-11-30 02:43:14.034703
========= sending heartbeat at 2023-11-30 02:43:24.052110
========= sending heartbeat at 2023-11-30 02:43:34.068597
========= sending heartbeat at 2023-11-30 02:43:44.085131
========= sending heartbeat at 2023-11-30 02:43:54.094210
========= sending heartbeat at 2023-11-30 02:44:04.101911
========= sending heartbeat at 2023-11-30 02:44:14.119887
========= sending heartbeat at 2023-11-30 02:44:24.126573
========= sending heartbeat at 2023-11-30 02:44:34.144394
========= sending heartbeat at 2023-11-30 02:44:44.160685
========= sending heartbeat at 2023-11-30 02:44:54.176496
========= sending heartbeat at 2023-11-30 02:45:04.193222
========= sending heartbeat at 2023-11-30 02:45:14.210218
========= sending heartbeat at 2023-11-30 02:45:24.226400
========= sending heartbeat at 2023-11-30 02:45:34.262869
========= sending heartbeat at 2023-11-30 02:45:44.269621
========= sending heartbeat at 2023-11-30 02:45:54.286198
========= sending heartbeat at 2023-11-30 02:46:04.302525
========= sending heartbeat at 2023-11-30 02:46:14.327062
========= sending heartbeat at 2023-11-30 02:46:24.343392
========= sending heartbeat at 2023-11-30 02:46:34.359640
========= sending heartbeat at 2023-11-30 02:46:44.377235
========= sending heartbeat at 2023-11-30 02:46:54.394319
========= sending heartbeat at 2023-11-30 02:47:04.409789
========= sending heartbeat at 2023-11-30 02:47:14.418891
========= sending heartbeat at 2023-11-30 02:47:24.436102
========= sending heartbeat at 2023-11-30 02:47:34.453098
========= sending heartbeat at 2023-11-30 02:47:44.463801
========= sending heartbeat at 2023-11-30 02:47:54.479837
========= sending heartbeat at 2023-11-30 02:48:04.495696
========= sending heartbeat at 2023-11-30 02:48:14.507535
========= sending heartbeat at 2023-11-30 02:48:24.524084
========= sending heartbeat at 2023-11-30 02:48:34.539942
========= sending heartbeat at 2023-11-30 02:48:44.556711
========= sending heartbeat at 2023-11-30 02:48:54.573542
========= sending heartbeat at 2023-11-30 02:49:04.589948
========= sending heartbeat at 2023-11-30 02:49:14.608323
========= sending heartbeat at 2023-11-30 02:49:24.617432
========= sending heartbeat at 2023-11-30 02:49:34.629482
========= sending heartbeat at 2023-11-30 02:49:44.645741
========= sending heartbeat at 2023-11-30 02:49:54.662038
========= sending heartbeat at 2023-11-30 02:50:04.670918
========= sending heartbeat at 2023-11-30 02:50:14.683125
========= sending heartbeat at 2023-11-30 02:50:24.699002
========= sending heartbeat at 2023-11-30 02:50:34.715649
========= sending heartbeat at 2023-11-30 02:50:44.732429
========= sending heartbeat at 2023-11-30 02:50:54.748892
========= sending heartbeat at 2023-11-30 02:51:04.764786
========= sending heartbeat at 2023-11-30 02:51:14.782271
========= sending heartbeat at 2023-11-30 02:51:24.799104
========= sending heartbeat at 2023-11-30 02:51:34.807943
========= sending heartbeat at 2023-11-30 02:51:44.820245
========= sending heartbeat at 2023-11-30 02:51:54.832539
========= sending heartbeat at 2023-11-30 02:52:04.849527
========= sending heartbeat at 2023-11-30 02:52:14.866774
========= sending heartbeat at 2023-11-30 02:52:24.884279
========= sending heartbeat at 2023-11-30 02:52:34.901658
========= sending heartbeat at 2023-11-30 02:52:44.918425
========= sending heartbeat at 2023-11-30 02:52:54.934657
========= sending heartbeat at 2023-11-30 02:53:04.951256
========= sending heartbeat at 2023-11-30 02:53:14.968825
========= sending heartbeat at 2023-11-30 02:53:24.985807
========= sending heartbeat at 2023-11-30 02:53:35.003367
========= sending heartbeat at 2023-11-30 02:53:45.020645
========= sending heartbeat at 2023-11-30 02:53:55.037210
========= sending heartbeat at 2023-11-30 02:54:05.053552
========= sending heartbeat at 2023-11-30 02:54:15.071106
========= sending heartbeat at 2023-11-30 02:54:25.080337
========= sending heartbeat at 2023-11-30 02:54:35.097928
========= sending heartbeat at 2023-11-30 02:54:45.114631
========= sending heartbeat at 2023-11-30 02:54:55.130766
========= sending heartbeat at 2023-11-30 02:55:05.147124
========= sending heartbeat at 2023-11-30 02:55:15.165397
========= sending heartbeat at 2023-11-30 02:55:25.181817
========= sending heartbeat at 2023-11-30 02:55:35.199327
========= sending heartbeat at 2023-11-30 02:55:45.215233
========= sending heartbeat at 2023-11-30 02:55:55.231508
========= sending heartbeat at 2023-11-30 02:56:05.248105
========= sending heartbeat at 2023-11-30 02:56:15.265385
========= sending heartbeat at 2023-11-30 02:56:25.284301
========= sending heartbeat at 2023-11-30 02:56:35.300908
========= sending heartbeat at 2023-11-30 02:56:45.318799
========= sending heartbeat at 2023-11-30 02:56:55.336565
========= sending heartbeat at 2023-11-30 02:57:05.352568
========= sending heartbeat at 2023-11-30 02:57:15.370731
========= sending heartbeat at 2023-11-30 02:57:25.387486
========= sending heartbeat at 2023-11-30 02:57:35.404261
========= sending heartbeat at 2023-11-30 02:57:45.421174
========= sending heartbeat at 2023-11-30 02:57:55.438611
========= sending heartbeat at 2023-11-30 02:58:05.455126
========= sending heartbeat at 2023-11-30 02:58:15.473219
========= sending heartbeat at 2023-11-30 02:58:25.489842
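
To put a number on a stall like the one in the log above, the unbroken run of heartbeat lines at the end of the job log can be measured. This is only a minimal sketch: it assumes the heartbeat line format shown above, and the function name `trailing_heartbeat_span` is made up for illustration and is not part of CryoSPARC.

```python
import re
from datetime import datetime

# Matches lines such as:
# ========= sending heartbeat at 2023-11-30 02:58:25.489842
HEARTBEAT = re.compile(
    r"^=+ sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def trailing_heartbeat_span(lines):
    """Return (count, seconds) for the unbroken heartbeat run that ends the log.

    Hypothetical helper: walks the log backwards, skipping blank lines,
    and stops at the first line that is not a heartbeat message.
    """
    stamps = []
    for line in reversed(lines):
        text = line.strip()
        if not text:
            continue  # blank lines do not interrupt the run
        match = HEARTBEAT.match(text)
        if match is None:
            break  # reached real worker output
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return 0, 0.0
    # stamps[0] is the newest heartbeat, stamps[-1] the oldest in the run
    return len(stamps), (stamps[0] - stamps[-1]).total_seconds()
```

Passing the lines of the job log (for example via `open(path).readlines()`, with `path` pointing at wherever the log was saved) returns how many heartbeats were printed since the last worker output and how many seconds they cover. For the excerpt above, the final run spans just under 49 minutes.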

EDIT:

Here are my compute settings if helpful (nothing else is running on the system, and the system has two 3090 GPUs)

Please can you post a screenshot of htop from the worker while the job is in this state?
Have you already tried the suggestion in the thread “CryoSPARC 4.4 significantly increases "minimum" target specification for processing systems” (post #5, by sarulthasan)?

I looked at htop and didn’t see anything particularly suspicious but will check again, and I’ll try that tip re hugepages, thanks!

I will say though that for both problem jobs it is repeatedly stalling at the same point on restarted runs - for one job, it is stalling after progressing 14% of the way through the dose-weight estimation, and for the other, it stalls at 94% of the motion-correct-particles step…

Here is htop output at stalled state:

I tried disabling “transparent hugepages” but it didn’t make any difference, at least to the currently running stalled job. (just tested - also doesn’t help on restart).

EDIT:
For comparison, here is htop during the same job prior to stalling:

Once it stalls, a single thread maxes out at 100% and stays there, with all the others at zero…

I tried increasing the oversubscription threshold so it only processes one mic per GPU, and reducing or increasing the memory fraction, same behavior.

EDIT2:

Restarting the system seems to have solved the problem (or at least it has progressed further than it did before)…

Sorry for the slow response on this, I’m glad that restarting the system helped. Is it still working correctly?

Yes as of now, but won’t be sure until the jobs complete successfully - will report back!

Hi @hsnyder,

Restarting the system allowed one of the two jobs to complete. The other one still stalls at the same point (94% of the way through motion-correct-particles). Unfortunately, as far as I can see, there is no option but to restart it from scratch; I can’t have it resume from the position where it stalled. I’d appreciate any suggestions!

Cheers
Oli

Hi @olibclarke. Other than the transparent hugepages issue, we’re not aware of anything else that causes these kinds of stalls. All I can think of is to make sure that transparent hugepages are disabled in a reboot-persistent way, and check for anything in dmesg or journalctl.
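
As a rough way to check both points, the sketch below reads the current transparent hugepages mode from `/sys/kernel/mm/transparent_hugepage/enabled` and looks for a `transparent_hugepage=` entry on the kernel command line in `/proc/cmdline`. The helper names are made up for illustration. It also assumes the persistent setting was made via a kernel boot parameter; if it was done some other way (for example a startup service), nothing will show up on the command line even though the setting survives reboots.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
KERNEL_CMDLINE = Path("/proc/cmdline")


def active_thp_mode(text):
    """Return the bracketed (active) mode from e.g. 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def cmdline_thp_setting(cmdline):
    """Return the value of transparent_hugepage=... on the kernel command line, if any."""
    for token in cmdline.split():
        if token.startswith("transparent_hugepage="):
            return token.split("=", 1)[1]
    return None


def report():
    """Print the current and boot-time THP settings where they can be read."""
    try:
        print("current mode:", active_thp_mode(THP_ENABLED.read_text()))
    except OSError:
        print("current mode: could not read", THP_ENABLED)
    try:
        print("boot parameter:", cmdline_thp_setting(KERNEL_CMDLINE.read_text()))
    except OSError:
        print("boot parameter: could not read", KERNEL_CMDLINE)


if __name__ == "__main__":
    report()
```

If the current mode is `never` but no boot parameter is reported, the setting may have been applied only for the running session and could revert after the next reboot.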