I’m currently running into an error while trying to run Helical Refinement and it’s not entirely clear to me what is happening. During the first iteration, my job(s) terminate with the following error:
Traceback (most recent call last):
File "/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
run_old(*args, **kw)
File "/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2702, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2755, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1619, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.find_best_pose_shift_class
File "<array_function internals>", line 5, in unravel_index
ValueError: index 55 is out of bounds for array with size 32
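For what it’s worth, this looks like the generic error numpy’s unravel_index raises when the flat index it is handed is larger than the array shape being unraveled. A minimal illustration (plain numpy, not CryoSPARC code) reproduces the same message:

```python
import numpy as np

# unravel_index maps a flat index back to coordinates in an array of a given shape.
# A (4, 8) array has only 32 elements, so a flat index of 55 cannot be mapped.
print(np.unravel_index(10, (4, 8)))  # fine: (1, 2)
print(np.unravel_index(55, (4, 8)))  # ValueError: index 55 is out of bounds for array with size 32
```

So it seems like find_best_pose_shift_class computed a best-scoring index that doesn’t fit the pose/shift/class grid it allocated, but I don’t know what would cause that.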
I have looked through several of the other troubleshooting threads for ‘ValueError’ crashes but haven’t had much success diagnosing my problem from them. Similar jobs (from which this one was cloned) ran just last week without issue. We are currently on v4.4.1 (I’m not sure of the specific patch), and I’m waiting for our cluster manager to find time to upgrade to v4.5, likely later this week.
From the job log it looks like there might be a problem with either our Python install or CUDA itself, but I’m not savvy enough to know where to go from here:
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.8.so' from LD_PRELOAD cannot be preloaded: ignored.
[the LD_PRELOAD error above is printed 9 times in total]
================= CRYOSPARCW ======= 2024-05-13 10:18:51.856285 =========
Project P15 Job J462
Master reichow-cs.local Port 39002
========= monitor process now starting main process at 2024-05-13 10:18:51.856381
MAINPROCESS PID 20872
MAIN PID 20872
helix.run_refine cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat at 2024-05-13 10:19:21.063516
[... "sending heartbeat" messages repeat every ~10 s until 2024-05-13 10:43:26.061667 ...]
gpufft: creating new cufft plan (plan id 0 pid 20872)
gpu_id 0
ndims 2
dims 288 288 0
inembed 288 288 0
istride 1
idist 82944
onembed 288 288 0
ostride 1
odist 82944
batch 500
type C2C
wkspc automatic
Python traceback:
Running job J462 of type helix_refine
Running job on hostname %s reichow-cs
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'reichow-cs', 'lane': 'reichow-cs', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/home/exacloud/gscratch/reichowlab/cryosparc_cache', 'cache_quota_mb': 1000000, 'cache_reserve_mb': 10000, 'custom_var_names': , 'custom_vars': {}, 'desc': None, 'hostname': 'reichow-cs', 'lane': 'reichow-cs', 'name': 'reichow-cs', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=gpu\n#SBATCH --account=reichowlab\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH -N 1\n#SBATCH --qos=normal\n#SBATCH --mem=100G\n#SBATCH -n {{num_cpu}}\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --gres=gpu:{{num_gpu}}\n#SBATCH --time=7-0\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'reichow-cs', 'tpl_vars': ['project_uid', 'job_log_path_abs', 'job_dir_abs', 'num_cpu', 'job_uid', 'cluster_job_id', 'num_gpu', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
========= sending heartbeat at 2024-05-13 10:43:36.080836
========= sending heartbeat at 2024-05-13 10:43:46.102210
========= sending heartbeat at 2024-05-13 10:43:56.122842
========= sending heartbeat at 2024-05-13 10:44:06.143815
========= sending heartbeat at 2024-05-13 10:44:16.155848
========= sending heartbeat at 2024-05-13 10:44:26.175080
gpufft: creating new cufft plan (plan id 1 pid 20872)
gpu_id 0
ndims 3
dims 288 288 288
inembed 288 288 290
istride 1
idist 24053760
onembed 288 288 145
ostride 1
odist 12026880
batch 1
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2024-05-13 10:44:36.195343
========= sending heartbeat at 2024-05-13 10:44:46.214230
gpufft: creating new cufft plan (plan id 2 pid 20872)
gpu_id 0
ndims 2
dims 288 288 0
inembed 288 290 0
istride 1
idist 83520
onembed 288 145 0
ostride 1
odist 41760
batch 458
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2024-05-13 10:44:56.227950
gpufft: creating new cufft plan (plan id 3 pid 20872)
gpu_id 0
ndims 2
dims 288 288 0
inembed 288 290 0
istride 1
idist 83520
onembed 288 145 0
ostride 1
odist 41760
batch 500
type R2C
wkspc automatic
Python traceback:
**custom thread exception hook caught something
**** handle exception rc
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/cryosparc_compute/jobs/motioncorrection/mic_utils.py:95: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See Deprecation Notices — Numba 0+untagged.2155.g9ce83ef.dirty documentation for details.
@jit(nogil=True)
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/cryosparc_compute/micrographs.py:563: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See Deprecation Notices — Numba 0+untagged.2155.g9ce83ef.dirty documentation for details.
def contrast_normalization(arr_bin, tile_size = 128):
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/cryosparc_compute/alignment.py:216: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
NET.srcmapf_gpu.free()
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
self._target(*self._args, **self._kwargs)
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/cryosparc_compute/plotutil.py:565: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py:870: RuntimeWarning: invalid value encountered in multiply
self._target(*self._args, **self._kwargs)
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/dispatcher.py:538: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py:870: RuntimeWarning: invalid value encountered in multiply
self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
File "/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
run_old(*args, **kw)
File "/home/exacloud/gscratch/reichowlab/local/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2702, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2755, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1619, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.find_best_pose_shift_class
File "<array_function internals>", line 5, in unravel_index
ValueError: index 55 is out of bounds for array with size 32
set status to failed
========= main process now complete at 2024-05-13 10:45:00.800318.
========= monitor process now complete at 2024-05-13 10:45:00.806254.
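As far as I can tell, the most suspicious lines are the LD_PRELOAD errors at the top of the log: they point at a libpython under /usr/local/cryosparc/cryosparc_master, while our worker lives under /home/exacloud/gscratch/reichowlab/local/cryosparc_worker, so that file may simply not exist on the compute nodes. A quick check I could run on a worker node (just a sketch, using the path copied from the error message above):

```python
import os

# Path taken verbatim from the LD_PRELOAD error at the top of the job log.
preload = ("/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/"
           "cryosparc_master_env/lib/libpython3.8.so")
print(preload, "exists" if os.path.exists(preload) else "is missing on this node")
```

Does that seem like the right direction, or are the preload warnings a red herring and the unravel_index failure something else entirely?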
Thank you