Ab-Initio crashes abnormally

Hiya,

I am combining two EER datasets with the same pixel size and dose. All pre-processing on v4.5.3, including 2D Classification, works fine, but Ab-Initio Reconstruction always crashes at the same point.
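In case it helps, this is roughly how the combined inputs can be sanity-checked before merging. A minimal numpy sketch, assuming exported particle .cs files (which are normally plain numpy structured arrays) and the usual blob/psize_A and blob/shape field names; the file paths are placeholders:

import numpy as np

# Paths are placeholders -- point these at particle .cs files
# exported from each of the two datasets.
a = np.load("dataset1_particles.cs")
b = np.load("dataset2_particles.cs")

for name, d in (("dataset 1", a), ("dataset 2", b)):
    # Pixel size and box size should be identical across both inputs
    print(name, "pixel size (A):", np.unique(d["blob/psize_A"]))
    print(name, "box shape:", np.unique(d["blob/shape"], axis=0))

(If np.load complains about the file format, Dataset.load from cryosparc-tools reads the same files.)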

Any solutions would be appreciated.

The relevant portion of the job log is as follows:

========= sending heartbeat at 2024-07-28 15:39:57.126780
========= sending heartbeat at 2024-07-28 15:40:07.145649
gpufft: creating new cufft plan (plan id 0 pid 280864)
gpu_id 0
ndims 2
dims 80 80 0
inembed 80 80 0
istride 1
idist 6400
onembed 80 80 0
ostride 1
odist 6400
batch 10
type C2C
wkspc automatic
Python traceback:

HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
/net/flash/flash/qchen/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(35): warning #68-D: integer conversion resulted in a change of sign

kernel(44): warning #68-D: integer conversion resulted in a change of sign

kernel(17): warning #177-D: variable "N_I" was declared but never referenced

warnings.warn(msg)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: divide by zero encountered in float_scalars
run_old(*args, **kw)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: divide by zero encountered in double_scalars
run_old(*args, **kw)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: invalid value encountered in float_scalars
run_old(*args, **kw)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: invalid value encountered in double_scalars
run_old(*args, **kw)
:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: invalid value encountered in float_scalars
run_old(*args, **kw)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: invalid value encountered in double_scalars
run_old(*args, **kw)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: divide by zero encountered in float_scalars
run_old(*args, **kw)
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: divide by zero encountered in double_scalars
run_old(*args, **kw)
:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2024-07-28 15:40:17.164646
gpufft: creating new cufft plan (plan id 1 pid 280864)
gpu_id 0
ndims 2
dims 80 80 0
inembed 80 80 0
istride 1
idist 6400
onembed 80 80 0
ostride 1
odist 6400
batch 90
type C2C
wkspc automatic
Python traceback:

:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2024-07-28 15:40:27.180721
========= sending heartbeat at 2024-07-28 15:40:37.200718
========= sending heartbeat at 2024-07-28 15:40:47.218657
/net/flash/flash/qchen/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:602: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (matplotlib.pyplot.figure) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam figure.max_open_warning). Consider using matplotlib.pyplot.close().
fig = plt.figure(figsize=figsize)
========= sending heartbeat at 2024-07-28 15:40:57.247647
========= sending heartbeat at 2024-07-28 15:41:07.265648
========= sending heartbeat at 2024-07-28 15:41:17.285271
========= sending heartbeat at 2024-07-28 15:41:27.303648
========= sending heartbeat at 2024-07-28 15:41:37.321650
========= sending heartbeat at 2024-07-28 15:41:47.339705
========= sending heartbeat at 2024-07-28 15:41:57.364646
========= sending heartbeat at 2024-07-28 15:42:07.383650
========= sending heartbeat at 2024-07-28 15:42:17.401157
========= sending heartbeat at 2024-07-28 15:42:27.420954
========= sending heartbeat at 2024-07-28 15:42:37.439469
========= sending heartbeat at 2024-07-28 15:42:47.461646
========= sending heartbeat at 2024-07-28 15:42:57.480644
:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2024-07-28 15:43:07.500653
========= sending heartbeat at 2024-07-28 15:43:17.519649
========= sending heartbeat at 2024-07-28 15:43:27.538649
========= sending heartbeat at 2024-07-28 15:43:37.556649
========= sending heartbeat at 2024-07-28 15:43:47.574667
========= sending heartbeat at 2024-07-28 15:43:57.593650
========= sending heartbeat at 2024-07-28 15:44:07.612649
========= sending heartbeat at 2024-07-28 15:44:17.632118
========= sending heartbeat at 2024-07-28 15:44:27.650651
========= sending heartbeat at 2024-07-28 15:44:37.668650
========= sending heartbeat at 2024-07-28 15:44:47.687649
========= sending heartbeat at 2024-07-28 15:44:57.706649
========= sending heartbeat at 2024-07-28 15:45:07.733648
========= sending heartbeat at 2024-07-28 15:45:17.752540
:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
corrupted size vs. prev_size
/net/flash/flash/qchen/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 280864 Aborted (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"

I also have Heterogeneous Refinement jobs that crash with the same error.

Hi @qchen,

I’ll send you a direct message about this issue momentarily.

– Harris

I have had similar issues with Ab-Initio, but much more frequently with 2D Classification jobs.

@jhh1492 Please can you post the outputs of the commands

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with id of a job that should be running
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm eventlog $csprojectid $csjobid | tail -n 40

for jobs where you observed the error.
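If more than one job is affected, something along these lines can collect all three outputs per job. This is just a sketch that shells out to the same cryosparcm commands as above; the project ID, job IDs, and report file names are placeholders:

import subprocess

project = "P99"          # replace with actual project ID
jobs = ["J199", "J200"]  # replace with the ids of the jobs that failed

for job in jobs:
    query = (f"get_job('{project}', '{job}', 'job_type', 'version', "
             "'instance_information', 'status', 'params_spec', 'errors_run')")
    cmds = [
        f'cryosparcm cli "{query}"',
        f"cryosparcm joblog {project} {job} | tail -n 40",
        f"cryosparcm eventlog {project} {job} | tail -n 40",
    ]
    # One report file per job, with each command's output labelled
    with open(f"{project}_{job}_report.txt", "w") as out:
        for cmd in cmds:
            out.write(f"===== {cmd}\n")
            res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            out.write(res.stdout + res.stderr + "\n")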