Hi, I’m running Ab Initio Reconstruction in CryoSPARC v4.5.3. The job keeps failing with the error message “no heartbeat received in 180 s”, even though I have reinstalled CryoSPARC twice.
In the job.log file it gives this warning:
cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: RuntimeWarning: divide by zero encountered in float_scalars
  run_old(*args, **kw)
Is there anything I could do to make it run properly?
Thanks!
Hi @fent, please can you post the outputs of these commands:
cspid="P99"
csjid="J199"
cryosparcm cli "get_job('$cspid', '$csjid', 'job_type', 'version', 'params_spec', 'instance_information', 'input_slot_groups', 'status')"
cryosparcm eventlog "$cspid" "$csjid" | head -n 40
cryosparcm eventlog "$cspid" "$csjid" | tail -n 40
cryosparcm joblog "$cspid" "$csjid" | tail -n 40
where you replace P99 and J199 with the job’s actual project and job IDs, respectively.
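As a convenience, the diagnostic commands above can be wrapped in a small script so all the outputs are gathered in one pass. This is just a sketch, not part of the original request: the `run` helper and the `DRY_RUN` toggle are additions for illustration, and a real run needs `cryosparcm` on the PATH of the CryoSPARC master node.

```shell
#!/bin/sh
# Collect the CryoSPARC diagnostics requested above in one pass.
# DRY_RUN=1 (the default here) only prints the commands; set
# DRY_RUN=0 on the master node to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

cspid="P99"    # replace with your project ID
csjid="J199"   # replace with your job ID

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run cryosparcm cli "get_job('$cspid', '$csjid', 'job_type', 'version', 'params_spec', 'instance_information', 'input_slot_groups', 'status')"
run cryosparcm eventlog "$cspid" "$csjid" | head -n 40
run cryosparcm eventlog "$cspid" "$csjid" | tail -n 40
run cryosparcm joblog "$cspid" "$csjid" | tail -n 40
```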
[Mon, 05 Aug 2024 16:41:54 GMT] License is valid.
[Mon, 05 Aug 2024 16:41:54 GMT] Launching job on lane SLURM target SLURM ...
[Mon, 05 Aug 2024 16:41:54 GMT] Launching job on cluster SLURM
[Mon, 05 Aug 2024 16:41:54 GMT]
====================== Cluster submission script: ========================
==========================================================================
#!/bin/sh
#SBATCH --export=ALL
#SBATCH -J cryosparc_P1_J3
#SBATCH -o /beegfs3/xxx/grid385_230724k4/CS-grid385-230724k4/J3/J3.out
#SBATCH -e /beegfs3/xxx/grid385_230724k4/CS-grid385-230724k4/J3/J3.err
#SBATCH --open-mode append
#SBATCH -t 7-00:00:00
#SBATCH --mail-type FAIL
#SBATCH -p gpu --gres gpu:1 --ntasks 1 --cpus-per-task 8
#SBATCH --mem 45G
echo CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
export CRYOSPARC_SSD_PATH="/ssd/${SLURM_JOB_USER}-${SLURM_JOBID}"
export MODULEPATH="/public/com/modules"
source /usr/share/lmod/lmod/init/profile
module load cuda/11.8
/net/flash/flash/xxx/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J3 --master_hostname flash.lmb.internal --master_command_core_port 50322 > /beegfs3/xxx/grid385_230724k4/CS-grid385-230724k4/J3/job.log 2>&1
==========================================================================
==========================================================================
[Mon, 05 Aug 2024 16:41:54 GMT] -------- Submission command:
sbatch /beegfs3/xxx/grid385_230724k4/CS-grid385-230724k4/J3/queue_sub_script.sh
[Mon, 05 Aug 2024 16:41:54 GMT] -------- Cluster Job ID:
5034014
[Mon, 05 Aug 2024 16:41:54 GMT] -------- Queued on cluster at 2024-08-05 17:41:54.749937
[Mon, 05 Aug 2024 16:41:55 GMT] -------- Cluster job status at 2024-08-05 17:42:15.286951 (2 retries)
5034014 gpu cryospar xxx PD 0:00 1 (None)
[Mon, 05 Aug 2024 16:42:24 GMT] [CPU RAM used: 93 MB] Job J3 Started
[Mon, 05 Aug 2024 16:42:24 GMT] [CPU RAM used: 93 MB] Master running v4.5.3, worker running v4.5.3
[Mon, 05 Aug 2024 16:42:24 GMT] [CPU RAM used: 93 MB] Working in directory: /beegfs3/xxx/grid385_230724k4/CS-grid385-230724k4/J3
[Mon, 05 Aug 2024 16:42:24 GMT] [CPU RAM used: 93 MB] Running on lane SLURM
[Mon, 05 Aug 2024 16:42:24 GMT] [CPU RAM used: 93 MB] Resources allocated:
[Mon, 05 Aug 2024 16:42:24 GMT] [CPU RAM used: 93 MB] Worker: SLURM
Traceback (most recent call last):
File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
[Mon, 05 Aug 2024 16:47:17 GMT] [CPU RAM used: 1088 MB] -- Class 2 -- lr: 0.20 eps: 1.15 step ratio : 0.1215 ESS R: 6.771 S: 2.445 Class Size: 37.4% (Average: 37.8%)
[Mon, 05 Aug 2024 16:47:17 GMT] [CPU RAM used: 1088 MB] Done iteration 00394 of 01641 in 0.929s. Total time 273.9s. Est time remaining 1140.8s.
[Mon, 05 Aug 2024 16:47:17 GMT] [CPU RAM used: 1088 MB] ----------- Iteration 395 (epoch 0.241). radwn 17.80 resolution 16.72A minisize 300 beta 0.00
[Mon, 05 Aug 2024 16:47:18 GMT] [CPU RAM used: 1088 MB] -- Class 0 -- lr: 0.20 eps: 1.14 step ratio : 0.0994 ESS R: 6.335 S: 2.592 Class Size: 23.8% (Average: 25.9%)
[Mon, 05 Aug 2024 16:47:18 GMT] [CPU RAM used: 1088 MB] -- Class 1 -- lr: 0.20 eps: 1.14 step ratio : 0.1287 ESS R: 5.482 S: 2.730 Class Size: 39.7% (Average: 36.3%)
[Mon, 05 Aug 2024 16:47:18 GMT] [CPU RAM used: 1088 MB] -- Class 2 -- lr: 0.20 eps: 1.14 step ratio : 0.1253 ESS R: 6.754 S: 2.449 Class Size: 36.5% (Average: 37.8%)
[Mon, 05 Aug 2024 16:47:18 GMT] [CPU RAM used: 1088 MB] Done iteration 00395 of 01641 in 0.933s. Total time 274.8s. Est time remaining 1142.0s.
[Mon, 05 Aug 2024 16:47:18 GMT] [CPU RAM used: 1088 MB] ----------- Iteration 396 (epoch 0.242). radwn 17.84 resolution 16.68A minisize 300 beta 0.00
[Mon, 05 Aug 2024 16:47:19 GMT] [CPU RAM used: 1088 MB] -- Class 0 -- lr: 0.20 eps: 1.16 step ratio : 0.1242 ESS R: 5.820 S: 2.589 Class Size: 26.3% (Average: 25.9%)
[Mon, 05 Aug 2024 16:47:19 GMT] [CPU RAM used: 1088 MB] -- Class 1 -- lr: 0.20 eps: 1.16 step ratio : 0.1186 ESS R: 5.391 S: 2.660 Class Size: 34.7% (Average: 36.3%)
[Mon, 05 Aug 2024 16:47:19 GMT] [CPU RAM used: 1088 MB] -- Class 2 -- lr: 0.20 eps: 1.16 step ratio : 0.1526 ESS R: 6.191 S: 2.330 Class Size: 39.0% (Average: 37.8%)
[Mon, 05 Aug 2024 16:47:19 GMT] [CPU RAM used: 1088 MB] Done iteration 00396 of 01641 in 0.939s. Total time 275.7s. Est time remaining 1143.9s.
[Mon, 05 Aug 2024 16:47:19 GMT] [CPU RAM used: 1088 MB] ----------- Iteration 397 (epoch 0.244). radwn 17.88 resolution 16.64A minisize 300 beta 0.00
[Mon, 05 Aug 2024 16:47:20 GMT] [CPU RAM used: 1088 MB] -- Class 0 -- lr: 0.20 eps: 1.11 step ratio : 0.1031 ESS R: 5.647 S: 2.493 Class Size: 22.3% (Average: 25.8%)
[Mon, 05 Aug 2024 16:47:20 GMT] [CPU RAM used: 1088 MB] -- Class 1 -- lr: 0.20 eps: 1.11 step ratio : 0.1302 ESS R: 5.064 S: 2.484 Class Size: 44.2% (Average: 36.4%)
[Mon, 05 Aug 2024 16:47:20 GMT] [CPU RAM used: 1088 MB] -- Class 2 -- lr: 0.20 eps: 1.11 step ratio : 0.1217 ESS R: 6.549 S: 2.555 Class Size: 33.5% (Average: 37.8%)
[Mon, 05 Aug 2024 16:47:20 GMT] [CPU RAM used: 1088 MB] Done iteration 00397 of 01641 in 0.939s. Total time 276.7s. Est time remaining 1145.3s.
[Mon, 05 Aug 2024 16:47:20 GMT] [CPU RAM used: 1088 MB] ----------- Iteration 398 (epoch 0.245). radwn 17.92 resolution 16.61A minisize 300 beta 0.00
[Mon, 05 Aug 2024 16:47:21 GMT] [CPU RAM used: 1088 MB] -- Class 0 -- lr: 0.20 eps: 1.13 step ratio : 0.0996 ESS R: 5.742 S: 2.492 Class Size: 21.9% (Average: 25.8%)
[Mon, 05 Aug 2024 16:47:21 GMT] [CPU RAM used: 1088 MB] -- Class 1 -- lr: 0.20 eps: 1.13 step ratio : 0.1289 ESS R: 5.012 S: 2.444 Class Size: 37.3% (Average: 36.4%)
[Mon, 05 Aug 2024 16:47:21 GMT] [CPU RAM used: 1088 MB] -- Class 2 -- lr: 0.20 eps: 1.13 step ratio : 0.1437 ESS R: 5.901 S: 2.398 Class Size: 40.8% (Average: 37.8%)
[Mon, 05 Aug 2024 16:47:21 GMT] [CPU RAM used: 1088 MB] Done iteration 00398 of 01641 in 0.950s. Total time 277.6s. Est time remaining 1148.0s.
[Mon, 05 Aug 2024 16:47:21 GMT] [CPU RAM used: 1088 MB] ----------- Iteration 399 (epoch 0.246). radwn 17.96 resolution 16.57A minisize 300 beta 0.00
[Mon, 05 Aug 2024 16:47:22 GMT] [CPU RAM used: 1088 MB] -- Class 0 -- lr: 0.20 eps: 1.15 step ratio : 0.0986 ESS R: 6.125 S: 2.568 Class Size: 21.6% (Average: 25.8%)
[Mon, 05 Aug 2024 16:47:22 GMT] [CPU RAM used: 1088 MB] -- Class 1 -- lr: 0.20 eps: 1.15 step ratio : 0.1216 ESS R: 5.711 S: 2.765 Class Size: 37.3% (Average: 36.4%)
[Mon, 05 Aug 2024 16:47:22 GMT] [CPU RAM used: 1088 MB] -- Class 2 -- lr: 0.20 eps: 1.15 step ratio : 0.1486 ESS R: 6.059 S: 2.341 Class Size: 41.1% (Average: 37.8%)
[Mon, 05 Aug 2024 16:47:22 GMT] [CPU RAM used: 1090 MB] Done iteration 00399 of 01641 in 0.934s. Total time 278.6s. Est time remaining 1148.3s.
[Mon, 05 Aug 2024 16:47:25 GMT] Structure for Class 000 Iteration 400
[Mon, 05 Aug 2024 16:47:26 GMT] Viewing Direction Distribution Class 000 Iteration 400
[Mon, 05 Aug 2024 16:47:26 GMT] Structure for Class 001 Iteration 400
[Mon, 05 Aug 2024 16:47:27 GMT] Viewing Direction Distribution Class 001 Iteration 400
[Mon, 05 Aug 2024 16:47:28 GMT] Structure for Class 002 Iteration 400
[Mon, 05 Aug 2024 16:47:28 GMT] Viewing Direction Distribution Class 002 Iteration 400
[Mon, 05 Aug 2024 16:47:29 GMT] Noise Model Iteration 400
[Mon, 05 Aug 2024 16:47:29 GMT] [CPU RAM used: 1094 MB] ----------- Iteration 400 (epoch 0.248). radwn 18.00 resolution 16.53A minisize 300 beta 0.00
[Mon, 05 Aug 2024 16:47:30 GMT] [CPU RAM used: 1094 MB] -- Class 0 -- lr: 0.20 eps: 1.12 step ratio : 0.1101 ESS R: 5.979 S: 2.546 Class Size: 24.3% (Average: 25.8%)
[Mon, 05 Aug 2024 16:47:30 GMT] [CPU RAM used: 1094 MB] -- Class 1 -- lr: 0.20 eps: 1.12 step ratio : 0.1322 ESS R: 5.562 S: 2.661 Class Size: 36.9% (Average: 36.4%)
[Mon, 05 Aug 2024 16:47:30 GMT] [CPU RAM used: 1094 MB] -- Class 2 -- lr: 0.20 eps: 1.12 step ratio : 0.1349 ESS R: 5.893 S: 2.349 Class Size: 38.8% (Average: 37.8%)
[Mon, 05 Aug 2024 16:50:25 GMT] **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
[Mon, 05 Aug 2024 16:50:26 GMT] Job is unresponsive - no heartbeat received in 180 seconds.
========= sending heartbeat at 2024-08-05 17:44:14.862980
========= sending heartbeat at 2024-08-05 17:44:24.879992
========= sending heartbeat at 2024-08-05 17:44:34.896982
========= sending heartbeat at 2024-08-05 17:44:44.915863
========= sending heartbeat at 2024-08-05 17:44:54.933885
========= sending heartbeat at 2024-08-05 17:45:04.952723
========= sending heartbeat at 2024-08-05 17:45:14.970969
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2024-08-05 17:45:24.989988
========= sending heartbeat at 2024-08-05 17:45:35.008995
========= sending heartbeat at 2024-08-05 17:45:45.027043
========= sending heartbeat at 2024-08-05 17:45:55.044600
gpufft: creating new cufft plan (plan id 2 pid 446597)
gpu_id 0
ndims 2
dims 128 128 0
inembed 128 128 0
istride 1
idist 16384
onembed 128 128 0
ostride 1
odist 16384
batch 300
type C2C
wkspc automatic
Python traceback:
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2024-08-05 17:46:05.061957
========= sending heartbeat at 2024-08-05 17:46:15.089088
========= sending heartbeat at 2024-08-05 17:46:25.106993
========= sending heartbeat at 2024-08-05 17:46:35.127248
========= sending heartbeat at 2024-08-05 17:46:45.146968
========= sending heartbeat at 2024-08-05 17:46:55.165966
========= sending heartbeat at 2024-08-05 17:47:05.193993
========= sending heartbeat at 2024-08-05 17:47:15.210962
========= sending heartbeat at 2024-08-05 17:47:25.228014
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
corrupted size vs. prev_size
/net/flash/flash/xxx/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 446597 Aborted (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"
Hi @fent,
The message corrupted size vs. prev_size near the bottom of the log suggests that there has been some sort of memory corruption. I’m going to send you a DM about this shortly.
– Harris