Job is unresponsive - no heartbeat received in 30 seconds in CS v4.0.3

Running jobs have recently been failing with "Job is unresponsive - no heartbeat received in 30 seconds" after we upgraded to the latest v4.0.3 of cryoSPARC. We followed the instructions posted here: Job is unresponsive - no heartbeat received in 30 second - #3 by marino-j, adding `export CRYOSPARC_HEARTBEAT_SECONDS=180` to `cryosparc_master/config.sh`, but to no avail, even after restarting cryoSPARC.
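As a side note, an override like this only takes effect if the variable is actually exported in the environment the master reads on startup, and the master is fully restarted afterwards. A minimal sketch of how such an environment override is typically consumed (the variable name is from the post above; the 30-second fallback mirrors the error message, not cryoSPARC's actual source):

```python
import os

# Illustrative only: read a heartbeat-timeout override from the environment,
# falling back to the 30 s default quoted in the error message.
def heartbeat_timeout_seconds() -> int:
    return int(os.environ.get("CRYOSPARC_HEARTBEAT_SECONDS", "30"))

# This is effectively what the export line in config.sh accomplishes:
os.environ["CRYOSPARC_HEARTBEAT_SECONDS"] = "180"
print(heartbeat_timeout_seconds())  # prints 180
```

If the export was added but the job still times out at 30 seconds, that suggests the variable is not reaching the worker process.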

For example, with Heterogeneous Refinement jobs, some runs get further than others, but none finish. We also tried different box sizes to see whether RAM was the issue, but that did not help either. However, if I bin the particles down from 512 → 256, the same jobs that were failing with no heartbeat complete successfully. This is strange because we previously processed this data at full size (512) without issue on CS3.
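The 512 → 256 observation is at least consistent with a memory effect: halving the box shrinks each particle image by 4x. A rough back-of-the-envelope estimate (single-precision real-space images only; actual usage also includes FFT buffers and padding, so real numbers are higher):

```python
def particle_mb(box: int, bytes_per_px: int = 4) -> float:
    """Approximate size of one single-precision box*box particle image, in MB."""
    return box * box * bytes_per_px / 1e6

for box in (512, 256):
    print(box, round(particle_mb(box), 3))
# 512 -> ~1.049 MB per image, 256 -> ~0.262 MB per image; over a stack of
# ~1M particles that is roughly 1 TB vs 0.26 TB of raw image data streamed
# through RAM/VRAM during the run.
```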

Any help fixing this recurring issue would be much appreciated – thank you.


Please can you post job.log for one of those failed jobs.

If you have not already, please can you clone a successful v3 job, run the cloned job in v4 and send us

  • the job.json of the v3 job
  • the job report for the failed v4 job

This is the job.log for one of the failed jobs on v4:

Connection Refused: Server at http://helix.hpc.private:8002/api is not accepting new connections. Is the server running?
************* Connection to cryosparc command lost.
========= sending heartbeat
[... "========= sending heartbeat" repeated 34 times ...]
Connection Refused: Server at http://helix.hpc.private:8002/api is not accepting new connections. Is the server running?
************* Connection to cryosparc command lost.

This is the job.json of the v3 job:

HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
========= sending heartbeat
========= sending heartbeat
[... pairs of "HOST ALLOCATION FUNCTION" lines interleaved with runs of "========= sending heartbeat" for ~100 lines ...]
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
exception in force_free_cufft_plan:
exception in force_free_cufft_plan:
exception in cufft.Plan.del:
exception in cufft.Plan.del:
exception in cufft.Plan.del:
[... the six-line exception block above repeated 12 times in a row; the whole cycle of heartbeat runs followed by 12 exception blocks occurs twice more ...]


/admin/opt/common/cryosparc3.1/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py:99: RuntimeWarning: invalid value encountered in true_divide
self._target(*self._args, **self._kwargs)
/admin/opt/common/cryosparc3.1/cryosparc_worker/cryosparc_compute/plotutil.py:901: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (matplotlib.pyplot.figure) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam figure.max_open_warning).
fig = plt.figure(figsize=figsize)
/admin/opt/common/cryosparc3.1/cryosparc_worker/cryosparc_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
/admin/opt/common/cryosparc3.1/cryosparc_worker/cryosparc_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
/admin/opt/common/cryosparc3.1/cryosparc_worker/cryosparc_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
/admin/opt/common/cryosparc3.1/cryosparc_worker/cryosparc_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
/admin/opt/common/cryosparc3.1/cryosparc_worker/cryosparc_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
========= main process now complete.
========= monitor process now complete.

Lastly, how should I send you the job report that I downloaded for the v4 job?


A follow-up question to the v4 job report you sent:
Please can you search inside the corresponding job’s directory for any *.err and *.out files and post their content. If you don’t find those files, please can you post a file listing of the job directory.

There is nothing in the .err file. Below is the content of the .out file:

Sender: LSF System lsfadmin@node
Subject: Job 11683429: <cryosparc_P1_J197> in cluster Done

Job <cryosparc_P1_J197> was submitted from host by user in cluster at Tue Nov 8 13:37:27 2022
Job was executed on host(s) <4*node>, in queue , as user in cluster at Tue Nov 8 13:37:28 2022
</opt/common/cryosparc3.1> was used as the home directory.
</opt/common/cryosparc/CS-4.0.3/cryosparc_master> was used as the working directory.
Started at Tue Nov 8 13:37:28 2022
Terminated at Tue Nov 8 15:08:45 2022
Results reported at Tue Nov 8 15:08:45 2022

Your job looked like:


LSBATCH: User input

#!/bin/bash
#BSUB -J cryosparc_P1_J197
#BSUB -m lj-gpu
#BSUB -e /data/%J.err
#BSUB -o /data/%J.out
#BSUB -n 4
#BSUB -R "span[ptile=4]"
#BSUB -R "rusage[mem=16.0]"
#BSUB -W 167:00
#BSUB -q gpuqueue
#BSUB -gpu "num=1:gmem=20G:j_exclusive=no:mode=shared"
#BSUB -R "A100" -sla llSC

#BSUB -W 167:00
##Load modules

/opt/common/cryosparc/CS-4.0.3/cryosparc_worker/bin/cryosparcw run --project P1 --job J197 --master_hostname --master_command_core_port 8002 > /path/job.log 2>&1


Successfully completed.

Resource usage summary:

CPU time :                                   6374.10 sec.
Max Memory :                                 31 GB
Average Memory :                             22.86 GB
Total Requested Memory :                     64.00 GB
Delta Memory :                               33.00 GB
Max Swap :                                   -
Max Processes :                              7
Max Threads :                                20
Run time :                                   5477 sec.
Turnaround time :                            5478 sec.

The output (if any) follows:

PS:

Read file for stderr output of this job.


Checking in to see whether there are any updates on fixing this issue.


How long did the jobs with the binned 256-box particles take to complete?


It took approximately 08h 41m.


Hi,
Did you find any solution to this?
best,
Kapil


The issue still persists on my end, unfortunately.


CryoSPARC 4.1 has just been released. The release includes a fix that prevents certain heartbeat failures.


Hi,

I have similar problems during 2D classification jobs. It has happened very often since I upgraded to v4, and updating to the latest version (4.1.1) did not seem to solve the problem. Here is the job.log file for one of the failed 2D classification jobs.



================= CRYOSPARCW =======  2022-12-26 04:26:31.452796  =========
Project P4 Job J9
Master cryows1 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 174865
MAIN PID 174865
class2D.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job  J9  of type  class_2D
Running job on hostname %s cryows1
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'cryows1', 'lane': 'default', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/home/cryosparc_user/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25431441408, 'name': 'NVIDIA RTX A5500'}, {'id': 1, 'mem': 25434587136, 'name': 'NVIDIA RTX A5500'}, {'id': 2, 'mem': 25434587136, 'name': 'NVIDIA RTX A5500'}, {'id': 3, 'mem': 25434587136, 'name': 'NVIDIA RTX A5500'}], 'hostname': 'cryows1', 'lane': 'default', 'monitor_port': None, 'name': 'cryows1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@cryows1', 'title': 'Worker node cryows1', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/Applications/cryosparc/cryosparc_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty 
========= sending heartbeat
[... "========= sending heartbeat" repeated 60 times ...]

It seems this Job unresponsive error is more likely to happen when: i) a higher number of 2D classes (400 or more) is used; ii) a higher number of online-EM iterations (40 or 60) is used; iii) a larger batch size (e.g. 400) is used; iv) more than one GPU is used for a 2D classification job; or v) more than one 2D classification job is running at the same time.
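All five conditions increase the per-GPU working set, so one crude sanity check is the memory held just by the class averages themselves. This is a simplified estimate of my own, not how cryoSPARC accounts memory; real usage adds alignment buffers, FFT plans, and the particle batch:

```python
def class_stack_mb(n_classes: int, box: int, bytes_per_px: int = 4) -> float:
    """Approximate size of a stack of single-precision class averages, in MB."""
    return n_classes * box * box * bytes_per_px / 1e6

# e.g. 400 classes vs 50 classes at a 256 box:
print(round(class_stack_mb(400, 256), 1))  # 104.9
print(round(class_stack_mb(50, 256), 1))   # 13.1
```

The averages alone stay small, which hints the failures track total load (batches, iterations, concurrent jobs) rather than the class stack itself.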

Any suggestions or fixes are much appreciated.

Thanks.

@YYang Please can you post the output of
uname -a
on the GPU node and email us the job report for J9.

Hi @wtempel , here is the output of uname -a

Linux cryows1 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I have also just emailed you the job report for J9.

Thanks.

Thanks for sending those. Please can you post the output of
/usr/bin/env.

Here is the output of /usr/bin/env

SHELL=/bin/bash
SESSION_MANAGER=local/cryows1:@/tmp/.ICE-unix/3026,unix/cryows1:/tmp/.ICE-unix/3026
QT_ACCESSIBILITY=1
COLORTERM=truecolor
XDG_CONFIG_DIRS=/etc/xdg/xdg-ubuntu:/etc/xdg
SSH_AGENT_LAUNCHER=gnome-keyring
XDG_MENU_PREFIX=gnome-
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
CONDA_EXE=/home/changliu/Applications/anaconda3/bin/conda
CHTML=/home/changliu/Applications/ccp4-8.0/html
_CE_M=
CCP4_OPEN=UNKNOWN
CLIBD=/home/changliu/Applications/ccp4-8.0/lib/data
GNOME_SHELL_SESSION_MODE=ubuntu
SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
CCP4I_TOP=/home/changliu/Applications/ccp4-8.0/share/ccp4i
CCP4_MASTER=/home/changliu/Applications
PHENIX=/usr/local/phenix-1.20.1-4487
XMODIFIERS=@im=ibus
DESKTOP_SESSION=ubuntu
PHENIX_VERSION=1.20.1-4487
CCP4_HELPDIR=/home/changliu/Applications/ccp4-8.0/help/
BALBES_ROOT=/home/changliu/Applications/ccp4-8.0/BALBES
GTK_MODULES=gail:atk-bridge
PWD=/home/changliu/Desktop
LOGNAME=changliu
XDG_SESSION_DESKTOP=ubuntu
XDG_SESSION_TYPE=x11
CEXAM=/home/changliu/Applications/ccp4-8.0/examples
CCP4_SCR=/tmp/changliu
GPG_AGENT_INFO=/run/user/1000/gnupg/S.gpg-agent:0:1
SYSTEMD_EXEC_PID=3044
XAUTHORITY=/run/user/1000/gdm/Xauthority
IMOD_CALIB_DIR=/usr/local/ImodCalib
WINDOWPATH=2
HOME=/home/changliu
USERNAME=changliu
LANG=en_US.UTF-8
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
XDG_CURRENT_DESKTOP=ubuntu:GNOME
VTE_VERSION=6800
IMOD_DIR=/usr/local/IMOD
MMCIFDIC=/home/changliu/Applications/ccp4-8.0/lib/ccp4/cif_mmdic.lib
CLIB=/home/changliu/Applications/ccp4-8.0/lib
GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/ce918f7d_b7e6_48ef_89bc_c76a6f5c5116
IMOD_JAVADIR=/usr/local/java
IMOD_PLUGIN_DIR=/usr/local/IMOD/lib/imodplug
GFORTRAN_UNBUFFERED_PRECONNECTED=Y
LIBTBX_TMPVAL=
FOR_DISABLE_STACK_TRACE=1
CCP4=/home/changliu/Applications/ccp4-8.0
LESSCLOSE=/usr/bin/lesspipe %s %s
XDG_SESSION_CLASS=user
TERM=xterm-256color
_CE_CONDA=
IMOD_QTLIBDIR=/usr/local/IMOD/qtlib
LESSOPEN=| /usr/bin/lesspipe %s
USER=changliu
GNOME_TERMINAL_SERVICE=:1.105
CONDA_SHLVL=0
DISPLAY=:1
SHLVL=1
LIBTBX_OPATH=
CRANK=/home/changliu/Applications/ccp4-8.0/share/ccp4i/crank
QT_IM_MODULE=ibus
CONDA_PYTHON_EXE=/home/changliu/Applications/anaconda3/bin/python
CLIBD_MON=/home/changliu/Applications/ccp4-8.0/lib/data/monomers/
XDG_RUNTIME_DIR=/run/user/1000
SSL_CERT_FILE=/home/changliu/Applications/ccp4-8.0/etc/ssl/cacert.pem
CBIN=/home/changliu/Applications/ccp4-8.0/bin
LIBTBX_BUILD=
CETC=/home/changliu/Applications/ccp4-8.0/etc
XDG_DATA_DIRS=/usr/share/ubuntu:/usr/share/gnome:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop
PATH=/home/changliu/Applications/cryosparc/cryosparc_master/bin:/home/changliu/Applications/anaconda3/condabin:/home/changliu/Applications/cryoEM:/home/changliu/Applications/cistem:/usr/local/relion/bin:/usr/local/cuda/bin:/usr/local/phenix-1.20.1-4487/build/bin:/home/changliu/Applications/ccp4-8.0/etc:/home/changliu/Applications/ccp4-8.0/bin:/usr/local/IMOD/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/IMOD/pythonLink
GDMSESSION=ubuntu
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
CCP4I_TCLTK=/home/changliu/Applications/ccp4-8.0/bin
CINCL=/home/changliu/Applications/ccp4-8.0/include
OLDPWD=/usr/local/phenix-1.20.1-4487/build
_=/usr/bin/env

Thanks.

4 posts were split to a new topic: Bus error (core dumped) during non-uniform refinement