No heartbeat error, but refinement continues

Hi,
Just wanted to share an odd error behavior. Although the message says the job was killed, the refinement continues, seemingly without problems (see picture). I repeatedly get this error after the first iteration when refining particles with a box size of 500.

ERROR: No Heartbeat
This job was killed because a heartbeat was not received for 30 seconds.

The last 100 lines of the job standard output are printed below.
====== ENVIRONMENT ========
{‘BASH_FUNC_module()’: ‘() { eval /usr/bin/modulecmd bash $*\n}’, ‘SHELL’: ‘/bin/bash’, ‘CRYOSPARC_BULK_DIR’: ‘/home/mestroupe/bin/cryosparc_beta/run/bulk’, ‘SUPERVISOR_ENABLED’: ‘1’, ‘HISTSIZE’: ‘1000’, ‘TEST_METADATA’: ‘{}’, ‘XMODIFIERS’: ‘@im=ibus’, ‘XDG_RUNTIME_DIR’: ‘/run/user/1000’, ‘PYTHONPATH’: ‘’, ‘CRYOSPARC_REGISTER_DONE’: ‘true’, ‘CRYOSPARC_METEOR_BINDIR’: ‘’, ‘XDG_SESSION_ID’: ‘2’, ‘DBUS_SESSION_BUS_ADDRESS’: ‘unix:abstract=/tmp/dbus-pSfac9qokm,guid=029eba3e1dbdc0112cd866a558c02635’, ‘DESKTOP_SESSION’: ‘gnome-classic’, ‘CRYOSPARC_RESULTS_MIGRATION_DONE’: ‘true’, ‘HOSTNAME’: ‘ultra’, ‘CRYOSPARC_NODEJS_BINDIR’: ‘/home/mestroupe/bin/cryosparc_beta/nodejs/bin’, ‘MAIL’: ‘/var/spool/mail/mestroupe’, ‘MONGO_URL’: ‘mongodb://localhost:38001/meteor’, ‘MONGO_OPLOG_URL’: ‘mongodb://localhost:38001/local’, ‘CRYOSPARC_MASTER_HOSTNAME’: ‘ultra.biophysics.fsu.edu’, ‘LESSOPEN’: ‘||/usr/bin/lesspipe.sh %s’, ‘USER’: ‘mestroupe’, ‘HOME’: ‘/home/mestroupe’, ‘XDG_VTNR’: ‘1’, ‘PORT’: ‘38000’, ‘SUPERVISOR_SERVER_URL’: ‘unix:///tmp/supervisor-8243795635577540623.sock’, ‘XAUTHORITY’: ‘/run/gdm/auth-for-mestroupe-jfTeIx/database’, ‘SESSION_MANAGER’: ‘local/unix:@/tmp/.ICE-unix/11698,unix/unix:/tmp/.ICE-unix/11698’, ‘SHLVL’: ‘4’, ‘DISPLAY’: ‘:0’, ‘NODE_ENV’: ‘production’, ‘CRYOSPARC_DEVELOP’: ‘false’, ‘WINDOWID’: ‘38024291’, ‘GPG_AGENT_INFO’: ‘/run/user/1000/keyring/gpg:0:1’, ‘MODULESHOME’: ‘/usr/share/Modules’, ‘XDG_SESSION_DESKTOP’: ‘gnome-classic’, ‘ROOT_URL’: ‘http://localhost:38000’, ‘CRYOSPARC_RAM_SLOTS’: ‘16’, ‘TOMOCTF’: ‘/usr/local/bin/tomoctf’, ‘GDMSESSION’: ‘gnome-classic’, ‘SUPERVISOR_PROCESS_NAME’: ‘webapp’, ‘CRYOSPARC_HTTP_PORT’: ‘38000’, ‘XDG_MENU_PREFIX’: ‘gnome-’, ‘RELION’: ‘/usr/local/relion-2.0/build/’, ‘ULTRASCAN’: ‘/usr/lib/ultrascan3’, ‘_’: ‘/home/mestroupe/bin/cryosparc_beta/anaconda2/bin/python’, ‘MODULEPATH’: ‘/usr/share/Modules/modulefiles:/etc/modulefiles’, ‘LS_COLORS’: 
‘rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5;9:.lz4=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.tzo=38;5;9:.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=38;5;9:.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.webm=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:*.xspf=38;5;45:’, ‘OMP_NUM_THREADS’: ‘1’, ‘CRYOSPARC_LICENSE_ID’: ‘7fe9aca1-fd04-5411-a429-8ae934780c29’, ‘CRYOSPARC_ROOT_DIR’: ‘/home/mestroupe/bin/cryosparc_beta’, ‘CRYOSPARC_UPLOAD_DIR’: ‘/home/mestroupe/bin/cryosparc_beta/run/bulk/uploads’, ‘CRYOSPARC_CACHE_CUSHION’: ‘10240.0’, ‘QTDIR’: ‘/usr/lib64/qt-3.3’, ‘LD_LIBRARY_PATH’: 
‘/usr/local/cuda/lib64:/usr/local/cuda-7.5:/usr/local/cuda-7.5/lib64:/usr/lib64/:/usr/lib/ultrascan3/lib:/home/mestroupe/libfatal: Not a git repository (or any parent up to mount point /home)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /home)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /home)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
usage: git diff [–no-index]
/lib:/usr/lib64/openmpi:/usr/lib64/openmpi/bin::/usr/local/cuda-7.5:/usr/local/cuda-7.5/lib64:/usr/lib64/:/usr/lib/ultrascan3/lib:/home/mestroupe/lib/lib:/usr/lib64/openmpi:/usr/lib64/openmpi/bin’, ‘LANG’: ‘en_US.UTF-8’, ‘QTLIB’: ‘/usr/lib64/qt-3.3/lib’, ‘QTINC’: ‘/usr/lib64/qt-3.3/include’, ‘GNOME_DESKTOP_SESSION_ID’: ‘this-is-deprecated’, ‘MKL_NUM_THREADS’: ‘1’, ‘IMSETTINGS_MODULE’: ‘none’, ‘CRYOSPARC_CODE_DIR’: ‘/home/mestroupe/bin/cryosparc_beta/cryosparc-compute’, ‘VTE_VERSION’: ‘3804’, ‘CRYOSPARC_ANACONDA_BINDIR’: ‘/home/mestroupe/bin/cryosparc_beta/anaconda2/bin’, ‘QT_GRAPHICSSYSTEM_CHECKED’: ‘1’, ‘XDG_CURRENT_DESKTOP’: ‘GNOME-Classic:GNOME’, ‘CRYOSPARC_MONGO_PORT’: ‘38001’, ‘CRYOSPARC_CUDA_DEVS’: ‘0,1,2’, ‘SUPERVISOR_GROUP_NAME’: ‘webapp’, ‘USERNAME’: ‘mestroupe’, ‘GDM_LANG’: ‘en_US.UTF-8’, ‘NUMEXPR_NUM_THREADS’: ‘1’, ‘QT_IM_MODULE’: ‘ibus’, ‘LOGNAME’: ‘mestroupe’, ‘XDG_SEAT’: ‘seat0’, ‘PATH’: ‘/home/mestroupe/bin/cryosparc_beta/mongodb/bin:/home/mestroupe/bin/cryosparc_beta/nodejs/bin:/home/mestroupe/bin/cryosparc_beta/anaconda2/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib64/:/usr/lib:/usr/lib64/openmpi:/usr/local:/usr/lib64/openmpi/bin:/usr/lib64/openmpi/lib:/home/mestroupe/bin/frealign_v9.11/bin:/usr/local/relion-2.0/build//bin:/usr/local/cuda/bin:/home/mestroupe/bin/cryosparc_beta/bin:/usr/lib/ultrascan3/bin:/usr/local/bin/tomoctf/bin’, ‘SSH_AGENT_PID’: ‘11872’, ‘TERM’: ‘xterm-256color’, ‘WINDOWPATH’: ‘1’, ‘IMSETTINGS_INTEGRATE_DESKTOP’: ‘yes’, ‘METEOR_SETTINGS’: ‘{“public”:{“webinstance”:false, “instancename”:“ultra.biophysics.fsu.edu”, “instancetype”:“academicbeta”}}’, ‘CRYOSPARC_JOB_LOG_DIR’: ‘/home/mestroupe/bin/cryosparc_beta/run/sparcjobs’, ‘CRYOSPARC_SUPERVISOR_SOCK_FILE’: ‘/tmp/supervisor-8243795635577540623.sock’, ‘CRYOSPARC_INSTALL_TYPE’: ‘master’, ‘CRYOSPARC_MONGODB_BINDIR’: ‘/home/mestroupe/bin/cryosparc_beta/mongodb/bin’, ‘SSH_AUTH_SOCK’: ‘/run/user/1000/keyring/ssh’, ‘HTTP_FORWARDED_COUNT’: 
‘1’, ‘GNOME_SHELL_SESSION_MODE’: ‘classic’, ‘LOADEDMODULES’: ‘’, ‘APP_ID’: ‘1ipv1tp12xzfzhba088c’, ‘HISTCONTROL’: ‘ignoredups’, ‘PWD’: ‘/home/mestroupe/bin/cryosparc_beta/cryosparc-webapp/bundle/programs/server’}

2839a807bd80fca11d311cdb0e7917735f774ece1b2692d7db8ee4541ed8424f
License Data: {“request_date”:“Thursday March 16 2017”,“issued_date”:“Monday February 6 2017”,“expiry_date”:“Saturday April 1 2017”,“issued_to_inst”:“Institute of Molecular Biophysics / Florida State University”,“issued_to_name”:“mn12@my.fsu.edu”,“license_type”:“academic_beta”,“version”:“all”,“valid”:true}
License Signature: (12297364243309157028851666555203583159204662421052654659069307946824529572076571028887518495231179356779579857308957644509317775299841141142776366053362192842713771322526515234337493315388882389671471691096390982535761714038136476548345684307944281010429286355927345914911402289868302418100278053670769108299808926396885834237223719472037563156982787714854523818753816808332353908136938080445110798349310072500635441616760673155129153977421779920419359727402089195086456303524467140885746798814414470693336794845734157452292375881122423384203080680176171387485791629890411277592349874270685959280965431657542351571215L,)
<matplotlib.figure.Figure at 0x7f337cd83290>
<matplotlib.figure.Figure at 0x7f34bc8edb90>
<matplotlib.figure.Figure at 0x7f333db68610>
FSC No-Mask… 0.143 at 127.450 radwn. Took 10.234s.
FSC Spherical Mask… 0.143 at 136.386 radwn. Took 13.966s.
FSC Loose Mask… 0.143 at 147.573 radwn. Took 42.250s.
FSC Tight Mask… 0.143 at 158.378 radwn. 0.5 at 134.689 radwn. Took 41.064s.
FSC Noise Sub… 0.143 at 150.793 radwn. 0.5 at 133.844 radwn. Took 88.767


Mike

Very strange.
This means that some part of the job is saturating the CPU for more than 30 seconds (the processing job sends a heartbeat to the webapp every 30s so that the webapp knows the job is actually still alive). It looks like some 3D FFT might be taking longer than that, which is plausible with 500^3 box sizes. We will look into this.
For now, though it’s annoying, it won’t actually impact the results of the job at all.

Hope that helps,
Ali
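
To make the mechanism described above concrete, here is a minimal sketch of a heartbeat watchdog in Python. It is not cryoSPARC's actual implementation: the class `HeartbeatMonitor`, the function `simulate`, the 5-second polling interval, and the heartbeat timestamps are all hypothetical and chosen only for illustration. It shows how a job that goes silent for longer than the timeout gets flagged as failed even though it is still running and later finishes normally.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical watchdog: flags a job when no heartbeat arrives within `timeout` seconds."""
    timeout: float = 30.0
    last_beat: float = 0.0
    status: str = "running"

    def beat(self, now: float) -> None:
        # Record the time of the most recent heartbeat.
        self.last_beat = now

    def check(self, now: float) -> str:
        # Flag the job only on the transition from "running" to "failed".
        if self.status == "running" and now - self.last_beat > self.timeout:
            self.status = "failed"
        return self.status

    def finish(self) -> None:
        # The job itself reports completion, overriding an earlier "failed" flag.
        self.status = "complete"


def simulate(beat_times, finish_time, timeout=30.0, poll=5.0):
    """Replay heartbeat timestamps (seconds) and return the status transitions."""
    monitor = HeartbeatMonitor(timeout=timeout)
    beats = sorted(beat_times)
    transitions = []
    t, i = 0.0, 0
    while t <= finish_time:
        # Deliver every heartbeat that has arrived by time t.
        while i < len(beats) and beats[i] <= t:
            monitor.beat(beats[i])
            i += 1
        before = monitor.status
        after = monitor.check(t)
        if after != before:
            transitions.append((t, after))
        t += poll
    monitor.finish()
    transitions.append((finish_time, monitor.status))
    return transitions


# A job that stalls between t=20s and t=65s (for example in one long computation)
# is flagged as failed, then still runs to completion.
stalled = simulate([0, 10, 20, 65, 75, 85, 95], finish_time=100)
# A job that keeps sending heartbeats is never flagged.
healthy = simulate(range(0, 101, 10), finish_time=100)
```

In this toy model the stalled job is flagged once, during the gap in heartbeats, and its status only changes again when it finishes, which matches the behaviour reported in this thread.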

Hi Ali,

I can confirm that there is no impact on the results. Also, this happens only once per job, usually in iteration 1 or 2, and the job then continues without any issues. Right after the error occurs, the job is marked as ‘Failed’, but if left untouched it will eventually finish and the job status will change to ‘complete’.

Mike

I had the same problem: “This job was killed because a heartbeat was not received for 30 seconds.”

For me, it happened when I tried to visualize my 142k-particle dataset (not during refinement).
With that heartbeat message, it didn’t visualize at all. I tried many times, and it kept failing.

After my other job (relion using 25 cores) finished, I tried again. This time, cryosparc visualized the dataset without the heartbeat error. Therefore, I assume that my other heavy-duty job caused the error.

I just had the same message during ab initio as well.

ERROR: No Heartbeat
This job was killed because a heartbeat was not received for 30 seconds.

The last 100 lines of the job standard output are printed below.

This message occurred while I was running another relion job on 25 cores at the same time.

Hi @dnamkr,

The heartbeat error happens when a running compute job hasn’t responded for 30 seconds. This usually indicates that the job has crashed in a serious way (so it couldn’t even report an error traceback). In your case, if all your hardware resources are consumed by a different process (Relion, for instance), then the cryoSPARC job won’t be able to do anything for 30 seconds, in which case it looks to the cryoSPARC scheduler as if the job has completely failed.
So nothing is actually going wrong here: this is the expected behaviour if the system is fully loaded by another process.

Ali
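
As a rough way to check for the situation described above before launching a job, the sketch below compares the system's 1-minute load average with the number of CPU cores, using only the Python standard library. The helper names `is_overloaded` and `current_load_per_core` and the 0.9 threshold are assumptions made for illustration, not part of cryoSPARC, and CPU load is only one of the resources (alongside GPU, memory, and disk I/O) that another process can exhaust.

```python
import os


def is_overloaded(load_1min: float, n_cpus: int, threshold: float = 0.9) -> bool:
    """Return True if the load per core meets or exceeds `threshold` (an arbitrary cutoff)."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return load_1min / n_cpus >= threshold


def current_load_per_core() -> float:
    """Read the 1-minute load average (Unix only) and divide by the core count."""
    load_1min, _, _ = os.getloadavg()
    return load_1min / (os.cpu_count() or 1)


# Illustrative numbers only: a load of 25 on a 16-core machine counts as overloaded,
# while a load of 2 on the same machine does not.
busy = is_overloaded(25.0, 16)
idle = is_overloaded(2.0, 16)
```

Calling `current_load_per_core()` on the workstation gives the live value to compare against whatever threshold seems reasonable for that machine.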
