"Connection to cryosparc command lost" with 3D jobs

I’ve had many Ab Initio and Refinement jobs fail very close to completion. The only error that appears on the last line of job.log is “Connection to cryosparc command lost”. Refinements are usually preceded by a “No Heartbeat” message but continue running until the “Command lost” error; Ab Initio jobs only show the “command lost” message. Usually, by this point I’m happy with the results, but if I try to use outputs from a failed job in the next one, that job stays “Queued” forever.
I am only ever running a single job at a time.
CS2 is installed on a standalone workstation running CentOS 7, v2.0.27. I’ve just updated to 2.1 and am going to run a few jobs to test it out.

Thank you

Hello,

Did the job tests go well?
Also, you can try restarting the command core when you notice this problem:

cryosparcm restart
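After the restart, you can confirm that all the master processes came back up with:

```
cryosparcm status
```

which should show command_core (along with the database and webapp) running.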

If that doesn’t help and the issue persists, could you post the output of the command_core log and your job log?

cryosparcm log command_core
cryosparcm joblog PX JXX
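Here, PX and JXX are placeholders for the project and job UIDs of the failed job; for example, if job J44 in project P7 failed, the second command would be:

```
cryosparcm joblog P7 J44
```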

Thanks,

Stephan

@stephan Hi Stephan,

After the 2.10 update, 4 test jobs have run without either the “No Heartbeat” or “Connection lost” errors. Even jobs that previously failed now re-run without any issues. I will let you know if the error resurfaces.

Thank you,

Andrey

@stephan
After the 2.20 update, now running on a cluster, I am seeing heartbeat errors again during refinements. Ab initio jobs have been working fine, but homogeneous refinements with either C1 or D2 symmetry fail. I’ve tried `cryosparcm restart`, but that has not helped.
Output of `cryosparcm log command_core`:

```
---------- Scheduler running ---------------
Lane Cryo-EM_cluster cluster : Jobs Queued (nonpaused, inputs ready): [u'J44']
Now trying to schedule J44
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on Cryo-EM_cluster
    Launchable:  True
    Alloc slots :  {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P7 job UID J44
Insecure mode - no SSL in license check
failed to connect link
License Data:  {"token": "#####", "token_valid": true, "request_date": 1537553239, "license_valid": true}
License Signature:  #####
     Running job on worker type cluster
sbatch /ul/amalyuti/091818_Bgal_R11_190K/P7/J44/queue_sub_script.sh
['sbatch', '/ul/amalyuti/091818_Bgal_R11_190K/P7/J44/queue_sub_script.sh']
Submitted batch job 645

squeue -j 645
['squeue', '-j', '645']
Changed job P7.J44 status launched
---------- Scheduler done ------------------
Changed job P7.J44 status started
Changed job P7.J44 status running
------------- Heartbeat check ------------------- 
deadline:    2018-09-21 18:24:12.334057 
Overdue jobs :    [{u'_id': ObjectId('5ba5334615cc9c513b085a10'), u'uid': u'J44', u'project_uid': u'P7'}] 
Marking job P7.J44 as failed 
Changed job P7.J44 status failed 
------------- Heartbeat check done ------------- 
compute_use_ssd boolean
Setting parameter J45.compute_use_ssd with value False of type <type 'bool'>
refine_symmetry string
Setting parameter J45.refine_symmetry with value D2 of type <type 'str'>
---------- Scheduler running --------------- 
Lane  Cryo-EM_cluster cluster : Jobs Queued (nonpaused, inputs ready):  [u'J45']
Now trying to schedule J45
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on Cryo-EM_cluster
    Launchable:  True
    Alloc slots :  {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P7 job UID J45 
Insecure mode - no SSL in license check
failed to connect link
License Data:  {"token": "#####", "token_valid": true, "request_date": 1537554626, "license_valid": true}
License Signature:  #####
     Running job on worker type cluster
sbatch /ul/amalyuti/091818_Bgal_R11_190K/P7/J45/queue_sub_script.sh
['sbatch', '/ul/amalyuti/091818_Bgal_R11_190K/P7/J45/queue_sub_script.sh']
Submitted batch job 646
squeue -j 646
['squeue', '-j', '646']
Changed job P7.J45 status launched
---------- Scheduler done ------------------
Changed job P7.J45 status started
Changed job P7.J45 status running
------------- Heartbeat check ------------------- 
deadline:    2018-09-21 18:47:30.372238 
Overdue jobs :    [{u'_id': ObjectId('5ba538b315cc9c513b0873e0'), u'uid': u'J45', u'project_uid': u'P7'}] 
Marking job P7.J45 as failed 
Changed job P7.J45 status failed
```

As for `cryosparcm joblog P7 J45`:

```
================= CRYOSPARCW ======= 2018-09-21 11:31:17.833397 =========
Project P7 Job J45
Master cemaster.cluster.caltech.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 28413
========= monitor process now waiting for main process
MAIN PID 28413
refine.run cryosparc2_compute.jobs.jobregister
========= sending heartbeat
[... 11 more identical "sending heartbeat" lines ...]
***************************************************************
Running job J45 of type homo_refine
Running job on hostname %s Cryo-EM_cluster
Allocated Resources : {u'lane': u'Cryo-EM_cluster', u'target': {u'lane': u'Cryo-EM_cluster', u'qdel_cmd_tpl': u'scancel {{ cluster_job_id }}', u'name': u'Cryo-EM_cluster', u'title': u'Cryo-EM_cluster', u'hostname': u'Cryo-EM_cluster', u'qstat_cmd_tpl': u'squeue -j {{ cluster_job_id }}', u'worker_bin_path': u'/net/cemaster/data/software/cryoSPARC/V2/cryosparc2_worker/bin/cryosparcw', u'qinfo_cmd_tpl': u'sinfo', u'qsub_cmd_tpl': u'sbatch {{ script_path_abs }}', u'cache_path': u'', u'cache_quota_mb': None, u'script_tpl': u'#!/usr/bin/env bash\n#SBATCH --partition=gpu\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --time=48:00:00\n#SBATCH --mem={{ (ram_gb)|int }}GB\n#SBATCH --exclusive\n#SBATCH --job-name cspark_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n', u'cache_reserve_mb': 10000, u'type': u'cluster', u'send_cmd_tpl': u'{{ command }}', u'desc': None}, u'license': True, u'hostname': u'Cryo-EM_cluster', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}, u'fixed': {u'SSD': False}, u'lane_type': u'cluster'}
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/plotutil.py:237: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat
[... 8 more identical "sending heartbeat" lines ...]
FSC No-Mask… ========= sending heartbeat
0.143 at 23.240 radwn. 0.5 at 12.843 radwn. Took 11.058s.
FSC Spherical Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 25.750 radwn. 0.5 at 17.589 radwn. Took 14.422s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 27.606 radwn. 0.5 at 20.322 radwn. Took 52.393s.
FSC Tight Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 32.488 radwn. 0.5 at 25.170 radwn. Took 39.099s.
FSC Noise Sub… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/sigproc.py:763: RuntimeWarning: invalid value encountered in divide
fsc_true = (fsc_t - fsc_n) / (1.0 - fsc_n)
0.143 at 32.406 radwn. 0.5 at 24.970 radwn. Took 89.591s.
========= sending heartbeat
[... ~40 more identical "sending heartbeat" lines ...]
FSC No-Mask… ========= sending heartbeat
0.143 at 32.053 radwn. 0.5 at 24.549 radwn. Took 11.347s.
FSC Spherical Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 35.585 radwn. 0.5 at 27.227 radwn. Took 15.227s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
```
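For reference, the script_tpl buried in the Allocated Resources dump above is hard to read with its \n escapes; unrolled, the SLURM submission template for this lane looks like this (the {{ }} fields are Jinja placeholders cryoSPARC fills in at submit time):

```bash
#!/usr/bin/env bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --time=48:00:00
#SBATCH --mem={{ (ram_gb)|int }}GB
#SBATCH --exclusive
#SBATCH --job-name cspark_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_dir_abs }}/output.txt
#SBATCH --error={{ job_dir_abs }}/error.txt

# Find GPUs with no running compute processes and expose only those
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

{{ run_cmd }}
```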

Hopefully this is helpful. One thing that stands out to me: the worker’s joblog shows heartbeats still being sent right up until the end, even though the master’s heartbeat check marked the job overdue and failed, so it looks like the heartbeats are not reaching command_core.
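In case it helps narrow this down, I could also test connectivity from a compute node back to the master. A minimal check (assuming curl and nc are available on the nodes; the hostname and port are taken from the joblog header above):

```
# From an interactive shell on a cluster compute node:
curl http://cemaster.cluster.caltech.edu:39002
# or just test that the port is open:
nc -zv cemaster.cluster.caltech.edu 39002
```

Please let me know if there is anything else I can post.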