@stephan
After the 2.20 update, and now running on a cluster, I am hitting heartbeat errors again during refinements. Ab initio jobs have been working fine, but homogeneous refinements with either C1 or D2 symmetry fail. I've tried `cryosparcm restart`, but that has not helped.
Output of `cryosparcm log command_core`:
` ---------- Scheduler running ---------------
Lane Cryo-EM_cluster cluster : Jobs Queued (nonpaused, inputs ready): [u'J44']
Now trying to schedule J44
Need slots : {u'GPU': 1, u'RAM': 3, u'CPU': 4}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on Cryo-EM_cluster
Launchable: True
Alloc slots : {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
---- Running project UID P7 job UID J44
Insecure mode - no SSL in license check
failed to connect link
License Data: {"token": "#####", "token_valid": true, "request_date": 1537553239, "license_valid": true}
License Signature: #####
Running job on worker type cluster
sbatch /ul/amalyuti/091818_Bgal_R11_190K/P7/J44/queue_sub_script.sh
['sbatch', '/ul/amalyuti/091818_Bgal_R11_190K/P7/J44/queue_sub_script.sh']
Submitted batch job 645
squeue -j 645
['squeue', '-j', '645']
Changed job P7.J44 status launched
---------- Scheduler done ------------------
Changed job P7.J44 status started
Changed job P7.J44 status running
------------- Heartbeat check -------------------
deadline: 2018-09-21 18:24:12.334057
Overdue jobs : [{u'_id': ObjectId('5ba5334615cc9c513b085a10'), u'uid': u'J44', u'project_uid': u'P7'}]
Marking job P7.J44 as failed
Changed job P7.J44 status failed
------------- Heartbeat check done -------------
compute_use_ssd boolean
Setting parameter J45.compute_use_ssd with value False of type <type 'bool'>
refine_symmetry string
Setting parameter J45.refine_symmetry with value D2 of type <type 'str'>
---------- Scheduler running ---------------
Lane Cryo-EM_cluster cluster : Jobs Queued (nonpaused, inputs ready): [u'J45']
Now trying to schedule J45
Need slots : {u'GPU': 1, u'RAM': 3, u'CPU': 4}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on Cryo-EM_cluster
Launchable: True
Alloc slots : {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
---- Running project UID P7 job UID J45
Insecure mode - no SSL in license check
failed to connect link
License Data: {"token": "#####", "token_valid": true, "request_date": 1537554626, "license_valid": true}
License Signature: #####
Running job on worker type cluster
sbatch /ul/amalyuti/091818_Bgal_R11_190K/P7/J45/queue_sub_script.sh
['sbatch', '/ul/amalyuti/091818_Bgal_R11_190K/P7/J45/queue_sub_script.sh']
Submitted batch job 646
squeue -j 646
['squeue', '-j', '646']
Changed job P7.J45 status launched
---------- Scheduler done ------------------
Changed job P7.J45 status started
Changed job P7.J45 status running
------------- Heartbeat check -------------------
deadline: 2018-09-21 18:47:30.372238
Overdue jobs : [{u'_id': ObjectId('5ba538b315cc9c513b0873e0'), u'uid': u'J45', u'project_uid': u'P7'}]
Marking job P7.J45 as failed
Changed job P7.J45 status failed `
As for `cryosparcm joblog P7 J45`:
================= CRYOSPARCW ======= 2018-09-21 11:31:17.833397 =========
Project P7 Job J45
Master cemaster.cluster.caltech.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 28413
========= monitor process now waiting for main process
MAIN PID 28413
refine.run cryosparc2_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job J45 of type homo_refine
Running job on hostname %s Cryo-EM_cluster
Allocated Resources : {u'lane': u'Cryo-EM_cluster', u'target': {u'lane': u'Cryo-EM_cluster', u'qdel_cmd_tpl': u'scancel {{ cluster_job_id }}', u'name': u'Cryo-EM_cluster', u'title': u'Cryo-EM_cluster', u'hostname': u'Cryo-EM_cluster', u'qstat_cmd_tpl': u'squeue -j {{ cluster_job_id }}', u'worker_bin_path': u'/net/cemaster/data/software/cryoSPARC/V2/cryosparc2_worker/bin/cryosparcw', u'qinfo_cmd_tpl': u'sinfo', u'qsub_cmd_tpl': u'sbatch {{ script_path_abs }}', u'cache_path': u'', u'cache_quota_mb': None, u'script_tpl': u'#!/usr/bin/env bash\n#SBATCH --partition=gpu\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --time=48:00:00\n#SBATCH --mem={{ (ram_gb)|int }}GB\n#SBATCH --exclusive\n#SBATCH --job-name cspark_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n', u'cache_reserve_mb': 10000, u'type': u'cluster', u'send_cmd_tpl': u'{{ command }}', u'desc': None}, u'license': True, u'hostname': u'Cryo-EM_cluster', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}, u'fixed': {u'SSD': False}, u'lane_type': u'cluster'}
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/plotutil.py:237: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
FSC No-Mask… ========= sending heartbeat
0.143 at 23.240 radwn. 0.5 at 12.843 radwn. Took 11.058s.
FSC Spherical Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 25.750 radwn. 0.5 at 17.589 radwn. Took 14.422s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 27.606 radwn. 0.5 at 20.322 radwn. Took 52.393s.
FSC Tight Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 32.488 radwn. 0.5 at 25.170 radwn. Took 39.099s.
FSC Noise Sub… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/sigproc.py:763: RuntimeWarning: invalid value encountered in divide
fsc_true = (fsc_t - fsc_n) / (1.0 - fsc_n)
0.143 at 32.406 radwn. 0.5 at 24.970 radwn. Took 89.591s.
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
FSC No-Mask… ========= sending heartbeat
0.143 at 32.053 radwn. 0.5 at 24.549 radwn. Took 11.347s.
FSC Spherical Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 35.585 radwn. 0.5 at 27.227 radwn. Took 15.227s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
Hopefully this will be helpful. Please let me know if there is anything else I can post.
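
In case it helps with reading the two logs side by side, here is a minimal Python sketch that lines up the J45 timestamps quoted above. It only does arithmetic on values copied from the logs. I am assuming the `request_date` field is a Unix epoch in seconds, and that the heartbeat deadline and the CRYOSPARCW banner are naive wall-clock times; I do not know which timezone each process actually logs in, so the differences below are only suggestive.

```python
from datetime import datetime, timezone

# Values copied verbatim from the J45 entries in the logs above.
request_date = 1537554626                    # "request_date" in the License Data line
deadline = "2018-09-21 18:47:30.372238"      # heartbeat check deadline (command_core log)
worker_start = "2018-09-21 11:31:17.833397"  # CRYOSPARCW banner (joblog)

fmt = "%Y-%m-%d %H:%M:%S.%f"

# Assumption: request_date is seconds since the Unix epoch.
request_utc = datetime.fromtimestamp(request_date, tz=timezone.utc)
request_naive = request_utc.replace(tzinfo=None)

# Assumption: both log timestamps are naive wall-clock times with unknown timezone.
deadline_dt = datetime.strptime(deadline, fmt)
worker_dt = datetime.strptime(worker_start, fmt)

print(request_utc.isoformat())      # 2018-09-21T18:30:26+00:00
print(deadline_dt - request_naive)  # 0:17:04.372238
print(request_naive - worker_dt)    # 6:59:08.166603
```

If those assumptions hold, the deadline sits about 17 minutes after the license check in UTC, while the worker banner is roughly 7 hours behind it. That would be consistent with the master logging in UTC and the worker node logging in local time (UTC-7), but I have not confirmed how the heartbeat check compares timestamps, so this may be unrelated to the failure.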