No heartbeat error in v4.4

I updated to v4.4 and queued 4 jobs at the same time on one compute node. After a short while, all 4 jobs failed with Job is unresponsive - no heartbeat received in 180 seconds.

I found this topic: Error in Extensive Validation after v4.4 update

I think it is possible that, if I queue too many jobs at the same time, my compute node does not have enough CPU resources to cache the particle images of all of them on the SSD. The particle caching speed is also much slower than before; could that be because earlier cryoSPARC versions used the master node to copy particles?

Actually I have never seen this “no heartbeat” error before, and I don’t know how to avoid it next time. Otherwise I can only cache the particle images of one job at a time, and the particle copying speed is very low. Can I change some setting so that the master node copies the particles, as before?

This is the job.log of one of the jobs; I haven’t found anything wrong in it.

================= CRYOSPARCW =======  2024-01-21 14:51:11.941524  =========
Project P27 Job J190
Master shipmhpc Port 45102
===========================================================================
========= monitor process now starting main process at 2024-01-21 14:51:11.941630
MAINPROCESS PID 30379
MAIN PID 30379
refine.newrun cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat at 2024-01-21 14:51:40.430095
========= sending heartbeat at 2024-01-21 14:51:50.451699
========= sending heartbeat at 2024-01-21 14:52:00.472041
========= sending heartbeat at 2024-01-21 14:52:10.490722
========= sending heartbeat at 2024-01-21 14:52:20.511816
========= sending heartbeat at 2024-01-21 14:52:30.534083
========= sending heartbeat at 2024-01-21 14:52:40.563027
========= sending heartbeat at 2024-01-21 14:52:50.595073
========= sending heartbeat at 2024-01-21 14:53:00.617058
========= sending heartbeat at 2024-01-21 14:53:10.668047
========= sending heartbeat at 2024-01-21 14:53:20.687060
========= sending heartbeat at 2024-01-21 14:53:30.706043
========= sending heartbeat at 2024-01-21 14:53:40.725044
========= sending heartbeat at 2024-01-21 14:53:50.743768
========= sending heartbeat at 2024-01-21 14:54:00.772058
========= sending heartbeat at 2024-01-21 14:54:10.791069
========= sending heartbeat at 2024-01-21 14:54:20.809078
========= sending heartbeat at 2024-01-21 14:54:30.881125
========= sending heartbeat at 2024-01-21 14:54:40.907203
========= sending heartbeat at 2024-01-21 14:54:50.941703
========= sending heartbeat at 2024-01-21 14:55:00.960638
========= sending heartbeat at 2024-01-21 14:55:10.979061
========= sending heartbeat at 2024-01-21 14:55:21.011525
========= sending heartbeat at 2024-01-21 14:55:31.030068
========= sending heartbeat at 2024-01-21 14:55:41.046815
========= sending heartbeat at 2024-01-21 14:55:51.079071
========= sending heartbeat at 2024-01-21 14:56:01.100299
========= sending heartbeat at 2024-01-21 14:56:11.123205
========= sending heartbeat at 2024-01-21 14:56:21.135097
========= sending heartbeat at 2024-01-21 14:56:31.154300
========= sending heartbeat at 2024-01-21 14:56:41.187273
========= sending heartbeat at 2024-01-21 14:56:51.206761
========= sending heartbeat at 2024-01-21 14:57:01.236498
========= sending heartbeat at 2024-01-21 14:57:11.259107
========= sending heartbeat at 2024-01-21 14:57:21.286041
========= sending heartbeat at 2024-01-21 14:57:31.315060
========= sending heartbeat at 2024-01-21 14:57:41.333041
========= sending heartbeat at 2024-01-21 14:57:51.358912
========= sending heartbeat at 2024-01-21 14:58:01.372088
========= sending heartbeat at 2024-01-21 14:58:11.390045
========= sending heartbeat at 2024-01-21 14:58:21.422627
========= sending heartbeat at 2024-01-21 14:58:31.441835
========= sending heartbeat at 2024-01-21 14:58:41.469061
========= sending heartbeat at 2024-01-21 14:58:51.500052
========= sending heartbeat at 2024-01-21 14:59:01.525825
========= sending heartbeat at 2024-01-21 14:59:11.548080
========= sending heartbeat at 2024-01-21 14:59:22.040048
========= sending heartbeat at 2024-01-21 14:59:32.180931
========= sending heartbeat at 2024-01-21 14:59:42.207139

Thanks a lot.

Heartbeat failures may have various causes, some of which depend on the type of CryoSPARC instance (such as a single workstation, or separate master and worker nodes).
An overloaded worker node may fail to send a heartbeat.
An overloaded master node may fail to process heartbeats it receives.
An unstable network may disrupt the transmission of a heartbeat.
Does the Event log of job P27 J190 reveal the time of the job failure?
What is the output of the command

cryosparcm cli "get_job('P27', 'J190', 'instance_information', 'killed_at')"

?
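If it helps, the end of the job log can also be displayed on the master node with, for example:

cryosparcm joblog P27 J190 | tail -n 50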

Thanks for your reply.

cryosparcm cli "get_job('P27', 'J190', 'instance_information', 'killed_at')"

The result is:

{'_id': '65a293fe9f852ade8f73f792', 'instance_information': {}, 'killed_at': None, 'project_uid': 'P27', 'uid': 'J190'}

These days I have found that the problem is particle caching. While caching particles, the speed is very low, which often leads to the Job is unresponsive - no heartbeat received in 180 seconds. error. Sometimes my compute node is even drained and must be recovered by our cluster manager.

Our cluster manager said the Cache particles on SSD program might have a bug, because if I use cp to copy the particle.star file to the SSD temp directory, there is no problem with the copying speed and the compute node is not drained.
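For comparison with the transfer rate reported in the job log, a plain copy can be timed with something like this (a rough sketch; both paths below are placeholders):

SRC=/work/path/to/one_particle_stack.mrcs    # placeholder: one particle stack in the project
DST=/gpu_temp/cp_speed_test.mrcs             # placeholder: scratch file on the SSD cache mount
time cp "$SRC" "$DST"                        # MB/s is roughly file size / elapsed seconds
rm -f "$DST"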

My cryoSPARC version is v4.4.1; has any patch been released recently?

@luisshulk

Please can you

  • post the output of the command
    cryosparcm cli "get_scheduler_targets()"
    
  • post the output of the command
    grep -v LICENSE /path/to/cryosparc_worker/config.sh
    
    on the compute node, substituting the actual path to the cryosparc_worker/ directory
  • post error messages from the Event Log and job log (under Metadata|Log) of the Cache particles on SSD job
  • request from your cluster manager the reason and specific resource limitation that caused the node to be DRAINED
  • Here is the output of
    cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/gpu_temp', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cy', 'lane': 'cy', 'name': 'cy', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p cy\n###SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/run.out\n#SBATCH -e {{ job_dir_abs }}/run.err\n#module load cuda80/toolkit/8.0.61\n#module load cuda80/fft/8.0.61\n#module load cuda10.1\n\n#echo "PATH is ${PATH}"\n#echo "LD_LIBRARY_PATH is ${LD_LIBRARY_PATH}"\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'cy', 'tpl_vars': ['run_args', 'cluster_job_id', 'worker_bin_path', 'job_log_path_abs', 'ram_gb', 'project_dir_abs', 'num_gpu', 'job_creator', 'job_uid', 'cryosparc_username', 'command', 'job_dir_abs', 'run_cmd', 'num_cpu', 'project_uid'], 'type': 'cluster', 'worker_bin_path': '/cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw'}]

  • And this one:
grep -v LICENSE /path/to/cryosparc_worker/config.sh
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_IMPROVED_SSD_CACHE=true
export CRYOSPARC_CACHE_NUM_THREADS=1

  • The Event Log of the failed NU-refinement job:

> [2024-01-21 1:47:11.95] License is valid.
> [2024-01-21 1:47:11.95] Launching job on lane cy target cy ...
> [2024-01-21 1:47:12.03] Launching job on cluster cy
> [2024-01-21 1:47:12.04] ====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw run --project P27 --job J190
--master_hostname shipmhpc --master_command_core_port 45102 > /work/caolab/yu.cao/CS-ly-ribo/J190/job.log
2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 24.0 - the amount of RAM needed in GB
## /work/caolab/yu.cao/CS-ly-ribo/J190 - absolute path to the job directory
## /work/caolab/yu.cao/CS-ly-ribo - absolute path to the project dir
## /work/caolab/yu.cao/CS-ly-ribo/J190/job.log - absolute path to the log file for the job
## /cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw - absolute path to the cryosparc
worker command
## --project P27 --job J190 --master_hostname shipmhpc --master_command_core_port 45102 -
arguments to be passed to cryosparcw run
## P27 - uid of the project
## J190 - uid of the job
## yu.cao - name of the user that created the job (may contain spaces)
## yu.cao@shsmu.edu.cn - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P27_J190
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p cy
###SBATCH --mem=24000MB
#SBATCH -o /work/caolab/yu.cao/CS-ly-ribo/J190/run.out
#SBATCH -e /work/caolab/yu.cao/CS-ly-ribo/J190/run.err
#module load cuda80/toolkit/8.0.61
#module load cuda80/fft/8.0.61
#module load cuda10.1
#echo "PATH is ${PATH}"
#echo "LD_LIBRARY_PATH is ${LD_LIBRARY_PATH}"
available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z "$available_devs" ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw run --project P27 --job J190
--master_hostname shipmhpc --master_command_core_port 45102 > /work/caolab/yu.cao/CS-ly-ribo/J190/job.log
2>&1
==========================================================================
==========================================================================
> [2024-01-21 1:47:12.05] -------- Submission command:
sbatch /work/caolab/yu.cao/CS-ly-ribo/J190/queue_sub_script.sh
> [2024-01-21 1:47:12.09] -------- Cluster Job ID:
68
> [2024-01-21 1:47:12.09] -------- Queued on cluster at 2024-01-21 14:47:12.099717
> [2024-01-21 1:47:12.69] -------- Cluster job status at 2024-01-21 14:51:17.090480 (24 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
68 cy cryospar cylab R 4:05 1 gpu21
> [2024-01-21 1:51:19.55] [CPU: 180.8 MB] Job J190 Started
> [2024-01-21 1:51:19.61] [CPU: 180.8 MB] Master running v4.4.1, worker running v4.4.1
> [2024-01-21 1:51:19.64] [CPU: 180.8 MB] Working in directory: /work/caolab/yu.cao/CS-ly-ribo/J190
> [2024-01-21 1:51:19.64] [CPU: 180.8 MB] Running on lane cy
> [2024-01-21 1:51:19.64] [CPU: 180.8 MB] Resources allocated:
> [2024-01-21 1:51:19.65] [CPU: 180.8 MB] Worker: cy
> [2024-01-21 1:51:19.65] [CPU: 180.8 MB] CPU : [0, 1, 2, 3]
> [2024-01-21 1:51:19.65] [CPU: 180.8 MB] GPU : [0]
> [2024-01-21 1:51:19.66] [CPU: 180.8 MB] RAM : [0, 1, 2]
> [2024-01-21 1:51:19.67] [CPU: 180.8 MB] SSD : True
> [2024-01-21 1:51:19.67] [CPU: 180.8 MB] --------------------------------------------------------------
> [2024-01-21 1:51:19.68] [CPU: 180.8 MB] Importing job module for job type nonuniform_refine_new...
> [2024-01-21 1:51:26.60] [CPU: 257.7 MB] Job ready to run
> [2024-01-21 1:51:26.60] [CPU: 257.7 MB] ***************************************************************
> [2024-01-21 1:51:48.47] [CPU: 1.10 GB] Using random seed of None
> [2024-01-21 1:51:48.53] [CPU: 1.14 GB] Loading a ParticleStack with 304331 items...
> [2024-01-21 1:51:49.00] [CPU: 1.14 GB] SSD cache : cache successfully synced in_use
> [2024-01-21 1:51:50.88] [CPU: 1.14 GB] SSD cache : cache successfully synced, found 115,654.03 MB of files on SSD.
> [2024-01-21 1:51:55.94] [CPU: 1.14 GB] SSD cache : cache successfully requested to check 4311 files.
> [2024-01-21 1:58:54.71] [CPU: 1.14 GB] SSD cache : cache requires 399,116 MB more on the SSD for files to be downloaded.
> [2024-01-21 1:58:56.12] [CPU: 1.14 GB] SSD cache : cache has enough available space.
> [2024-01-21 1:58:56.12] [CPU: 1.14 GB] Needed | 399,116.15 MB
Available | 1,242,260.08 MB
Disk size | 1,525,438.13 MB
Usable space | 1,515,438.13 MB (reserve 10,000 MB)
> [2024-01-21 1:58:56.13] [CPU: 1.14 GB] Transferring across 2 threads:
000187370811886476155_FoilHole_18239858_Data_18234987_18234989_20230818_163132_fractions
_shiny.mrcs (44/4311)
Progress | 4,029 MB (1.01%)
Total | 399,116 MB
Average speed | 52.57 MB/s
ETA | 2h 5m 14s
> [2024-01-21 2:09:28.33] **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
> [2024-01-21 2:09:45.07] Job is unresponsive - no heartbeat received in 180 seconds.
  • And the job log of the failed Cache particles on SSD job:
================= CRYOSPARCW =======  2024-01-26 00:30:46.070320  =========
Project P1 Job J2641
Master shipmhpc Port 45102
===========================================================================
========= monitor process now starting main process at 2024-01-26 00:30:46.070412
MAINPROCESS PID 26332
========= monitor process now waiting for main process
MAIN PID 26332
utilities.run_cache_particles cryosparc_compute.jobs.jobregister
***************************************************************
2024-01-26 00:30:56,833 run_with_executor    INFO     | Resolving 7256 source path(s) for caching
========= sending heartbeat at 2024-01-26 00:31:02.792498
========= sending heartbeat at 2024-01-26 00:31:12.803054
========= sending heartbeat at 2024-01-26 00:31:22.823665
2024-01-26 00:31:26,695 run_with_executor    INFO     | Resolved sources in 29.86 seconds
2024-01-26 00:31:27,576 cleanup_junk_files   INFO     | Removed 5821 invalid item(s) in the cache
2024-01-26 00:31:28,306 run_with_executor    INFO     | Cache allocation ran in 1.51 seconds
2024-01-26 00:31:28,306 run_with_executor    INFO     | Found 1436 SSD hit(s)
2024-01-26 00:31:28,306 run_with_executor    INFO     | Transferring 5820 file(s)...
2024-01-26 00:31:29,635 run_with_executor    INFO     | Transferred /work/caolab/yu.cao/LuYi/20230519_WO4/Extract/job131/rawdata/FoilHole_19236997_Data_19228125_19228127_20230519_175211_fractions.mrcs to SSD key f282b498b3db47f1e6c5010e1cfd5cb6b9c1f54a...
2024-01-26 00:31:30,869 run_with_executor    INFO     | Transferred /work/caolab/yu.cao/LuYi/20230519_WO4/Extract/job131/rawdata/FoilHole_19236583_Data_19228125_19228127_20230519_172109_fractions.mrcs to SSD key ade405bf4ba9a727e3f6e2e2b641f16c6cff3dfd...
2024-01-26 00:31:32,393 run_with_executor    INFO     | Transferred /work/caolab/yu.cao/LuYi/20230519_WO4/Extract/job131/rawdata/FoilHole_19257811_Data_19228125_19228127_20230521_012018_fractions.mrcs to SSD key 6b0b31c292ea1a546c4537d15b102d66c41b8507...
========= sending heartbeat at 2024-01-26 00:31:32.843094

... (similar log entries omitted) ...

2024-01-26 01:43:12,944 run_with_executor    INFO     | Transferred /work/caolab/yu.cao/LuYi/20230519_WO4/Extract/job131/rawdata/FoilHole_19244211_Data_19228125_19228127_20230520_101656_fractions.mrcs to SSD key 6f3a0d694576bc80359eb3a2eb3d08ff942f4436...
========= sending heartbeat at 2024-01-26 01:43:13.838008
2024-01-26 01:43:14,727 run_with_executor    INFO     | Transferred /work/caolab/yu.cao/LuYi/20230519_WO4/Extract/job131/rawdata/FoilHole_19248551_Data_19228125_19228127_20230520_162940_fractions.mrcs to SSD key 2c6ffe715a0456b1445e8dc148da57c5ef774d2c...
========= sending heartbeat at 2024-01-26 01:43:23.858818
========= sending heartbeat at 2024-01-26 01:43:33.878336
========= sending heartbeat at 2024-01-26 01:43:43.897752
========= sending heartbeat at 2024-01-26 01:43:53.919049
========= sending heartbeat at 2024-01-26 01:44:03.938244
========= sending heartbeat at 2024-01-26 01:44:13.957391
========= sending heartbeat at 2024-01-26 01:44:23.972807
========= sending heartbeat at 2024-01-26 01:44:33.982990
========= sending heartbeat at 2024-01-26 01:44:43.993090
========= sending heartbeat at 2024-01-26 01:44:54.013351
========= sending heartbeat at 2024-01-26 01:45:04.033164
========= sending heartbeat at 2024-01-26 01:45:14.052505
========= sending heartbeat at 2024-01-26 01:45:24.072957
========= sending heartbeat at 2024-01-26 01:45:34.092117
========= sending heartbeat at 2024-01-26 01:45:44.107228
========= sending heartbeat at 2024-01-26 01:45:54.117364
========= sending heartbeat at 2024-01-26 01:46:04.127541
========= sending heartbeat at 2024-01-26 01:46:14.137664
========= sending heartbeat at 2024-01-26 01:46:24.147798
========= sending heartbeat at 2024-01-26 01:46:34.157989
========= sending heartbeat at 2024-01-26 01:46:44.168093
========= sending heartbeat at 2024-01-26 01:46:54.188376
========= sending heartbeat at 2024-01-26 01:47:04.207433
========= sending heartbeat at 2024-01-26 01:47:14.226565
========= sending heartbeat at 2024-01-26 01:47:24.246281
========= sending heartbeat at 2024-01-26 01:47:34.266657
========= sending heartbeat at 2024-01-26 01:47:44.287137
========= sending heartbeat at 2024-01-26 01:47:54.305387
========= sending heartbeat at 2024-01-26 01:48:04.317681
========= sending heartbeat at 2024-01-26 01:48:14.330223
========= sending heartbeat at 2024-01-26 01:48:24.342826
========= sending heartbeat at 2024-01-26 01:48:34.355248
========= sending heartbeat at 2024-01-26 01:48:44.367856
========= sending heartbeat at 2024-01-26 01:48:54.380376
========= sending heartbeat at 2024-01-26 01:49:04.393067
========= sending heartbeat at 2024-01-26 01:49:14.405523


Thanks for posting this information @luisshulk.
May I ask a few follow-up questions.

  1. Was the CRYOSPARC_CACHE_NUM_THREADS parameter inside cryosparc_worker/config.sh set to 1 after or before January 21?
  2. Please can you post the Event Log of job P1 J2641.
  3. What type of storage provides /gpu_temp: Is this a local device on each GPU node, or a shared filesystem?
  4. How do you ensure that a job that requires cached particles will be sent to the/a node to which particles were previously copied with a Cache particles on SSD job?
  1. Was the CRYOSPARC_CACHE_NUM_THREADS parameter inside cryosparc_worker/config.sh set to 1 after or before January 21?

Yes. I had tried 12 before; whenever I want to change the CRYOSPARC_CACHE_NUM_THREADS parameter, I change it in both the cryosparc_worker/config.sh and cryosparc_master/config.sh files.
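For reference, applying the change looks roughly like this (a sketch; the worker path is taken from the scheduler target above, and the restart step is an assumption for master-side changes):

# edit CRYOSPARC_CACHE_NUM_THREADS in the worker config (read when each job launches)
vi /cm/shared/apps/cryosparc/cylab/cryosparc_worker/config.sh
# after also editing cryosparc_master/config.sh, restart the master
cryosparcm restart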

  2. Please can you post the Event Log of job P1 J2641.

Here is the Event Log of P1 J2641.

> [2024-01-25 11:27:38.97] License is valid.
> [2024-01-25 11:27:38.97] Launching job on lane cy target cy ...
> [2024-01-25 11:27:39.02] Launching job on cluster cy
> [2024-01-25 11:27:39.04] ====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw run --project P1 --job J2641
--master_hostname shipmhpc --master_command_core_port 45102 > /work/caolab/yu.cao/P16/J2641/job.log 2>&1
- the complete command string to run the job
## 1 - the number of CPUs needed
## 0 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 0.0 - the amount of RAM needed in GB
## /work/caolab/yu.cao/P16/J2641 - absolute path to the job directory
## /work/caolab/yu.cao/P16 - absolute path to the project dir
## /work/caolab/yu.cao/P16/J2641/job.log - absolute path to the log file for the job
## /cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw - absolute path to the cryosparc
worker command
## --project P1 --job J2641 --master_hostname shipmhpc --master_command_core_port 45102 -
arguments to be passed to cryosparcw run
## P1 - uid of the project
## J2641 - uid of the job
## yu.cao - name of the user that created the job (may contain spaces)
## yu.cao@shsmu.edu.cn - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P1_J2641
#SBATCH -n 1
#SBATCH --gres=gpu:0
#SBATCH -p cy
###SBATCH --mem=0MB
#SBATCH -o /work/caolab/yu.cao/P16/J2641/run.out
#SBATCH -e /work/caolab/yu.cao/P16/J2641/run.err
#module load cuda80/toolkit/8.0.61
#module load cuda80/fft/8.0.61
#module load cuda10.1
#echo "PATH is ${PATH}"
#echo "LD_LIBRARY_PATH is ${LD_LIBRARY_PATH}"
available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z "$available_devs" ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
/cm/shared/apps/cryosparc/cylab/cryosparc_worker/bin/cryosparcw run --project P1 --job J2641
--master_hostname shipmhpc --master_command_core_port 45102 > /work/caolab/yu.cao/P16/J2641/job.log 2>&1
==========================================================================
==========================================================================
> [2024-01-25 11:27:39.05] -------- Submission command:
sbatch /work/caolab/yu.cao/P16/J2641/queue_sub_script.sh
> [2024-01-25 11:27:39.10] -------- Cluster Job ID:
203
> [2024-01-25 11:27:39.10] -------- Queued on cluster at 2024-01-26 00:27:39.104285
> [2024-01-25 11:27:39.16] -------- Cluster job status at 2024-01-26 00:30:41.317062 (18 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
203 cy cryospar cylab R 3:02 1 gpu21
> [2024-01-25 11:30:52.77] [CPU: 178.8 MB] Job J2641 Started
> [2024-01-25 11:30:52.78] [CPU: 178.8 MB] Master running v4.4.1, worker running v4.4.1
> [2024-01-25 11:30:52.80] [CPU: 178.9 MB] Working in directory: /work/caolab/yu.cao/P16/J2641
> [2024-01-25 11:30:52.80] [CPU: 178.9 MB] Running on lane cy
> [2024-01-25 11:30:52.80] [CPU: 178.9 MB] Resources allocated:
> [2024-01-25 11:30:52.81] [CPU: 178.9 MB] Worker: cy
> [2024-01-25 11:30:52.81] [CPU: 178.9 MB] CPU : [0]
> [2024-01-25 11:30:52.82] [CPU: 178.9 MB] GPU : []
> [2024-01-25 11:30:52.82] [CPU: 178.9 MB] RAM : []
> [2024-01-25 11:30:52.82] [CPU: 178.9 MB] SSD : True
> [2024-01-25 11:30:52.83] [CPU: 178.9 MB] --------------------------------------------------------------
> [2024-01-25 11:30:52.83] [CPU: 178.9 MB] Importing job module for job type cache_particles...
> [2024-01-25 11:30:53.42] [CPU: 181.1 MB] Job ready to run
> [2024-01-25 11:30:53.42] [CPU: 181.1 MB] ***************************************************************
> [2024-01-25 11:30:56.76] [CPU: 249.0 MB] Loading a ParticleStack with 341456 items...
> [2024-01-25 11:30:56.81] [CPU: 249.0 MB] ──────────────────────────────────────────────────────────────
SSD cache ACTIVE at /gpu_temp/instance_shipmhpc:45101 (10 GB reserve)
┌──────────────────────┬───────────────────────────┐
│ Cache usage          │ Amount                    │
├──────────────────────┼───────────────────────────┤
│ Total / Usable       │ 1.45 TiB / 1.45 TiB       │
│ Used / Free          │ 701.06 GiB / 779.26 GiB   │
│ Hits / Misses        │ 74.36 GiB / 299.48 GiB    │
│ Acquired / Required  │ 373.85 GiB / 0.00 B       │
└──────────────────────┴───────────────────────────┘
Progress: [█████████████████████████████████-----------------] 4826/7256 (67%)
Transferred: FoilHole_19248551_Data_19228125_19228127_20230520_162940_fractions.mrcs
(39.06 MiB)
Threads: 1
Avg speed: 41.45 MiB/s
Remaining: 0h 51m 31s (125.16 GiB)
Elapsed: 1h 12m 17s
Active jobs: P1-J2641
> [2024-01-25 12:52:32.45] **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
> [2024-01-25 12:52:32.71] Job is unresponsive - no heartbeat received in 180 seconds.

  3. What type of storage provides /gpu_temp: Is this a local device on each GPU node, or a shared filesystem?
[XXXXX@gpu20 ~]$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 447.1G  0 disk
├─sda1   8:1    0    16G  0 part
└─sda2   8:2    0 431.1G  0 part
sdb      8:16   0 894.3G  0 disk
└─sdb1   8:17   0 894.3G  0 part /gpu_temp

It is the same on the other node.
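For completeness, the filesystem type and whether the device is an SSD can also be confirmed with, for example:

df -hT /gpu_temp                        # filesystem type, size and free space
cat /sys/block/sdb/queue/rotational     # 0 indicates a non-rotational (SSD) device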

  4. How do you ensure that a job that requires cached particles will be sent to the/a node to which particles were previously copied with a Cache particles on SSD job?

When I submit a job to the cluster, each compute node has its own SSD; the required particles are copied to /gpu_temp on that SSD, and cryoSPARC then uses these particles to run the job.

Were additional CryoSPARC or non-CryoSPARC jobs running on the same compute node at the time that J190 failed?

No, no other jobs were running on the same compute node. But I have also tried queuing different cryoSPARC jobs on the same compute node before, and the jobs were killed with the same error.

Before v4.4, we used v4.1 and earlier versions of cryoSPARC, but never encountered this problem.

I also show my compute node's status message here, where the failed job kill led to the node being drained:

My cluster manager thought this might be an issue with the new cryoSPARC version. Whenever a job drains the node, I have to ask the manager to resume it, which causes trouble for our lab members and the cluster manager.

One of the changes between v4.1 and v4.4 is that CryoSPARC now attempts to send a termination signal, likely via scancel on your cluster, to jobs that fail to register a heartbeat. You could

  • allow more time before CryoSPARC initiates a termination attempt by significantly increasing the CRYOSPARC_HEARTBEAT_SECONDS value from its default (see the CryoSPARC guide and the sketch after this list).
  • observe what is happening on the compute node when heartbeats fail to be sent, for example: is the node swapping?
  • test the effect of disabling transparent_hugepage on the compute node(s) (how?); see the sketch after this list. This re-configuration would require sysadmin privileges on the node(s).
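A minimal sketch of the first and third points (the timeout value is only an example, and I am assuming CRYOSPARC_HEARTBEAT_SECONDS is set in cryosparc_master/config.sh; the hugepage change needs root on the node):

# in cryosparc_master/config.sh, then restart the master:
export CRYOSPARC_HEARTBEAT_SECONDS=600     # default is 180 seconds; 600 is illustrative
cryosparcm restart

# on the compute node, as root (effective immediately, until the next reboot):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled    # should now show [never]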

@luisshulk Are you and the cluster admin still observing that nodes are drained due to
Reason=Kill task failed? There are several discussions of this topic.
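For reference, the drain reason can usually be checked on the SLURM side with, for example:

sinfo -R                                      # lists drained/down nodes with their Reason
scontrol show node gpu21 | grep -i reason     # gpu21 is the node seen in the job status above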

The motivation behind this seemingly annoying node drain is explained here.

There are also debugging and resolution suggestions.

This observation makes me wonder whether adding the scancel --full (-f) option to your cluster target configuration might help, that is, changing

"qdel_cmd_tpl": "scancel {{ cluster_job_id }}"

to

"qdel_cmd_tpl": "scancel -f {{ cluster_job_id }}"

I have not tested this.
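If you decide to try it, one way to apply the change to the cy lane is the usual dump-and-reconnect workflow (a sketch, again untested here):

# on the master node, in a scratch directory:
cryosparcm cluster dump cy          # writes cluster_info.json and cluster_script.sh
# edit cluster_info.json so that it reads:
#   "qdel_cmd_tpl": "scancel -f {{ cluster_job_id }}"
cryosparcm cluster connect          # re-registers the lane from the files in this directory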