Homogeneous Refinement fails while computing cFSCs

Hi all!
I’m a sysadmin supporting users who run CryoSPARC on a heterogeneous compute cluster. I’m trying to help a user who wants to run a Homogeneous Refinement job on a 320 GB dataset (EMPIAR-12112: Cryo-EM structure of HflX bound to the Listeria monocytogenes 50S ribosomal subunit).

This job consistently fails with a heartbeat error. I don’t think the problem is the job itself, since other users have run Homogeneous Refinements on their own data in recent weeks without trouble. I’ve already tried increasing the heartbeat wait time to 10 minutes, but the kill signal still arrives. The output of the job’s metadata log is below. As for simple fixes: we have 10 TB of cache space, so I don’t think cache capacity is the issue, and the project data lives on another server that many groups use without problems, so I can probably rule that out as well. The cluster as a whole is managed through Slurm, but the job starts and lands on the node fine; it simply fails at a certain point, and it does so on both of our dedicated CryoSPARC nodes. I’d appreciate any guidance or troubleshooting advice!
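For reference, this is roughly how I raised the heartbeat timeout on the master, assuming CRYOSPARC_HEARTBEAT_SECONDS is still the variable that controls it in v4.6 (adjust the path to your install):

  # appended to cryosparc_master/config.sh on the master node
  export CRYOSPARC_HEARTBEAT_SECONDS=600   # 10 minutes instead of the stock value
  # restart so the new value takes effect
  cryosparcm restart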

Error log prior to crash:
gpufft: creating new cufft plan (plan id 7 pid 2188116)
gpu_id 0
ndims 3
dims 336 336 336
inembed 336 336 169
istride 1
idist 19079424
onembed 336 336 338
ostride 1
odist 38158848
batch 1
type C2R
wkspc manual
Python traceback:

<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw: line 153: 2188116 Killed python -c "import cryosparc_compute.run as run; run.run()" "$@"

And the event log:

[CPU: 18.39 GB]
Using full box size 650, downsampled box size 336, with low memory mode disabled.

[CPU: 18.39 GB]
Computing FFTs on GPU.

[CPU: 20.49 GB]
Done in 7.502s

[CPU: 20.49 GB]
Computing cFSCs…

**** Kill signal sent by CryoSPARC (ID: ) ****

Job is unresponsive - no heartbeat received in 600 seconds.

Thanks!!
-Alek

Welcome to the forum @aegliwa. Can you please post the outputs of these commands?

  • on the CryoSPARC master
    csprojectid=P99 # replace with actual project ID
    csjobid=J199 # replace with id of a job that should be running
    cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
    cryosparcm joblog $csprojectid $csjobid | tail -n 40
    cryosparcm eventlog $csprojectid $csjobid | tail -n 40
    cryosparcm cli "get_scheduler_targets()"
    
  • on the worker where the job failed
    uname -a
    free -h
    cat /sys/kernel/mm/transparent_hugepage/enabled 
    

Also, is the 10 TB cache local, NVMe-attached storage on the CryoSPARC worker?

Sure thing!

cryosparcm cli "get_job('P26', 'J40', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
{‘_id’: ‘67587de3e919fc5f991990bd’, ‘errors_run’: [{‘message’: ‘Job is unresponsive - no heartbeat received in 600 seconds.’, ‘warning’: False}], ‘instance_information’: {‘CUDA_version’: ‘11.8’, ‘available_memory’: ‘247.45GB’, ‘cpu_model’: ‘Intel(R) Xeon(R) Silver 4410Y’, ‘driver_version’: ‘12.3’, ‘gpu_info’: [{‘id’: 0, ‘mem’: 47810936832, ‘name’: ‘NVIDIA L40S’, ‘pcie’: ‘0000:3d:00’}], ‘ofd_hard_limit’: 131072, ‘ofd_soft_limit’: 1024, ‘physical_cores’: 24, ‘platform_architecture’: ‘x86_64’, ‘platform_node’: ‘biomix10’, ‘platform_release’: ‘5.15.0-105-generic’, ‘platform_version’: ‘#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024’, ‘total_memory’: ‘251.55GB’, ‘used_memory’: ‘2.14GB’}, ‘job_type’: ‘homo_refine_new’, ‘params_spec’: {}, ‘project_uid’: ‘P26’, ‘status’: ‘failed’, ‘uid’: ‘J40’, ‘version’: ‘v4.6.0’}

cryosparcm joblog P26 J40 | tail -n 40
========= sending heartbeat at 2024-12-11 16:13:57.634163
========= sending heartbeat at 2024-12-11 16:14:07.653089
========= sending heartbeat at 2024-12-11 16:14:17.671726
========= sending heartbeat at 2024-12-11 16:14:27.690158
========= sending heartbeat at 2024-12-11 16:14:37.708181
========= sending heartbeat at 2024-12-11 16:14:47.725482
========= sending heartbeat at 2024-12-11 16:14:57.914164
========= sending heartbeat at 2024-12-11 16:15:07.926856
gpufft: creating new cufft plan (plan id 6 pid 2188116)
gpu_id 0
ndims 3
dims 650 650 650
inembed 650 650 652
istride 1
idist 275470000
onembed 650 650 326
ostride 1
odist 137735000
batch 1
type R2C
wkspc manual
Python traceback:

gpufft: creating new cufft plan (plan id 7 pid 2188116)
gpu_id 0
ndims 3
dims 336 336 336
inembed 336 336 169
istride 1
idist 19079424
onembed 336 336 338
ostride 1
odist 38158848
batch 1
type C2R
wkspc manual
Python traceback:

<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw: line 153: 2188116 Killed python -c "import cryosparc_compute.run as run; run.run()" "$@"

cryosparcm eventlog P26 J40 | tail -n 40
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Particles will be zeropadded/truncated to size 650 during alignment
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Volume refinement will be done with effective box size 650
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Volume refinement will be done with pixel size 0.8200
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Particles will be zeropadded/truncated to size 650 during backprojection
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Particles will be backprojected with box size 650
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Volume will be internally cropped and stored with box size 650
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Volume will be interpolated with box size 650 (zeropadding factor 1.00)
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] DC components of images will be ignored and volume will be floated at each iteration.
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Spherical windowing of maps is enabled
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Refining with C1 symmetry enforced
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Resetting input per-particle scale factors to 1.0
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] Starting at initial resolution 30.000A (radwn 17.767).
[Wed, 11 Dec 2024 16:11:18 GMT] [CPU RAM used: 653 MB] ====== Masking ======
[Wed, 11 Dec 2024 16:11:29 GMT] [CPU RAM used: 5106 MB] No mask input was connected, so dynamic masking will be enabled.
[Wed, 11 Dec 2024 16:11:29 GMT] [CPU RAM used: 5106 MB] Dynamic mask threshold: 0.2000
[Wed, 11 Dec 2024 16:11:29 GMT] [CPU RAM used: 5106 MB] Dynamic mask near (A): 6.00
[Wed, 11 Dec 2024 16:11:29 GMT] [CPU RAM used: 5106 MB] Dynamic mask far (A): 14.00
[Wed, 11 Dec 2024 16:11:29 GMT] [CPU RAM used: 5106 MB] ====== Initial Model ======
[Wed, 11 Dec 2024 16:11:29 GMT] [CPU RAM used: 5106 MB] Resampling initial model to specified volume representation size and pixel-size…
[Wed, 11 Dec 2024 16:11:40 GMT] [CPU RAM used: 8189 MB] Estimating scale of initial reference.
[Wed, 11 Dec 2024 16:11:51 GMT] [CPU RAM used: 8391 MB] Rescaling initial reference by a factor of 1.049
[Wed, 11 Dec 2024 16:11:58 GMT] [CPU RAM used: 8420 MB] Estimating scale of initial reference.
[Wed, 11 Dec 2024 16:12:06 GMT] [CPU RAM used: 8418 MB] Rescaling initial reference by a factor of 1.007
[Wed, 11 Dec 2024 16:12:14 GMT] [CPU RAM used: 8423 MB] Estimating scale of initial reference.
[Wed, 11 Dec 2024 16:12:23 GMT] [CPU RAM used: 8424 MB] Rescaling initial reference by a factor of 1.000
[Wed, 11 Dec 2024 16:12:31 GMT] Initial Real Space Slices
[Wed, 11 Dec 2024 16:12:33 GMT] Initial Fourier Space Slices
[Wed, 11 Dec 2024 16:12:33 GMT] [CPU RAM used: 8590 MB] ====== Starting Refinement Iterations ======
[Wed, 11 Dec 2024 16:12:33 GMT] [CPU RAM used: 8590 MB] ----------------------------- Start Iteration 0
[Wed, 11 Dec 2024 16:12:33 GMT] [CPU RAM used: 8590 MB] Using Max Alignment Radius 17.767 (30.000A)
[Wed, 11 Dec 2024 16:12:33 GMT] [CPU RAM used: 8590 MB] Auto batchsize: 12100 in each split
[Wed, 11 Dec 2024 16:12:50 GMT] [CPU RAM used: 12919 MB] -- THR 1 BATCH 500 NUM 6000 TOTAL 5.8994584 ELAPSED 117.29434 --
[Wed, 11 Dec 2024 16:14:51 GMT] [CPU RAM used: 16167 MB] Processed 24200.000 images in 121.875s.
[Wed, 11 Dec 2024 16:15:07 GMT] [CPU RAM used: 18394 MB] Computing FSCs…
[Wed, 11 Dec 2024 16:15:07 GMT] [CPU RAM used: 18394 MB] Using full box size 650, downsampled box size 336, with low memory mode disabled.
[Wed, 11 Dec 2024 16:15:07 GMT] [CPU RAM used: 18394 MB] Computing FFTs on GPU.
[Wed, 11 Dec 2024 16:15:15 GMT] [CPU RAM used: 20493 MB] Done in 7.502s
[Wed, 11 Dec 2024 16:15:15 GMT] [CPU RAM used: 20493 MB] Computing cFSCs…
[Wed, 11 Dec 2024 16:25:08 GMT] **** Kill signal sent by CryoSPARC (ID: ) ****
[Wed, 11 Dec 2024 16:25:08 GMT] Job is unresponsive - no heartbeat received in 600 seconds.

cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: ‘/csparc’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 15687286784, ‘name’: ‘Tesla T4’}], ‘hostname’: ‘biomix43’, ‘lane’: ‘biomix43’, ‘monitor_port’: None, ‘name’: ‘biomix43’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], ‘GPU’: [0], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘csparc@biomix43’, ‘title’: ‘Worker node biomix43’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/csparc’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 47810936832, ‘name’: ‘NVIDIA L40S’}], ‘hostname’: ‘biomix10’, ‘lane’: ‘biomix10’, ‘monitor_port’: None, ‘name’: ‘biomix10’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], ‘GPU’: [0], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘csparc@biomix10’, ‘title’: ‘Worker node biomix10’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/csparc’, ‘cache_quota_mb’: 9000000, ‘cache_reserve_mb’: 10000, ‘custom_var_names’: [‘ram_gb_multiplier’], ‘custom_vars’: {}, ‘desc’: None, ‘hostname’: ‘biomix’, ‘lane’: ‘biomix’, ‘name’: ‘biomix’, ‘qdel_cmd_tpl’: ‘scancel {{ cluster_job_id }}’, ‘qinfo_cmd_tpl’: ‘sinfo’, ‘qstat_cmd_tpl’: ‘squeue -j {{ cluster_job_id }}’, ‘qstat_code_cmd_tpl’: None, ‘qsub_cmd_tpl’: ‘sbatch {{ script_path_abs }}’, ‘script_tpl’: ‘#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed.\n## Note: The code will use this many GPUs starting from dev id 0.\n## The cluster scheduler has the responsibility\n## of setting CUDA_VISIBLE_DEVICES or otherwise enuring that the\n## job uses the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --partition=cryosparc\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --output={{ job_dir_abs }}/slurm.out\n#SBATCH --error={{ job_dir_abs }}/slurm.err\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘biomix’, ‘tpl_vars’: [‘worker_bin_path’, ‘num_gpu’, 
‘cluster_job_id’, ‘job_creator’, ‘ram_gb’, ‘num_cpu’, ‘command’, ‘run_cmd’, ‘job_uid’, ‘cryosparc_username’, ‘ram_gb_multiplier’, ‘job_dir_abs’, ‘run_args’, ‘job_log_path_abs’, ‘project_uid’, ‘project_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw’}]

Since we have run the job on both workers, I’ve included the output from both.

Worker 1 (biomix10)

uname -a
Linux biomix10 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
free -h
total used free shared buff/cache available
Mem: 251Gi 5.1Gi 82Gi 1.3Gi 163Gi 243Gi
Swap: 8.0Gi 21Mi 8.0Gi
cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

Worker 2 (biomix43)

uname -a
Linux biomix43 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
free -h
total used free shared buff/cache available
Mem: 251Gi 15Gi 37Gi 2.6Gi 197Gi 230Gi
Swap: 8.0Gi 16Mi 8.0Gi
cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

The 10 TB cache lives on the cluster head node, since that is where job submission happens. From there we export the cache space to the two worker nodes over NFS, with a direct 10 Gb link between the head node and the workers. The cache path is the same on both workers. This has worked previously, although the cache is not strictly local to each worker node.
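For illustration, the cache mount on each worker looks roughly like the fstab entry below; the export path and mount options here are placeholders rather than our exact configuration:

  # /etc/fstab on each worker (illustrative only)
  head-node:/export/csparc   /csparc   nfs   rw,hard,rsize=1048576,wsize=1048576   0 0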

Thanks for the response and warm welcome!

The job may be affected by an issue that has been fixed in v4.6.2. The first recommendation would be to upgrade CryoSPARC.
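If it helps, the usual procedure is to run the update on the master; workers connected as nodes are normally updated in the same step (sketch only; please check the guide for your exact version first):

  # on the CryoSPARC master
  cryosparcm update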

This setup may be suboptimal for particle caching. Did you observe a speedup for cached jobs versus similar jobs run with caching disabled? If the cache storage is not significantly more performant than the storage holding the CryoSPARC project directories, enabling caching may even degrade performance compared to running with caching disabled.
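A crude way to compare the two filesystems is a large sequential write/read test; the paths below are examples, and dd throughput is only a rough proxy for the streaming I/O that particle caching performs:

  # sequential write then read on the NFS-backed cache (example path)
  dd if=/dev/zero of=/csparc/dd_test.bin bs=1M count=10240 oflag=direct
  dd if=/csparc/dd_test.bin of=/dev/null bs=1M iflag=direct
  rm /csparc/dd_test.bin
  # repeat with a file under the project directory storage and compare the reported rates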

I think it was exactly the SSD caching config. The job was created before I upgraded; I did upgrade, and that alone changed nothing, but once I switched off the SSD caching setting the job completed. Thank you so much for your advice and knowledge! I love that this community is so supportive - this is my first time posting here, but I’ve been lurking for over a year. Cheers!
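For anyone who finds this thread later: the per-job fix was simply turning off the "Cache particle images on SSD" parameter when building the refinement job. If we ever want to drop caching for a worker entirely, I believe it can be reconnected without SSD along these lines (our install path; placeholder master hostname; please double-check the flags against the cryosparcw documentation):

  /usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker biomix10 --master <master-hostname> --port 39000 --update --nossd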