Patch Motion Correction frequently fails

Hi everyone,

I am running the Patch Motion Correction job (v4.5.3). It frequently and suddenly fails, and the web interface stops loading. A lock file is generated in the tmp folder. Yesterday it failed while processing 6,000 images; today it failed while separately processing 120, 40, and 20 images. The error message is pasted below:
“**** Kill signal sent by CryoSPARC (ID: ) ****
Job is unresponsive - no heartbeat received in 180 seconds.”

What could be causing this? Additionally, how can I completely remove everything related to CryoSPARC from our workstation to reinstall it?

Thanks a lot!

What is the name of that lock file?

Please can you post the outputs of these commands (on the CryoSPARC master):

projectid=P99 # replace with actual project ID
jobid=J199 # replace with actual job ID
free -h
sudo journalctl | grep -i oom
cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20

Thank you very much for your reply.

  1. The lock file’s name is:

cryosparc-supervisor-86986eb42f2c678a4816f644b3866e26.sock

  2. jobid=J74
    ~/cryosparc_user/cryosparc/cryosparc_master$ free -h
                   total        used        free      shared  buff/cache   available
    Mem:           125Gi       7.8Gi       6.5Gi       473Mi       110Gi       115Gi
    Swap:            9Gi        13Mi         9Gi

Sep 18 13:26:30 systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
Sep 18 13:26:30 systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Sep 18 11:08:16 systemd-oomd[1619]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-xxx.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 91.93% > 50.00% for > 20s with reclaim activity
Sep 18 11:08:16 systemd[2459]: vte-spawn-xxx.scope: systemd-oomd killed 36 process(es) in this unit.

It seems that this is a memory issue. How can I fix it without upgrading the workstation or reducing the number of GPUs I use?

Many thanks!

Update: if I use only one GPU (a 4090), the job works, but if I use two GPUs, it fails immediately!

Thanks!

Do the job failures occur around the time indicated in the output of the command

sudo journalctl | grep -i oom

?
If they do, you may want to look at Refused connection when cryosparc is running - #25 by wtempel.
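
For example, to check whether an OOM event coincided with a particular job failure, you could narrow the journal to the window around the failure time shown in the job’s event log (a minimal sketch; the timestamps below are placeholders, substitute the actual failure window):

sudo journalctl --since "2024-09-18 11:00" --until "2024-09-18 11:30" | grep -i oom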

Thanks, I have pasted the relevant portion of that output below:
Sep 18 13:26:30 systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
Sep 18 13:26:30 systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Sep 18 11:08:16 systemd-oomd[1619]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-xxx.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 91.93% > 50.00% for > 20s with reclaim activity
Sep 18 11:08:16 systemd[2459]: vte-spawn-xxx.scope: systemd-oomd killed 36 process(es) in this unit.

Thanks!

Were any CryoSPARC jobs running (and failing) around that time?

Yes, several similar error messages occurred. I only pasted the most recent one.

Unfortunately, I do not have a suggestion that meets all of the requirements you mentioned (no workstation upgrade and no reduction in the number of GPUs used).

Could you please give me other suggestions, even if they don’t meet those requirements?

Many thanks!

I’m having a similar issue with Patch Motion Correction, with error message “Job is unresponsive - no heartbeat received in 180 seconds.”

Here is the result of

cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20
[{'cache_path': '/mnt/Scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': '10.250.102.2', 'lane': 'Amphitrite', 'monitor_port': None, 'name': '10.250.102.2', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparcuser@10.250.102.2', 'title': 'Worker node 10.250.102.2', 'type': 'node', 'worker_bin_path': '/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
[Tue, 24 Sep 2024 00:44:30 GMT]  License is valid.
[Tue, 24 Sep 2024 00:44:30 GMT]  Launching job on lane Amphitrite target 10.250.102.2 ...
[Tue, 24 Sep 2024 00:44:30 GMT]  Running job on remote worker node hostname 10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Job J203 Started
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Master running v4.5.3, worker running v4.5.3
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Working in directory: /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J203
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Running on lane Amphitrite
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Resources allocated:
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   Worker:  10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   CPU   :  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   GPU   :  [0, 1, 2, 3]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   RAM   :  [0, 1, 2, 3, 4, 5, 6, 7]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   SSD   :  False
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] --------------------------------------------------------------
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Importing job module for job type patch_motion_correction_multi...
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job ready to run
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] ***************************************************************
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will process this many movies:  971
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will output denoiser training data for this many movies:  200
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Random seed: 690708223
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] parent process is 57387
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57453
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57456
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57454
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57455
[Tue, 24 Sep 2024 00:44:39 GMT] [CPU RAM used: 294 MB] -- 2.0: processing 4 of 971: J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
        loading /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
        Loading raw movie data from J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc ...
        Done in 15.53s
        Processing ...
        Done in 17.77s
        Completed rigid and patch motion with (Z:5,Y:8,X:8) knots
        Writing non-dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned.mrc ...
        Done in 0.18s
        Writing 120x120 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@1x.png ...
        Done in 0.01s
        Writing 240x240 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@2x.png ...
        Done in 0.01s
        Writing dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.15s
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
========= sending heartbeat at 2024-09-24 11:55:54.559373
========= sending heartbeat at 2024-09-24 11:56:04.579781
========= sending heartbeat at 2024-09-24 11:56:14.600115
========= sending heartbeat at 2024-09-24 11:56:24.620202
========= sending heartbeat at 2024-09-24 11:56:34.640542
========= sending heartbeat at 2024-09-24 11:56:44.662026
========= sending heartbeat at 2024-09-24 11:56:54.682136
========= sending heartbeat at 2024-09-24 11:57:04.701806
========= sending heartbeat at 2024-09-24 11:57:14.720815
========= sending heartbeat at 2024-09-24 11:57:24.741521
========= sending heartbeat at 2024-09-24 11:57:34.761958
========= sending heartbeat at 2024-09-24 11:57:44.782397
========= sending heartbeat at 2024-09-24 11:57:54.803461
========= sending heartbeat at 2024-09-24 11:58:04.824165
========= sending heartbeat at 2024-09-24 11:58:14.844469
========= sending heartbeat at 2024-09-24 11:58:24.864399
========= sending heartbeat at 2024-09-24 11:58:34.885077
========= sending heartbeat at 2024-09-24 11:58:44.906332
========= sending heartbeat at 2024-09-24 11:58:54.926720
/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 57387 Terminated              python -c "import cryosparc_compute.run as run; run.run()" "$@"

There were no “Out of memory” errors when the job failed (from sudo journalctl | grep -i oom).
When we rerun the job, it sometimes fails (at a different point) and sometimes finishes.

Any idea on how we could troubleshoot what’s causing the heartbeat to be lost?

@Andre Thanks for posting this information.
Please can you post the output of the commands

  • on the CryoSPARC master (replacing P99 with the actual project ID)
    cryosparcm cli "get_job('P99', 'J203', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
    
  • on the CryoSPARC worker computer where job J203 failed
    hostname -f
    cat /sys/kernel/mm/transparent_hugepage/enabled 
    cat /proc/cmdline
    

@wtempel Thanks for following up on this. Here is the result for the CryoSPARC master

[cryosparcuser@amphitrite ~]$ cryosparcm cli "get_job('P105', 'J203', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
{'PID_main': 57387, 'PID_monitor': 57393, '_id': '66f20b502b5694c187be0742', 'cloned_from': None, 'failed_at': 'Tue, 24 Sep 2024 02:02:01 GMT', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '243.45GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz', 'driver_version': '12.4', 'gpu_info': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:18:00'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:3b:00'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:86:00'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:af:00'}], 'ofd_hard_limit': 4096, 'ofd_soft_limit': 1024, 'physical_cores': 40, 'platform_architecture': 'x86_64', 'platform_node': 'amphitrite.research.sydney.edu.au', 'platform_release': '3.10.0-1160.53.1.el7.x86_64', 'platform_version': '#1 SMP Fri Jan 14 13:59:45 UTC 2022', 'total_memory': '251.35GB', 'used_memory': '6.94GB'}, 'job_type': 'patch_motion_correction_multi', 'killed_at': 'Tue, 24 Sep 2024 02:02:00 GMT', 'params_spec': {'compute_num_gpus': {'value': 4}}, 'project_uid': 'P105', 'started_at': 'Tue, 24 Sep 2024 00:44:33 GMT', 'status': 'failed', 'uid': 'J203', 'version': 'v4.5.3'}

And here is the result for the worker

[cryosparcuser@amphitrite ~]$ hostname -f
amphitrite.research.sydney.edu.au
[cryosparcuser@amphitrite ~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[cryosparcuser@amphitrite ~]$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1160.53.1.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb modprobe.blacklist=nouveau LANG=en_AU.UTF-8 nouveau.modeset=0 rd.driver.blacklist=nouveau

Please can you test if running the commands (details)

sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"

before running the job resolves the issue?
These commands change the transparent_hugepage settings only temporarily; the change is lost when the system reboots.
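
If disabling transparent_hugepage turns out to resolve the issue and you want the setting to persist across reboots, one common approach is a small systemd oneshot unit (a sketch only, not an official CryoSPARC recommendation; the unit name and file path are arbitrary choices, so adapt them to your distribution):

# /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable transparent hugepages
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl daemon-reload && sudo systemctl enable --now disable-thp.service.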

@wtempel Do I run this on the worker or the master?

Also note that this issue is intermittent. When I rerun the job, it can succeed even without this change, so a single successful run would not prove that the change fixed the issue.

For the test, transparent_hugepage would need to be disabled on the worker.

Please let us know when you

  1. encounter the issue again and
  2. confirm that the command (on the applicable worker)
    cat /sys/kernel/mm/transparent_hugepage/enabled
    
    shows
    always madvise [never]
    
    when the issue occurs, that is, that transparent_hugepage has in fact remained disabled on the relevant worker and has not been re-enabled (either automatically while the system is running or during system startup). A simple way to record this is sketched below.
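
A minimal sketch for recording the THP state on the worker while a job runs, so you can check what the setting was at the time of a failure (the log file path is an arbitrary choice):

while true; do
    echo "$(date '+%F %T') $(cat /sys/kernel/mm/transparent_hugepage/enabled)" >> ~/thp_state.log
    sleep 60
done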

We are also facing the above issue, with no OOM errors. Some jobs succeed with the same settings, and others do not.

THP is enabled. We are still testing how disabling THP would affect other jobs (not CryoSPARC jobs) on our cluster and can’t disable it system-wide yet.

Is disabling THP the only solution? I’ve seen disabling THP being suggested for multiple issues on the forum. Could you provide a list of CryoSPARC issues and errors that can occur when THP is not disabled?

We are also running some test jobs on v4.7.1 to mitigate this, but those are specifically hitting the CUDA_ERROR_INVALID_VALUE error, for which forum solutions also suggest disabling THP. Issue here: cuMemHostAlloc & Out-of-Memory Errors

Thanks!

Details:

[guodongxie@beagle3-login3 ~]$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '$SCRATCH_TMPDIR', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {'ram_gb_multiplier': '6'}, 'desc': None, 'hostname': 'beagle3', 'lane': 'beagle3', 'name': 'beagle3', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=beagle3\n#SBATCH --constraint=a40\n#SBATCH --account=pi-eozkan\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n#SBATCH --time=7:00:00\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --exclude=beagle3-0028\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'beagle3', 'tpl_vars': ['ram_gb', 'project_uid', 'num_gpu', 'ram_gb_multiplier', 'job_log_path_abs', 'command', 'job_uid', 'cluster_job_id', 'num_cpu', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/software/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '$SCRATCH_TMPDIR', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {'ram_gb_multiplier': '6'}, 'desc': None, 'hostname': 'beagle3-exclude-0028', 'lane': 'beagle3-exclude-0028', 'name': 'beagle3-exclude-0028', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=beagle3\n#SBATCH --constraint=a40\n#SBATCH --account=pi-eozkan\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n#SBATCH --time=7:00:00\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --exclude=beagle3-0028\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'beagle3-exclude-0028', 'tpl_vars': ['ram_gb', 'project_uid', 'num_gpu', 'ram_gb_multiplier', 'job_log_path_abs', 'command', 'job_uid', 'cluster_job_id', 'num_cpu', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/software/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '$SCRATCH_TMPDIR', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'beagle3-testing', 'lane': 'beagle3-testing', 'name': 'beagle3-testing', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=beagle3\n#SBATCH 
--account=pi-eozkan\n#SBATCH --output={{ job_dir_abs }}/slurm-%j.out\n#SBATCH --error={{ job_dir_abs }}/slurm-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n#SBATCH --time=1:00:00\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --exclude=beagle3-0028\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'beagle3-testing', 'tpl_vars': ['job_dir_abs', 'ram_gb', 'project_uid', 'num_gpu', 'ram_gb_multiplier', 'command', 'job_uid', 'cluster_job_id', 'num_cpu', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/software/cryosparc_worker/bin/cryosparcw'}]
[guodongxie@beagle3-login3 ~]$ cryosparcm eventlog $projectid $jobid | head -n 40

[Sun, 03 Aug 2025 23:27:08 GMT]  License is valid.
[Sun, 03 Aug 2025 23:27:08 GMT]  Launching job on lane beagle3-exclude-0028 target beagle3-exclude-0028 ...
[Sun, 03 Aug 2025 23:27:08 GMT]  Launching job on cluster beagle3-exclude-0028
[Sun, 03 Aug 2025 23:27:08 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J842
#SBATCH --partition=beagle3
#SBATCH --constraint=a40
#SBATCH --account=pi-eozkan
#SBATCH --output=/beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/job.log
#SBATCH --error=/beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/job.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=7:00:00
#SBATCH --mem=96G
#SBATCH --exclude=beagle3-0028
export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"
srun /software/cryosparc_worker/bin/cryosparcw run --project P1 --job J842 --master_hostname beagle3-login3.rcc.local --master_command_core_port 39322 > /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/job.log 2>&1 
==========================================================================
==========================================================================
[Sun, 03 Aug 2025 23:27:08 GMT]  -------- Submission command: 
sbatch /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/queue_sub_script.sh
[Sun, 03 Aug 2025 23:27:08 GMT]  -------- Cluster Job ID: 
34160065
[Sun, 03 Aug 2025 23:27:08 GMT]  -------- Queued on cluster at 2025-08-03 18:27:08.341118
[Sun, 03 Aug 2025 23:27:10 GMT]  -------- Cluster job status at 2025-08-03 18:27:20.191197 (1 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          34160065   beagle3 cryospar guodongx  R       0:10      1 beagle3-0040
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Job J842 Started
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Master running v4.6.0, worker running v4.6.0
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Working in directory: /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Running on lane beagle3-exclude-0028
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Resources allocated:
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB]   Worker:  beagle3-exclude-0028
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB]   CPU   :  [0, 1, 2, 3, 4, 5]
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
[guodongxie@beagle3-login3 ~]$ cryosparcm joblog $projectid $jobid | tail -n 20
min: -907.740127 max: 905.336167
min: -1660.243857 max: 1653.756021
========= sending heartbeat at 2025-08-04 01:25:12.556001
========= sending heartbeat at 2025-08-04 01:25:22.571260
========= sending heartbeat at 2025-08-04 01:25:32.586608
========= sending heartbeat at 2025-08-04 01:25:42.601258
========= sending heartbeat at 2025-08-04 01:25:52.616594
========= sending heartbeat at 2025-08-04 01:26:02.631943
========= sending heartbeat at 2025-08-04 01:26:12.647263
========= sending heartbeat at 2025-08-04 01:26:22.662765
========= sending heartbeat at 2025-08-04 01:26:32.679181
========= sending heartbeat at 2025-08-04 01:26:42.733255
========= sending heartbeat at 2025-08-04 01:26:52.748490
========= sending heartbeat at 2025-08-04 01:27:02.763974
========= sending heartbeat at 2025-08-04 01:27:12.779370
========= sending heartbeat at 2025-08-04 01:27:22.794979
========= sending heartbeat at 2025-08-04 01:27:32.810261
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 34160065.0 ON beagle3-0040 CANCELLED AT 2025-08-04T01:27:39 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 34160065 ON beagle3-0040 CANCELLED AT 2025-08-04T01:27:39 DUE TO TIME LIMIT ***
[guodongxie@beagle3-login3 ~]$ cryosparcm cli "get_job('P1', 'J842', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
{'PID_main': 2665864, 'PID_monitor': 2665866, '_id': '688e2d083306c0035abd1c4b', 'cloned_from': None, 'failed_at': 'Mon, 04 Aug 2025 06:30:35 GMT', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '225.11GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 47622258688, 'name': 'NVIDIA A40', 'pcie': '0000:ca:00'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 32, 'platform_architecture': 'x86_64', 'platform_node': 'beagle3-0040.rcc.local', 'platform_release': '4.18.0-305.3.1.el8.x86_64', 'platform_version': '#1 SMP Tue Jun 1 16:14:33 UTC 2021', 'total_memory': '251.43GB', 'used_memory': '11.31GB'}, 'job_type': 'patch_motion_correction_multi', 'killed_at': 'Mon, 04 Aug 2025 06:30:33 GMT', 'params_spec': {}, 'project_uid': 'P1', 'started_at': 'Sun, 03 Aug 2025 23:27:21 GMT', 'status': 'completed', 'uid': 'J842', 'version': 'v4.6.0'}

With recent CryoSPARC versions, such as v4.7.1, one may use the madvise setting for transparent_hugepage instead of never.
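
For example, on the worker (a sketch following the same pattern as the commands posted earlier in this thread; like the never setting, it does not persist across a reboot):

sudo sh -c "echo madvise > /sys/kernel/mm/transparent_hugepage/enabled"
sudo sh -c "echo madvise > /sys/kernel/mm/transparent_hugepage/defrag"

madvise is also an accepted value for the defrag file on current kernels; if your kernel rejects it there, the enabled setting is the relevant one for this change.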