Patch Motion Correction jobs frequently fail

We are also facing the issue described above, with no OOM errors. Some jobs succeed with the same settings, while others do not.

Transparent Huge Pages (THP) are enabled on our cluster. We are still testing how disabling THP would affect other (non-CryoSPARC) jobs, so we can't disable it system-wide yet.

Is disabling THP the only solution? I've seen it suggested for multiple issues on the forum. Could you provide a list of the CryoSPARC issues and errors that can occur when THP is enabled?
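For reference, this is roughly how we have been checking the current THP setting on the compute nodes before deciding anything. A minimal sketch: it assumes the standard Linux sysfs paths and reuses the partition/constraint/account from our submission script; the active mode is the value shown in brackets.

# Check the THP mode on a beagle3 A40 node (active setting appears in [brackets])
srun --partition=beagle3 --constraint=a40 --account=pi-eozkan --time=00:05:00 \
  cat /sys/kernel/mm/transparent_hugepage/enabled /sys/kernel/mm/transparent_hugepage/defrag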

We are also running some test jobs on v4.7.1 to mitigate this, but those are specifically hitting the CUDA_ERROR_INVALID_VALUE error, for which forum solutions also suggest disabling THP. Related thread: cuMemHostAlloc & Out-of-Memory Errors
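One workaround we are considering while THP stays enabled, based on my reading of other threads about cuMemHostAlloc failures, is turning off page-locked host memory in the worker config. Please correct me if this is not the right knob; sketch below.

# Assumption: CRYOSPARC_NO_PAGELOCK is the relevant workaround for cuMemHostAlloc errors
# (would be added to cryosparc_worker/config.sh on the worker installation)
export CRYOSPARC_NO_PAGELOCK=true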

Thanks!

Details:

[guodongxie@beagle3-login3 ~]$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '$SCRATCH_TMPDIR', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {'ram_gb_multiplier': '6'}, 'desc': None, 'hostname': 'beagle3', 'lane': 'beagle3', 'name': 'beagle3', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=beagle3\n#SBATCH --constraint=a40\n#SBATCH --account=pi-eozkan\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n#SBATCH --time=7:00:00\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --exclude=beagle3-0028\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'beagle3', 'tpl_vars': ['ram_gb', 'project_uid', 'num_gpu', 'ram_gb_multiplier', 'job_log_path_abs', 'command', 'job_uid', 'cluster_job_id', 'num_cpu', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/software/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '$SCRATCH_TMPDIR', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {'ram_gb_multiplier': '6'}, 'desc': None, 'hostname': 'beagle3-exclude-0028', 'lane': 'beagle3-exclude-0028', 'name': 'beagle3-exclude-0028', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=beagle3\n#SBATCH --constraint=a40\n#SBATCH --account=pi-eozkan\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n#SBATCH --time=7:00:00\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --exclude=beagle3-0028\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'beagle3-exclude-0028', 'tpl_vars': ['ram_gb', 'project_uid', 'num_gpu', 'ram_gb_multiplier', 'job_log_path_abs', 'command', 'job_uid', 'cluster_job_id', 'num_cpu', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/software/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '$SCRATCH_TMPDIR', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'beagle3-testing', 'lane': 'beagle3-testing', 'name': 'beagle3-testing', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=beagle3\n#SBATCH --account=pi-eozkan\n#SBATCH --output={{ job_dir_abs }}/slurm-%j.out\n#SBATCH --error={{ job_dir_abs }}/slurm-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n#SBATCH --time=1:00:00\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n#SBATCH --exclude=beagle3-0028\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'beagle3-testing', 'tpl_vars': ['job_dir_abs', 'ram_gb', 'project_uid', 'num_gpu', 'ram_gb_multiplier', 'command', 'job_uid', 'cluster_job_id', 'num_cpu', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/software/cryosparc_worker/bin/cryosparcw'}]
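As a sanity check on the script template, the --mem expression resolves to what was actually submitted for this job. A quick sketch of the arithmetic, assuming CryoSPARC requested ram_gb=16 for this single-GPU patch motion job (which is what the 96G in the submitted script below implies with our multiplier of 6):

# ram_gb (assumed 16 for this job) x ram_gb_multiplier (6) -> the --mem value in the submitted script
ram_gb=16; ram_gb_multiplier=6
echo "--mem=$(( ram_gb * ram_gb_multiplier ))G"   # prints --mem=96G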
[guodongxie@beagle3-login3 ~]$ cryosparcm eventlog $projectid $jobid | head -n 40

[Sun, 03 Aug 2025 23:27:08 GMT]  License is valid.
[Sun, 03 Aug 2025 23:27:08 GMT]  Launching job on lane beagle3-exclude-0028 target beagle3-exclude-0028 ...
[Sun, 03 Aug 2025 23:27:08 GMT]  Launching job on cluster beagle3-exclude-0028
[Sun, 03 Aug 2025 23:27:08 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J842
#SBATCH --partition=beagle3
#SBATCH --constraint=a40
#SBATCH --account=pi-eozkan
#SBATCH --output=/beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/job.log
#SBATCH --error=/beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/job.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=7:00:00
#SBATCH --mem=96G
#SBATCH --exclude=beagle3-0028
export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"
srun /software/cryosparc_worker/bin/cryosparcw run --project P1 --job J842 --master_hostname beagle3-login3.rcc.local --master_command_core_port 39322 > /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/job.log 2>&1 
==========================================================================
==========================================================================
[Sun, 03 Aug 2025 23:27:08 GMT]  -------- Submission command: 
sbatch /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842/queue_sub_script.sh
[Sun, 03 Aug 2025 23:27:08 GMT]  -------- Cluster Job ID: 
34160065
[Sun, 03 Aug 2025 23:27:08 GMT]  -------- Queued on cluster at 2025-08-03 18:27:08.341118
[Sun, 03 Aug 2025 23:27:10 GMT]  -------- Cluster job status at 2025-08-03 18:27:20.191197 (1 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          34160065   beagle3 cryospar guodongx  R       0:10      1 beagle3-0040
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Job J842 Started
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Master running v4.6.0, worker running v4.6.0
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Working in directory: /beagle3/eozkan/cryosparc_guodongxie/projects/NCAM1RIG3_FOM/CS-ncam1rig3ecd/J842
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Running on lane beagle3-exclude-0028
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB] Resources allocated:
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB]   Worker:  beagle3-exclude-0028
[Sun, 03 Aug 2025 23:27:21 GMT] [CPU RAM used: 83 MB]   CPU   :  [0, 1, 2, 3, 4, 5]
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
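The BrokenPipeError above appears to come from head -n 40 closing the pipe after 40 lines rather than from the job itself; to capture the complete event log cleanly we redirect it to a file instead, e.g.:

# Dump the full event log for this job to a file (same variables as above)
cryosparcm eventlog $projectid $jobid > J842_eventlog.txt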
[guodongxie@beagle3-login3 ~]$ cryosparcm joblog $projectid $jobid | tail -n 20
min: -907.740127 max: 905.336167
min: -1660.243857 max: 1653.756021
========= sending heartbeat at 2025-08-04 01:25:12.556001
========= sending heartbeat at 2025-08-04 01:25:22.571260
========= sending heartbeat at 2025-08-04 01:25:32.586608
========= sending heartbeat at 2025-08-04 01:25:42.601258
========= sending heartbeat at 2025-08-04 01:25:52.616594
========= sending heartbeat at 2025-08-04 01:26:02.631943
========= sending heartbeat at 2025-08-04 01:26:12.647263
========= sending heartbeat at 2025-08-04 01:26:22.662765
========= sending heartbeat at 2025-08-04 01:26:32.679181
========= sending heartbeat at 2025-08-04 01:26:42.733255
========= sending heartbeat at 2025-08-04 01:26:52.748490
========= sending heartbeat at 2025-08-04 01:27:02.763974
========= sending heartbeat at 2025-08-04 01:27:12.779370
========= sending heartbeat at 2025-08-04 01:27:22.794979
========= sending heartbeat at 2025-08-04 01:27:32.810261
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 34160065.0 ON beagle3-0040 CANCELLED AT 2025-08-04T01:27:39 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 34160065 ON beagle3-0040 CANCELLED AT 2025-08-04T01:27:39 DUE TO TIME LIMIT ***
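The joblog shows the Slurm job was cancelled at the 7-hour --time limit rather than crashing on its own, which may also explain why only some jobs with identical settings fail. We confirm this from Slurm accounting with something like the sketch below (the field list assumes a standard sacct setup on our cluster); if TIMEOUT is confirmed, raising --time in script_tpl would be the immediate workaround, independent of the THP question.

# Confirm state, elapsed time, time limit, and memory usage for the cancelled job
sacct -j 34160065 --format=JobID,JobName%20,State,Elapsed,Timelimit,MaxRSS,ReqMem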
[guodongxie@beagle3-login3 ~]$ cryosparcm cli "get_job('P1', 'J842', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
{'PID_main': 2665864, 'PID_monitor': 2665866, '_id': '688e2d083306c0035abd1c4b', 'cloned_from': None, 'failed_at': 'Mon, 04 Aug 2025 06:30:35 GMT', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '225.11GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 47622258688, 'name': 'NVIDIA A40', 'pcie': '0000:ca:00'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 32, 'platform_architecture': 'x86_64', 'platform_node': 'beagle3-0040.rcc.local', 'platform_release': '4.18.0-305.3.1.el8.x86_64', 'platform_version': '#1 SMP Tue Jun 1 16:14:33 UTC 2021', 'total_memory': '251.43GB', 'used_memory': '11.31GB'}, 'job_type': 'patch_motion_correction_multi', 'killed_at': 'Mon, 04 Aug 2025 06:30:33 GMT', 'params_spec': {}, 'project_uid': 'P1', 'started_at': 'Sun, 03 Aug 2025 23:27:21 GMT', 'status': 'completed', 'uid': 'J842', 'version': 'v4.6.0'}