Reference-based motion correction out of memory on cluster

Hi,
One thing that may be special about my case is that I installed CryoSPARC on a computer cluster, so the worker nodes vary, but I see the same issue. I requested 128 GB, but the job errored out whenever RAM usage reached roughly 50 GB, and the log showed an "out of memory" alert.

@CleoShen Please can you post additional information:

  1. job type
  2. (if applicable) box size and number of particles
  3. output of the command (if the output is very long, see the note after this list)
    cryosparcm cli "get_scheduler_targets()"
    
  4. log lines with the out-of-memory error and preceding lines. Is it possible the error occurred at a step where the RAM used did not yet reflect a large amount of data that were about to be loaded?
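
If the full output is too long to paste conveniently, you can redirect it to a file and attach that instead, for example:

    cryosparcm cli "get_scheduler_targets()" > scheduler_targets.txt  # example output filename; adjust as needed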

Hi wtempel,
Since the output for question 3 is very long, I deleted part of it.

  1. Reference-Based Motion Correction

  2. Box size=448 pix; 53,390 particles

  3. `[{‘cache_path’: ‘…’: 10000, ‘custom_var_names’: , ‘desc’: None, ‘hostname’: ‘brunger_all’, ‘lane’: ‘brunger_all’, ‘name’: ‘brunger_all’, ‘qdel_cmd_tpl’: ‘scancel {{ cluster_job_id }}’, ‘qinfo_cmd_tpl’: ‘sinfo’, ‘qstat_cmd_tpl’: ‘squeue -j {{ cluster_job_id }}’, ‘qsub_cmd_tpl’: ‘sbatch {{ script_path_abs }}’, ‘script_tpl’: ‘#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p brunger\n#SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=120:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=“”\nfor devidx in (seq 0 15);\ndo\n if [[ -z (nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z “$available_devs” ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘ccw0820’, ‘tpl_vars’: [‘cryosparc_username’, ‘job_log_path_abs’, ‘cluster_job_id’, ‘run_cmd’, ‘command’, ‘worker_bin_path’, ‘run_args’, ‘num_gpu’, ‘project_uid’, ‘job_uid’, ‘job_creator’, ‘num_cpu’, ‘ram_gb’, ‘project_dir_abs’, ‘job_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'desc': None, 'hostname': 'owners_24h', 'lane': 'owners_24h', 'name': 'owners_24h', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p owners\n#SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=24:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in (seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z “$available_devs” ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘ccw0820’, ‘tpl_vars’: [‘cryosparc_username’, ‘job_log_path_abs’, ‘cluster_job_id’, ‘run_cmd’, ‘command’, ‘worker_bin_path’, ‘run_args’, ‘num_gpu’, ‘project_uid’, ‘job_uid’, ‘job_creator’, ‘num_cpu’, ‘ram_gb’, ‘project_dir_abs’, ‘job_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'desc': None, 'hostname': 'owners56G_24h', 'lane': 'owners56G_24h', 'name': 'owners56G_24h', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p owners\n##SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH --mem=56G\n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=24:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in (seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z “$available_devs” ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘ccw0820’, ‘tpl_vars’: [‘cryosparc_username’, ‘job_log_path_abs’, ‘cluster_job_id’, ‘run_cmd’, ‘command’, ‘worker_bin_path’, ‘run_args’, ‘num_gpu’, ‘project_uid’, ‘job_uid’, ‘job_creator’, ‘num_cpu’, ‘ram_gb’, ‘project_dir_abs’, ‘job_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'desc': None, 'hostname': 'owners_224G_48h', 'lane': 'owners_224G_48h', 'name': 'owners_224G_48h', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p owners\n##SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH --mem=224G\n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=48:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in (seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z “$available_devs” ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘ccw0820’, ‘tpl_vars’: [‘cryosparc_username’, ‘job_log_path_abs’, ‘cluster_job_id’, ‘run_cmd’, ‘command’, ‘worker_bin_path’, ‘run_args’, ‘num_gpu’, ‘project_uid’, ‘job_uid’, ‘job_creator’, ‘num_cpu’, ‘ram_gb’, ‘project_dir_abs’, ‘job_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'desc': None, 'hostname': 'gpu56G_48h', 'lane': 'gpu56G_48h', 'name': 'gpu56G_48h', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpu\n##SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH --mem=56G\n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=48:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in (seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z “$available_devs” ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘ccw0820’, ‘tpl_vars’: [‘cryosparc_username’, ‘job_log_path_abs’, ‘cluster_job_id’, ‘run_cmd’, ‘command’, ‘worker_bin_path’, ‘run_args’, ‘num_gpu’, ‘project_uid’, ‘job_uid’, ‘job_creator’, ‘num_cpu’, ‘ram_gb’, ‘project_dir_abs’, ‘job_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'desc': None, 'hostname': 'brunger_gpu', 'lane': 'brunger_gpu', 'name': 'brunger_gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p brunger \n##SBATCH --nodelist=sh02-13n07\n#SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=120:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in (seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z “$available_devs” ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n’, ‘send_cmd_tpl’: ‘{{ command }}’, ‘title’: ‘ccw0820’, ‘tpl_vars’: [‘cryosparc_username’, ‘job_log_path_abs’, ‘cluster_job_id’, ‘run_cmd’, ‘command’, ‘worker_bin_path’, ‘run_args’, ‘num_gpu’, ‘project_uid’, ‘job_uid’, ‘job_creator’, ‘num_cpu’, ‘ram_gb’, ‘project_dir_abs’, ‘job_dir_abs’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/lscratch/ccw0820’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘custom_var_names’: , ‘custom_vars’: {}, ‘desc’: None, ‘hostname’: ‘gpu_brunger_4gpu’, ‘lane’: ‘gpu_brunger_4gpu’, ‘name’: ‘gpu_brunger_4gpu’, ‘qdel_cmd_tpl’: ‘scancel {{ cluster_job_id }}’, ‘qinfo_cmd_tpl’: ‘sinfo’, ‘qstat_cmd_tpl’: ‘squeue -j {{ cluster_job_id }}’, ‘qstat_code_cmd_tpl’: None, ‘qsub_cmd_tpl’: ‘sbatch {{ script_path_abs }}’, ‘script_tpl’: '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }}

***
Running job on hostname %s gpu_brunger
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'gpu_brunger', 'lane': 'gpu_brunger', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}, 'target': {'cache_path': '/lscratch/ccw0820', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'gpu_brunger', 'lane': 'gpu_brunger', 'name': 'gpu_brunger', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name {{ project_uid }}_{{ job_uid }}\n##SBATCH -n {{ num_cpu }}\n#SBATCH -n 20\n#SBATCH --nodes=1\n#SBATCH --gres=gpu:{{ num_gpu }}\n##SBATCH --gres=gpu:1\n#SBATCH -p brunger\n##SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH --mem=118G\n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/error.txt\n#SBATCH --time=168:00:00\n#SBATCH --mail-user=ccw0820@stanford.edu\n#SBATCH --mail-type=ALL\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\n#available_devs=0,1,2,3\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'gpu_brunger', 'tpl_vars': ['num_gpu', 'cryosparc_username', 'job_uid', 'job_creator', 'ram_gb', 'job_dir_abs', 'run_cmd', 'worker_bin_path', 'run_args', 'project_uid', 'project_dir_abs', 'command', 'job_log_path_abs', 'cluster_job_id', 'num_cpu'], 'type': 'cluster', 'worker_bin_path': '/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw'}}
Override mask provided for volume_0
FSC No-Mask...         0.143 at 53.492 radwn. 0.5 at 40.764 radwn. Took 2.468s.
FSC With Mask...      ========= sending heartbeat at 2023-12-12 14:42:19.768640
0.143 at 80.531 radwn. 0.5 at 71.036 radwn. Took 2.092s.
========= sending heartbeat at 2023-12-12 14:42:29.788625
........
========= sending heartbeat at 2023-12-12 14:47:10.336843
DIE: allocate: out of memory (reservation insufficient)
========= main process now complete at 2023-12-12 14:47:15.456193.
========= monitor process now complete at 2023-12-12 14:47:15.518293.

Thanks @CleoShen. Please can you also post the output of the command (with correct project and job IDs):

cryosparcm eventlog P99 J123 | tail -n 30

Hi wtempel,
Thanks for the fast response.

[CPU RAM used: 176 MB] Working in directory: /oak/stanford/groups/brunger/vATPase/SPA/SPA_SupRes/P3/J1729
[CPU RAM used: 176 MB] Running on lane brunger_sh03-14n15
[CPU RAM used: 176 MB] Resources allocated:
[CPU RAM used: 176 MB]   Worker:  brunger_sh03-14n15
[CPU RAM used: 176 MB]   CPU   :  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[CPU RAM used: 176 MB]   GPU   :  [0, 1, 2, 3]
[CPU RAM used: 176 MB]   RAM   :  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
[CPU RAM used: 176 MB]   SSD   :  False
[CPU RAM used: 176 MB] --------------------------------------------------------------
[CPU RAM used: 176 MB] Importing job module for job type reference_motion_correction...
[CPU RAM used: 228 MB] Job ready to run
[CPU RAM used: 228 MB] ***************************************************************
[CPU RAM used: 228 MB] Random seed: 1602315488
[CPU RAM used: 228 MB] PID: 90661
[CPU RAM used: 318 MB] Scales of input particles in group 0 are all 1.0. Recommendation: re-run the upstream refinement job with 'Minimize over per-particle scale' turned on.
[CPU RAM used: 318 MB] Processing stages enabled:
[CPU RAM used: 318 MB] [X] Optimize hyperparameters
[CPU RAM used: 318 MB] [X] Compute empirical dose weights
[CPU RAM used: 318 MB] [X] Motion-correct particles
Mask for volume 0
FSC for volume 0
[CPU RAM used: 517 MB] Resolution cutoffs: alignment 5.596 A, cross-validation 3.957 A
[CPU RAM used: 2086 MB] Removed 9006 movies with fewer than 2 particles.
[CPU RAM used: 2111 MB] Removed 3295 particles (in micrographs with fewer than 2 total particles).
[CPU RAM used: 2136 MB] --------------------------------------------------------------
        STARTING: OPTIMIZE HYPERPARAMETERS
--------------------------------------------------------------
[CPU RAM used: 2137 MB] Working with 2777 movies containing 12501 particles
[CPU RAM used: 2137 MB] Computing intended data cache configuration
[CPU RAM used: 188 MB] ====== Job process terminated abnormally.

Hi @CleoShen,

Below, I've pasted code for a small C program that asks the machine it's running on how much memory is available and prints that number out. It does so in the same way that cryosparc decides how much memory it can use for reference motion. Please run this program, using the same cluster submission script that you use for cryosparc jobs, and let us know what the output is. If it reports something like 50 GB, then somehow the cluster system is telling the job an inaccurate number for how much RAM is available. If it gives you the right number (128 GB), then maybe your job really is exhausting the machine RAM, which would suggest a very large box size, a very large number of frames, or some combination of such factors.

If you're not familiar with how to compile and run C programs, just paste the code into a text file (e.g. check_memory.c) and compile it (gcc check_memory.c -o check_memory). You'll have to figure out how the cluster submission process works for your site, and you may have to redirect the output to a file so that you can read it afterwards - I'm not sure. There's a rough sketch of a possible submission script after the code below.

#include <stdio.h>
#include <unistd.h>
int main(void)
{
        /* Total physical RAM in bytes = number of physical pages * page size. */
        long long npages = sysconf(_SC_PHYS_PAGES);
        long long pgsz   = sysconf(_SC_PAGE_SIZE);
        printf("%lli\n", npages * pgsz);
        return 0;
}
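
If it helps, here is a rough sketch of what a submission script for this check might look like. The partition, memory request, and time limit are placeholders taken from this thread; substitute whatever matches the lane you normally use:

#!/usr/bin/env bash
#SBATCH --job-name=check_memory
#SBATCH -n 1
#SBATCH -p brunger                    # placeholder partition; use your usual lane's partition
#SBATCH --mem=128G                    # placeholder; match what you request for cryosparc jobs
#SBATCH --time=00:05:00
#SBATCH --output=check_memory.out     # the number will be written to this file

gcc check_memory.c -o check_memory    # compile the program above
./check_memory                        # prints the node's total physical RAM, in bytes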

Hi Hsnyder,

I requested 118 GB this time, and the output of the memory check is "Mem check: 540759629824". Is this what you expected to see?

Hi @CleoShen, that's odd: unless I'm misreading the number, that's 540 GB. Do you know how much physical RAM the compute node actually has? Also, did you add the "Mem check: " portion of that string before you compiled the code? It wasn't present in the version that I posted.
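
(For reference, 540759629824 bytes / 1024^3 ≈ 503.6 GiB, i.e. roughly 540.8 GB in decimal units, so the node is reporting far more memory than the 118 GB you requested.)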