Update worker cache size

We recently replaced all our SSD cache drives on all cluster nodes with larger drives.

What parameters do I need to feed to cryosparcw to update the node SSD cache quota? Is “cryosparcw cluster connect --ssdquota XXXX --update” sufficient to update the worker SSD cache size without changing any other existing values?

Also, is the ssdquota always set in megabytes? So for 4TB, for example, it would be “--ssdquota 4,000,000”?

Does this process require stopping jobs, or a restart of cryosparcm?

Thanks!

Please can you post the output of

cryosparcm cli "get_scheduler_targets()"

so we have a better idea about your current setup.

Glad to do so! Here is that output:

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/local', 'cache_quota_mb': 1000000, 'cache_reserve_mb': 10000, 'custom_var_names': ['command'], 'desc': None, 'hostname': 'vision', 'lane': 'vision', 'name': 'vision', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cs_{{ cryosparc_username }}_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p defq\n#SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/err.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'vision', 'tpl_vars': ['num_cpu', 'job_dir_abs', 'job_log_path_abs', 'project_dir_abs', 'job_creator', 'num_gpu', 'command', 'project_uid', 'cluster_job_id', 'worker_bin_path', 'job_uid', 'ram_gb', 'cryosparc_username', 'run_args', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw'}, {'cache_path': '/local', 'cache_quota_mb': 1000000, 'cache_reserve_mb': 10000, 'custom_var_names': ['command'], 'desc': None, 'hostname': 'vision-testing', 'lane': 'vision-testing', 'name': 'vision-testing', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p defq\n#SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/err.txt\n\nmodule load cuda10.2/toolkit/10.2.89\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'vision', 'tpl_vars': ['num_cpu', 'job_dir_abs', 'job_log_path_abs', 'project_dir_abs', 'job_creator', 'num_gpu', 'command', 'project_uid', 'cluster_job_id', 'worker_bin_path', 'job_uid', 'ram_gb', 'cryosparc_username', 'run_args', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw'}]

You could use cryosparcw connect --update, but you may (I am not sure) have to specify all parameters again, even those you do not wish to change. It may be simpler to run (details):

cryosparcm cli "set_scheduler_target_property('vision', 'cache_quota_mb', 4000000)"

If needed, you may similarly change the 'cache_reserve_mb' property.
A restart of CryoSPARC is not needed, but only jobs started after the change was made can use the additional cache.
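
For example, since your get_scheduler_targets() output lists two targets (hostnames vision and vision-testing), a sketch of updating and then verifying both could look like the following; the 4000000 MB value corresponds to the 4 TB from your question in decimal units, so substitute whatever matches your new drives:

cryosparcm cli "set_scheduler_target_property('vision', 'cache_quota_mb', 4000000)"
cryosparcm cli "set_scheduler_target_property('vision-testing', 'cache_quota_mb', 4000000)"
cryosparcm cli "get_scheduler_targets()"   # confirm cache_quota_mb now reports the new value on each target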


Very good, I will try this out this week and report back.

What is the best practice for setting ‘cache_reserve_mb’? Should it generally be set in some proportion to the total cache size?

@wtempel Thanks for the advice. This worked fine.

I’d still like to know what the best practice is for setting “cache_reserve_mb” – is there a recommended ratio or metric? Thanks!

Discussion of this question within our team revealed:

  • the original motivation for cache_reserve_mb was to facilitate sharing of the cache space with non-CryoSPARC applications.
  • reserving some free space may also have an effect on device lifetime, but I am not sure how strong that effect is.

The specific circumstances of your CryoSPARC installation should be the principal determinant of this setting.
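
Should you decide to adjust it, the same cli mechanism applies. Purely as an illustration of the call (the 10000 MB value simply mirrors the cache_reserve_mb already shown in your output, not a recommendation), using the vision hostname from your setup:

cryosparcm cli "set_scheduler_target_property('vision', 'cache_reserve_mb', 10000)"
cryosparcm cli "get_scheduler_targets()"   # check cache_reserve_mb on each target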

@wtempel Perfect. I will set it to whatever best suits our environment, then. Thanks for checking into this and for the useful details.