Updating Advanced SSD Parameters on a Cluster

Hi,

I have a question regarding the following message:
SSD cache : cache does not have enough space for download

It happens even though we still have free space on the cache, and it seems to occur mostly when there are a lot of jobs running at the same time, so I was wondering whether, in addition to checking the free space on the SSD cache, there may be an internal threshold that triggers that message? If so, how can we modify that threshold?

My understanding is that it could be changed with the “bin/cryosparcw connect” options (maybe ssdquota or ssdreserve), but I have a hard time understanding how to run it properly and which options should be used:

  • first, should I stop cryosparc before doing it?
  • then, can I check what the threshold is currently?
  • finally, how do I change the threshold?
    Can you please give examples? Also, we’re running it on a cluster. I also see that the “bin/cryosparcw connect” command has “master”, “worker” and “port” options that I’m not sure how to specify, or whether they are needed at all.

Thanks,
Best,
Nicolas

Hey @ncoudray,

first, should I stop cryosparc before doing it?

No need to stop cryoSPARC when you update the configuration for a worker node.

then, can I check what the threshold is currently?

In cryoSPARC, click on the Resource Manager button at the bottom of the page, then click on the Instance Information sub-tab. You will see the values for your worker’s configuration there.
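If you prefer a shell, I believe you can also print this from the master node with the cryosparcm command-line interface, which lists the configuration of every scheduler target:

cryosparcm cli "get_scheduler_targets()"
# prints every worker/cluster target with its configuration,
# including cache_path, cache_quota_mb and cache_reserve_mb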

finally, how do I change the threshold?

Please see the SSD Guide here, which explains the advanced parameters and how to update a worker’s configuration (an example command follows the list below):

  1. <master_hostname> = cryoem1.structura.bio
  • The long-form hostname of the machine that is running the cryoSPARC master process.
  2. <worker_hostname> = cryoem2.structura.bio
  • The long-form hostname of the machine that you are currently trying to update.
  3. <port_num> = 39000
  • The port number that was used to install the cryoSPARC master process, 39000 by default.
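Putting those together, an update on a standalone worker node would look roughly like the following (flag names as described in the SSD guide, run from the cryosparc2_worker directory on the worker itself; the quota/reserve values are placeholders in MB, so substitute your own):

bin/cryosparcw connect \
    --worker cryoem2.structura.bio \
    --master cryoem1.structura.bio \
    --port 39000 \
    --update \
    --ssdquota 500000 \
    --ssdreserve 10000
# --update modifies the existing worker entry rather than registering a new one;
# this example caps the cache at ~500 GB and keeps at least ~10 GB free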

Thanks a lot, @stephan. One last thing I’m confused about (sorry if the question is stupid) is finding the long-form hostnames of the master and the worker in our installation, especially when working on a cluster. I would expect the name to be variable and to change depending on the node a job lands on. Or am I getting this wrong? I can’t figure out what I need to put in place of “cryoem1.structura.bio” and “cryoem2.structura.bio”.

Hi @ncoudray,

Sorry, I totally missed this part.

Your questions make sense; updating the SSD configuration for a cluster integration isn’t clearly covered in our documentation or the SSD guide.

Advanced Parameters

You can specify two advanced parameters to fine-tune your SSD cache (these values are specific to a cluster integration):

cache_quota_mb : The maximum amount of space that cryoSPARC can use on the SSD (MB)

cache_reserve_mb : The minimum amount of free space to leave on the SSD (MB)

Modifying a cluster integration

Please note you still don’t need to stop cryoSPARC if you’re updating a cluster integration.

To see what the current values for your cluster’s integration are, in a shell on the master node, run the command:

cryosparcm cluster dump
# dumps out existing config and script to current working directory

To modify any configuration values for your cluster integration, open a shell on the master node, navigate to the location where the cluster_info.json and cluster_script.sh files exist, and run the command:

cryosparcm cluster connect
# connects new or updates existing cluster configuration, 
# reading cluster_info.json and cluster_script.sh from the 
# current directory, using the name from cluster_info.json

To specifically modify the advanced SSD options, edit cluster_info.json to contain the new values, then run the cryosparcm cluster connect command.

Example

cluster_info.json with advanced SSD options:

{
    "name" : "slurmcluster",
    "worker_bin_path" : "/path/to/cryosparc2_worker/bin/cryosparcw",
    "cache_path" : "/path/to/local/SSD/on/cluster/nodes",
    "send_cmd_tpl" : "ssh loginnode {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}",
    "cache_reserve_mb" : 0,
    "cache_quota_mb" : 500000
}
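So an end-to-end update would look something like this (run on the master node; the working directory and editor are just examples):

mkdir -p ~/cluster_config && cd ~/cluster_config
cryosparcm cluster dump
# edit cluster_info.json to add or change cache_quota_mb / cache_reserve_mb
nano cluster_info.json
cryosparcm cluster connect
# re-registers the lane named in cluster_info.json with the new values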

I’ll update our documentation accordingly. Sorry for any confusion!

Great, thanks @stephan. That helps!

We encountered a similar problem on an SGE or Slurm cluster after setting cache_quota_mb and cache_reserve_mb, because the total file size exceeds the capacity of our SSD. I wonder if it might be possible to copy a portion of the files onto the SSD, up to the cache_quota_mb limit, while reading the remaining files from their original location (for local refinement)? Thanks in advance for any help.

Hi @YueLiu, unfortunately the current version of cryoSPARC does not support this - either all the particle files required for a job are sent to the SSD, or none of them are.

For Local Refinement, I assume you’re also using Particle Subtraction? The subtracted particles are stored at different locations from the originals. In this case, you can approximate what you’re suggesting by running multiple Local Refinement jobs with particles from different Subtraction jobs, with the SSD cache enabled on one of the jobs and disabled on all the rest.

Hope that helps,

Nick

Hi @nfrasser, thanks. That’s a good idea. Similarly, one could split a particle set into several subsets and run multiple Local Refinement jobs with the SSD cache enabled. However, is there a good way to combine the maps derived from these different jobs at the end? Perhaps one workaround is to combine all the subsets after the aforementioned jobs and run one more round of Local Refinement using all particles with the SSD cache disabled? Any suggestions would be much appreciated. Best, Yue

I’m not aware of any way to do the combination you’re describing, but let me check with the rest of the team and get back to you.

Hi @YueLiu, unfortunately there’s no way to do this recombination of maps/particles. Disabling the cache in these scenarios is your best option.

Hi @nfrasser, thanks a lot for your help. Yes, I was testing with the cache disabled.

How does one configure a cluster when the SSD path is job-dependent? On our cluster, local scratch space is allocated per job, and the path includes an environment variable: /lscratch/$SLURM_JOB_ID.

When configuring the cluster lane, I believe you can either i) set cache_path in cluster_info.json to reflect that, or ii) define the variable in cluster_script.sh, e.g. export CRYOSPARC_SSD_PATH="/lscratch/$SLURM_JOB_ID". A sketch of option (ii) is below.
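In case it helps anyone else, here is a minimal sketch of option (ii). The #SBATCH lines are purely illustrative (the lscratch allocation request is site-specific); everything else should follow whatever your existing cluster_script.sh template already contains, with {{ run_cmd }} filled in by cryoSPARC at submission time:

#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --gres=lscratch:500
# point the cryoSPARC SSD cache at this job's local scratch allocation
export CRYOSPARC_SSD_PATH="/lscratch/$SLURM_JOB_ID"
{{ run_cmd }}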
