Allow multiple jobs to copy to SSD cache

I do my processing on a supercomputer where each node has a volatile cache (data is deleted after each job finishes), so every job has to synchronise the particle stack to the local SSD. I have noticed that when I run multiple jobs in parallel (e.g. homogeneous refinement on all 3D classes), the first job to start locks the particle stack and all the other jobs have to wait. The parallel file system and network on this machine are very good, so there is no performance reason to prevent all jobs from synchronising at the same time. Having jobs idle while they wait for the stack to be unlocked hurts me in two ways: it costs computational time (time is limited on the system) and it lowers average performance (jobs that do not use resources efficiently are killed).

Could you please (perhaps as a configurable option) allow multiple jobs to synchronise at the same time? And/or could you add an option for dependencies, so that a job is not submitted until the previous job has finished synchronising?

What impact does disabling caching have? If the network/file system are that good, you might not need to cache at all (or, rather, the impact from not caching will be less harmful in your scenario than the one-by-one lock for each run…)

Not a solution to your problem per se, but maybe a workaround.

If the nodes are equipped with multiple GPUs, could you request the necessary resources under a single session, so that as many GPUs as possible share a common volatile cache? Registering those node(s) as normal workers in CryoSPARC will immediately reduce the number of times data has to be cached. You’ll need to terminate the parent session manually, though.

I’ve done this via SLURM on an HPC cluster, using a combination of SSH hostname aliases to manage the redirection and a simple script that runs cryosparcw connect --update to refresh the worker configuration in the database, and it works relatively painlessly.
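
A minimal sketch of what such a SLURM job script might look like, not the actual script described above. The host names, port, SSD path, GPU count and the run helper are all placeholders or assumptions; the SSH alias redirection is not shown, and the flags should be checked against cryosparcw connect --help for your version. By default the script only prints the command (DRY_RUN=1).

```shell
#!/bin/sh
#SBATCH --job-name=cryosparc-worker   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4                  # assumption: 4 GPUs share one node-local SSD
#SBATCH --time=24:00:00

# Placeholder values; replace with those of your own installation.
MASTER_HOST="${MASTER_HOST:-cryosparc-master.example.org}"
BASE_PORT="${BASE_PORT:-39000}"
SSD_PATH="${SSD_PATH:-/scratch/local}"
WORKER_ALIAS="${WORKER_ALIAS:-cs-worker-1}"   # fixed SSH alias pointing at this node

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Refresh the existing worker entry so it reflects this allocation.
run cryosparcw connect --worker "$WORKER_ALIAS" --master "$MASTER_HOST" \
    --port "$BASE_PORT" --ssdpath "$SSD_PATH" --update

# Keep the allocation (and its volatile cache) alive until the job is
# cancelled manually, e.g. with scancel. "sleep infinity" needs GNU coreutils.
[ "${DRY_RUN:-1}" = "1" ] || sleep infinity
```

Submitting with DRY_RUN=0 in the environment would run the real command; all jobs scheduled on that worker then share the one cache for as long as the allocation lives.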

Cheers,
Yang

If you have not already, you may want to try a new cache implementation available in CryoSPARC v4.4+. It can be enabled by defining, inside cryosparc_worker/config.sh,

export CRYOSPARC_IMPROVED_SSD_CACHE=true 

This cache implementation permits certain cache transfers to occur in parallel. An additional performance improvement may be achieved by defining, also in cryosparc_worker/config.sh, a larger value for the CRYOSPARC_CACHE_NUM_THREADS variable, for example

export CRYOSPARC_CACHE_NUM_THREADS=4

(documentation).
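
If you maintain several worker installations, the two settings above can be applied with a small script. This is only a sketch: set_export is a hypothetical helper, not part of CryoSPARC, and the config path in the usage comment is a placeholder. It assumes values that do not contain the "|" character.

```shell
#!/bin/sh
# Hypothetical helper: add or update an "export NAME=value" line in a file.
# $1 = config file, $2 = variable name, $3 = value
set_export() {
    if grep -q "^export $2=" "$1" 2>/dev/null; then
        # Variable already defined: rewrite the existing line (backup in $1.bak).
        sed -i.bak "s|^export $2=.*|export $2=$3|" "$1"
    else
        # Make sure the file ends with a newline before appending.
        if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
            echo >> "$1"
        fi
        printf 'export %s=%s\n' "$2" "$3" >> "$1"
    fi
}

# Example usage (adjust the path to your installation):
# set_export /path/to/cryosparc_worker/config.sh CRYOSPARC_IMPROVED_SSD_CACHE true
# set_export /path/to/cryosparc_worker/config.sh CRYOSPARC_CACHE_NUM_THREADS 4
```

Running the helper twice for the same variable leaves a single line with the latest value, so it is safe to re-run.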
