Hi @stephan,
Thank you for your reply.
Yes, I mean the absolute path to the raw data that a job processes.
I am in a situation where, in our shared HPC environment with PBS, we have been advised to tar all the input files, copy the single tarred archive (rather than thousands of smaller files) to the SSD cache on the compute node, and untar it there before running any processing tasks directly against the files on the local SSD, instead of working on our shared Lustre filesystem. So I was wondering whether there is any way to do this within our PBS job submission scripts. If I could reference a variable holding the input data's absolute path, I could script those staging steps for better file I/O performance.
Previously, I ran ab-initio reconstruction and 3D refinement directly on our shared Lustre storage, and processing times were roughly 3x longer or worse. After switching to the SSD cache on the compute nodes, performance is comparable to CryoSPARC's AWS benchmarks. However, when loading the input files into the SSD cache, it seems to copy the .mrc files one by one.
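For reference, the stage-in step I have in mind could be sketched roughly like this. The paths here are throwaway placeholders so the snippet runs anywhere; in a real PBS job, INPUT_DIR would be the absolute path to the raw data on Lustre, and SSD would be the node-local cache (a site-specific variable such as $PBS_JOBFS or $TMPDIR, depending on how your cluster is configured -- best to confirm with your HPC admins):

```shell
#!/bin/bash
# Minimal sketch of staging input data to a node-local SSD cache.
# Uses mktemp placeholders so it is self-contained; substitute your
# real Lustre path and your site's node-local scratch variable.
set -euo pipefail

INPUT_DIR=$(mktemp -d)   # stands in for the Lustre-side raw data directory
SSD=$(mktemp -d)         # stands in for the node-local SSD cache

# Mock up a few small input files (in practice: thousands of .mrc files).
for i in 1 2 3; do echo "frame $i" > "$INPUT_DIR/frame_$i.mrc"; done

# Stage in as a single stream: tar on the source side, untar on the SSD
# side. This turns many small metadata-heavy copies into one large
# sequential transfer, which is what the Lustre admins are asking for.
tar -cf - -C "$INPUT_DIR" . | tar -xf - -C "$SSD"

ls "$SSD"   # files are now on the local cache, ready for processing
```

After processing, results would be tarred and copied back to Lustre the same way, in reverse.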
Regards,
qitsweauca