Is there any way to get the list of all the input file paths for any job?
I know that if we read “job.json” we may get “blob_paths”, “gainref_path” and “connections”. But my requirement is to get the list of all the files needed for a job to run. Also, is “job.json” prepared for all job types?
Please can you illustrate your question with a concrete example, (pseudo-)code, and information like:
- job type?
- input specifications: are the intended inputs created by an upstream CryoSPARC job?
- will you be using cryosparc-tools?
Hi @wtempel, for this scenario we are not using cryosparc-tools. What I am trying to do is place a check before any job runs, so that I can make sure all of its input files are available in the file system.
How I am doing this → I have placed a script inside the cluster_script.sh file; that script runs before the actual CryoSPARC job (I know that not all job types use the cluster_script.sh file).
How the script finds the input file paths and other dependent jobs → the script reads job.json and looks for the gain reference path and the blob path, then it looks at the connections; with the help of the connections I can find the dependent jobs.
So the requirement is: before any job gets submitted or starts running, I would like to place a check that verifies all of its input files are available in the file system (we are using a Lustre filesystem, where files are loaded from S3).
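For illustration, here is a minimal sketch of such a pre-run check. It assumes job.json sits in the job directory and uses the same params_spec fields as the jq commands further down in this thread; the function name check_job_inputs is just an example.

# Minimal sketch of a pre-run input check for one job directory.
check_job_inputs() {
    local job_dir="$1"
    local job_json="${job_dir}/job.json"
    if [ ! -f "$job_json" ]; then
        echo "MISSING: $job_json"
        return 1
    fi
    # The blob path may contain a wildcard; keep only the directory part.
    local blob_dir gainref missing p
    blob_dir=$(jq -r '.params_spec.blob_paths.value // empty' "$job_json" | sed 's/\*.*//')
    gainref=$(jq -r '.params_spec.gainref_path.value // empty' "$job_json")
    missing=0
    for p in "$blob_dir" "$gainref"; do
        if [ -n "$p" ] && [ ! -e "$p" ]; then
            echo "MISSING: $p"
            missing=1
        fi
    done
    return "$missing"
}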
- job type? Ans: All job types.
- input specifications: are the intended inputs created by an upstream CryoSPARC job? Ans: Yes.
- will you be using cryosparc-tools? Ans: Currently, no.
Thanks,
Praveen K V
Interesting. What exactly will you be checking for, and with which command? Please can you explain the expected state of the Lustre filesystem at the time of the check? A relatively “fresh” filesystem that still needs to import data from S3? An “older” filesystem from which data have been “released” as described here?
Hi @wtempel, our scenario is this: we are trying to re-run old jobs. Most of the jobs were run a year ago and all of their files have since moved to S3 Glacier (tape). We now want to re-run a few of those jobs, and for those I would like to bring back only the files they depend on. So I first have to move the files to S3 Standard (we have scripts for that) and then onto Lustre.
All the other scripts are ready; what I still need is the logic for finding the files a given job depends on so it can be re-run, as sketched below. Once I have that, I will bring only those files onto Lustre and run the job again. I think I might be able to get this from the CryoSPARC DB, but before exploring that option I just wanted to check with you.
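To make the intent concrete, here is a hedged sketch of that logic built only on the job.json files: it walks upstream through the connections to collect the directories of all the jobs a given job depends on. The jq filter, the job_uid key, and the example paths are assumptions, not a confirmed job.json layout; adjust them after inspecting your own job.json.

# Recursively collect the directories of all upstream jobs by following the
# connections recorded in each job.json.
collect_ancestor_dirs() {
    local job_dir="$1"
    local project_dir
    project_dir=$(dirname "$job_dir")
    # Pull upstream job UIDs (e.g. J12, J15) out of job.json. The "job_uid" key
    # is an assumption; inspect your own job.json and adjust the filter if needed.
    jq -r '[.. | .job_uid? // empty] | unique | .[]' "${job_dir}/job.json" 2>/dev/null |
    while read -r upstream_uid; do
        upstream_dir="${project_dir}/${upstream_uid}"
        [ "$upstream_dir" = "$job_dir" ] && continue
        echo "$upstream_dir"
        # Recurse so that grand-parent jobs are collected as well.
        [ -f "${upstream_dir}/job.json" ] && collect_ancestor_dirs "$upstream_dir"
    done
}

# Example: list every job directory that job J42 in project P3 depends on
# (the path is illustrative).
collect_ancestor_dirs /lustre/projects/P3/J42 | sort -u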
A few of the commands I use to read data from job.json:
# directory containing the movie files (wildcard stripped)
jq -r '.params_spec.blob_paths.value' job.json | sed 's/\*.*//'
# gain reference file
jq -r '.params_spec.gainref_path.value' job.json
To get all the connections:
# INPUT_HSM_connections (upstream job UIDs taken from job.json) and hsm_check
# are defined earlier in cluster_script.sh.
index=0
while [ "$index" -lt "${#INPUT_HSM_connections[@]}" ]; do
    connection="${INPUT_HSM_connections[$index]}"
    if [ "$connection" != "null" ]; then
        # Replace the last path component of the current job directory with the
        # connected job's UID to get the sibling (upstream) job directory.
        current_dir=$(pwd)
        project_path=$(echo "$current_dir" | sed 's#/[^/]*$#/'"$connection"'#')
        echo "$connection"
        echo "$project_path"
        hsm_check "$project_path"
    fi
    ((index++))
done
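For context, hsm_check is a separate helper script and is not shown here. Purely as an illustration (not the actual script), a check of this kind on Lustre could flag files that HSM reports as released, assuming lfs hsm_state is available on the client:

# Hypothetical hsm_check-style helper: flag files under a path whose data
# currently live only in the archive tier and would need restoring first.
hsm_check_sketch() {
    local target="$1"
    find "$target" -type f 2>/dev/null | while read -r f; do
        if lfs hsm_state "$f" 2>/dev/null | grep -q 'released'; then
            echo "NEEDS RESTORE: $f"
        fi
    done
}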
These are the commands I am using currently; please let me know if you have any additional queries. Thanks.
Thanks for explaining. We unfortunately do not have a tool that does exactly what you are trying to do, but you may want to have a look at the job_find_ancestors() CLI function.
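CLI functions can be called with cryosparcm on the CryoSPARC master node. The arguments shown below (a project UID and a job UID) are an assumption; please confirm the exact signature of job_find_ancestors() in the CLI reference for your CryoSPARC version.

# Assumed invocation; the UIDs are placeholders.
cryosparcm cli "job_find_ancestors('P3', 'J42')"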