Is there any way to get all the input file paths for any job in cryoSPARC?

Is there any way to get the list of all the input file paths for any job?
I know that if we read “job.json” we may get “blob_paths”, “gainref_path” and “connections”. But my requirement is to get the list of all the files needed for a job to run. Also, does cryoSPARC prepare a “job.json” for all jobs?

Please can you illustrate your question with a concrete example and (pseudo-)code, along with information like

  • job type?
  • input specifications: are the intended inputs created by an upstream CryoSPARC job?
  • will you be using cryosparc-tools?

Hi @wtempel, for this scenario we are not using cryosparc-tools. What I am trying to do here is: before any job runs, I would like to have a check in place so that I can make sure all the input files are available in the file system.
How I am doing this → I have placed a script inside the cluster_script.sh file; that script runs before the actual cryoSPARC job (I know that not all job types use the cluster_script.sh file).
How the script finds the input file paths and other dependency jobs → the script checks job.json, looking for the reference path and blob path, and then it looks at the connections; with the help of the connections I find the dependency jobs.

So the requirement is: before any job gets submitted or starts running, I would like to have a check in place that verifies all the input files are available in the file system (we are using a Lustre filesystem, where files get loaded from S3).
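A minimal sketch of such a check, assuming the expected input paths have already been collected into a file, one absolute path per line (expected_inputs.txt is a hypothetical name):

missing=0
while IFS= read -r path; do
    if [ ! -e "$path" ]; then
        echo "Missing input: $path" >&2
        missing=1
    fi
done < expected_inputs.txt
# Refuse to start the job if anything is still absent from the filesystem.
if [ "$missing" -ne 0 ]; then
    echo "Not all input files are present; aborting before job start." >&2
    exit 1
fi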

  • job type? Ans: For all the jobs
  • input specifications: are the intended inputs created by an upstream CryoSPARC job? Ans: Yes
  • will you be using cryosparc-tools? Ans: Currently no.

Thanks,
Praveen K V

Interesting. What exactly will you be checking for (and with which command)? Please can you explain the expected state of the Lustre filesystem at the time of the check? A relatively “fresh” filesystem that still needs to import data from S3? An “older” filesystem from which data have been “released” as described here?

Hi @wtempel, our scenario is like this: we are trying to re-run old jobs. Most of the jobs were run a year ago and all the files were moved to S3 Glacier (tape); now we have to re-run a few of the jobs, and for that I would like to restore the dependency files only for those particular jobs. So I have to bring them first to S3 Standard (we have the scripts for that) and then to Lustre.
All the other scripts are ready; now I need the logic for finding the dependency files needed to re-run any job. Once I have that, I will bring only those files to Lustre and we will run the job again. I think I might be able to get something with the help of the cryoSPARC DB, but before exploring that option I just wanted to check with you.
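For context, restoring an object from Glacier and polling for its availability with the AWS CLI generally looks like the following (bucket and key are placeholders, and this is a generic sketch rather than our actual scripts):

# Request a temporary (7-day) restore of an archived object.
aws s3api restore-object \
    --bucket my-bucket \
    --key path/to/movie.tif \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'

# Poll: the Restore header flips to ongoing-request="false" when done.
aws s3api head-object --bucket my-bucket --key path/to/movie.tif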
A few of the commands to read the data from job.json are:

# Directory prefix of the movie glob: strip everything from the first '*'.
jq -r '.params_spec.blob_paths.value' job.json | sed 's/\*.*//'
# Gain reference file path.
jq -r '.params_spec.gainref_path.value' job.json

To get all the connections:

index=0
while [ $index -lt ${#INPUT_HSM_connections[@]} ]; do
    connection="${INPUT_HSM_connections[$index]}"
    if [ "$connection" != "null" ]; then
        # Replace the last component of the current job directory with the
        # connection's job UID to get the upstream job's directory.
        current_dir=$(pwd)
        project_path=$(echo "$current_dir" | sed 's#/[^/]*$#/'"$connection"'#')
        echo "$connection"
        echo "$project_path"
        hsm_check "$project_path"
    fi
    ((index++))
done
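To cover transitive dependencies, the same idea can be applied recursively through each upstream job's own job.json. A sketch only (it assumes, as the loop above does, that each connection value is an upstream job UID whose directory sits alongside the current job directory; the exact shape of the connections field may differ):

# Recursively collect the file paths referenced by a job and its upstream jobs.
collect_inputs() {
    local job_dir="$1"
    local json="$job_dir/job.json"
    [ -f "$json" ] || return
    # Paths referenced directly by this job (the fields discussed above).
    jq -r '.params_spec.blob_paths.value // empty' "$json" | sed 's/\*.*//'
    jq -r '.params_spec.gainref_path.value // empty' "$json"
    # Follow each connection to the upstream job's directory.
    for uid in $(jq -r '.connections[]? // empty' "$json"); do
        [ "$uid" != "null" ] && collect_inputs "$(dirname "$job_dir")/$uid"
    done
}
collect_inputs "$(pwd)"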

These are the commands I am using currently; please let me know if you have any additional queries. Thanks.

Thanks for explaining. We unfortunately do not have a tool that does exactly what you are trying to do, but you may want to have a look at the job_find_ancestors() CLI function.
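For example, CLI functions can be called through cryosparcm; note that the argument names below are an assumption, so please check the CLI reference for your CryoSPARC version:

cryosparcm cli "job_find_ancestors(project_uid='P1', job_uid='J42')"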