Is there any way to get the list of all the input file paths for any job?
I know that if we read “job.json” we may get “blob_paths”, “gainref_path” and “connections”. But my requirement is to get the list of all the files needed for a job to run. Also, is “job.json” prepared for all job types?
Please can you illustrate your question with a concrete example, (pseudo-)code, and information like:
- job type?
- input specifications: are the intended inputs created by an upstream CryoSPARC job?
- will you be using cryosparc-tools?
Hi @wtempel, for this scenario we are not using cryosparc-tools. What I am trying to do is place a check before any job runs, so that I can make sure all of its input files are available in the file system.
How I am doing this → I have placed a script inside the cluster_script.sh file; that script runs before the actual CryoSPARC job (I know that not all job types use the cluster_script.sh file).
How the script finds the input file paths and other dependent jobs → the script reads job.json and looks for the gain reference path and the blob path, then it looks at the connections; with the help of the connections I can find the dependent jobs.
So the requirement is: before any job gets submitted or starts running, I would like to place a check that verifies all of its input files are available in the file system (we are using a Lustre filesystem, where files are loaded from S3).
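For illustration, here is a minimal sketch of such a pre-run check. It assumes job.json sits in the job directory and uses the same params_spec fields as the jq commands further down in this thread; the function name check_job_inputs is just an example.

# Minimal sketch of a pre-run input check for one job directory.
check_job_inputs() {
    local job_dir="$1"
    local job_json="${job_dir}/job.json"
    if [ ! -f "$job_json" ]; then
        echo "MISSING: $job_json"
        return 1
    fi
    # The blob path may contain a wildcard; keep only the directory part.
    local blob_dir gainref missing p
    blob_dir=$(jq -r '.params_spec.blob_paths.value // empty' "$job_json" | sed 's/\*.*//')
    gainref=$(jq -r '.params_spec.gainref_path.value // empty' "$job_json")
    missing=0
    for p in "$blob_dir" "$gainref"; do
        if [ -n "$p" ] && [ ! -e "$p" ]; then
            echo "MISSING: $p"
            missing=1
        fi
    done
    return "$missing"
}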
- job type? Ans: All job types.
- input specifications: are the intended inputs created by an upstream CryoSPARC job? Ans: Yes.
- will you be using cryosparc-tools? Ans: Currently, no.
Thanks,
Praveen K V
Interesting. What exactly will you be checking for, and with which command? Please can you explain the expected state of the Lustre filesystem at the time of the check? A relatively “fresh” filesystem that still needs to import data from S3? An “older” filesystem from which data have been “released” as described here?
Hi @wtempel, our scenario is this: we are trying to re-run old jobs. Most of the jobs were run a year ago and all of their files have since moved to S3 Glacier (tape). We now want to re-run a few of those jobs, and for those I would like to bring back only the files they depend on. So I first have to move the files to S3 Standard (we have scripts for that) and then onto Lustre.
All the other scripts are ready; what I still need is the logic for finding the files a given job depends on so it can be re-run, as sketched below. Once I have that, I will bring only those files onto Lustre and run the job again. I think I might be able to get this from the CryoSPARC DB, but before exploring that option I just wanted to check with you.
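To make the intent concrete, here is a hedged sketch of that logic built only on the job.json files: it walks upstream through the connections to collect the directories of all the jobs a given job depends on. The jq filter, the job_uid key, and the example paths are assumptions, not a confirmed job.json layout; adjust them after inspecting your own job.json.

# Recursively collect the directories of all upstream jobs by following the
# connections recorded in each job.json.
collect_ancestor_dirs() {
    local job_dir="$1"
    local project_dir
    project_dir=$(dirname "$job_dir")
    # Pull upstream job UIDs (e.g. J12, J15) out of job.json. The "job_uid" key
    # is an assumption; inspect your own job.json and adjust the filter if needed.
    jq -r '[.. | .job_uid? // empty] | unique | .[]' "${job_dir}/job.json" 2>/dev/null |
    while read -r upstream_uid; do
        upstream_dir="${project_dir}/${upstream_uid}"
        [ "$upstream_dir" = "$job_dir" ] && continue
        echo "$upstream_dir"
        # Recurse so that grand-parent jobs are collected as well.
        [ -f "${upstream_dir}/job.json" ] && collect_ancestor_dirs "$upstream_dir"
    done
}

# Example: list every job directory that job J42 in project P3 depends on
# (the path is illustrative).
collect_ancestor_dirs /lustre/projects/P3/J42 | sort -u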
A few of the commands I use to read data from job.json:
# directory containing the movie files (wildcard stripped)
jq -r '.params_spec.blob_paths.value' job.json | sed 's/\*.*//'
# gain reference file
jq -r '.params_spec.gainref_path.value' job.json
To get all the connections:
# INPUT_HSM_connections (upstream job UIDs taken from job.json) and hsm_check
# are defined earlier in cluster_script.sh.
index=0
while [ "$index" -lt "${#INPUT_HSM_connections[@]}" ]; do
    connection="${INPUT_HSM_connections[$index]}"
    if [ "$connection" != "null" ]; then
        # Replace the last path component of the current job directory with the
        # connected job's UID to get the sibling (upstream) job directory.
        current_dir=$(pwd)
        project_path=$(echo "$current_dir" | sed 's#/[^/]*$#/'"$connection"'#')
        echo "$connection"
        echo "$project_path"
        hsm_check "$project_path"
    fi
    ((index++))
done
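For context, hsm_check is a separate helper script and is not shown here. Purely as an illustration (not the actual script), a check of this kind on Lustre could flag files that HSM reports as released, assuming lfs hsm_state is available on the client:

# Hypothetical hsm_check-style helper: flag files under a path whose data
# currently live only in the archive tier and would need restoring first.
hsm_check_sketch() {
    local target="$1"
    find "$target" -type f 2>/dev/null | while read -r f; do
        if lfs hsm_state "$f" 2>/dev/null | grep -q 'released'; then
            echo "NEEDS RESTORE: $f"
        fi
    done
}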
These are the commands I am using currently; please let me know if you have any additional queries. Thanks.
Thanks for explaining. We unfortunately do not have a tool that does exactly what you are trying to do, but you may want to have a look at the job_find_ancestors() CLI function.
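CLI functions can be called with cryosparcm on the CryoSPARC master node. The arguments shown below (a project UID and a job UID) are an assumption; please confirm the exact signature of job_find_ancestors() in the CLI reference for your CryoSPARC version.

# Assumed invocation; the UIDs are placeholders.
cryosparcm cli "job_find_ancestors('P3', 'J42')"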