Is it safe to tar and delete .npy files

Hello Cryosparc Team,

We have a cluster whose storage system has an inode limit, which we hit regularly even though the size limit is nowhere near reached. We realized that the job sub-folders ctfestimation, motioncorrection, reconstruction and hyp_opt_trajs in particular contain a lot of those .npy files.

Our thought now was to tar and delete those files so that they can be recovered quickly if needed.

Can this be done safely or are they critical for later stages of processing?

Best,
Markus Stabrin

P.S.

We of course also looked at other possible candidates like thumbnails, but those should be safe to tar once the dataset is screened.

Hi @mstabrin ,

My first thought is of course to recommend you contact the IT team and see if they can increase the inode limit, but I’m sure you’re already doing that.

  • If you remove the .npy files produced by patch motion correction, you won’t be able to do reference-based motion correction later (without first untarring).
  • If you remove any of the .npy files produced by patch motion or patch ctf, exposure curation jobs may fail.

So if those aren’t important to you, it should be safe to manually tar and delete those .npy files.
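As a sketch of what such an archive-and-delete pass might look like (the folder layout is an assumption, and the `--null -T -` flags assume GNU tar; adapt as needed):

```shell
# Sketch: a small helper that archives and then deletes all top-level .npy
# files in a given directory, collapsing many inodes into one tar file.
# Assumes GNU tar; the directory layout is an assumption -- adapt as needed.
archive_npy() {
    dir="$1"
    ( cd "$dir" || return 1
      # Collect the .npy files into a single tar (one inode)...
      find . -maxdepth 1 -name '*.npy' -print0 | tar -cf npy_backup.tar --null -T -
      # ...then delete the originals to free their inodes.
      find . -maxdepth 1 -name '*.npy' -delete )
}
# Hypothetical usage: archive_npy /path/to/project/J42/motioncorrection
# Restore later with:  ( cd /path/to/project/J42/motioncorrection && tar -xf npy_backup.tar )
```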

– Harris

Hello @hsnyder ,

Thank you for the information :slight_smile:

Unfortunately, the storage follows a strict X-inodes-per-TB quota policy. Therefore, increasing the inode limit without increasing the quota is not an option right now.

I had a feeling that those jobs might suffer, but teaching my users to untar their files specifically for those runs is a risk I am willing to take :smiley: We have about 700TB of quota, of which 50TB is still free; that corresponds to 45M inodes, of which 0 are free in our case. Tarring those .npy files has now freed up about 15M inodes, so it is pretty significant.

One idea: since we use a cluster setup, our worker jobs are submitted to the cluster anyway. Is there an easy way to use the CLI to get the job type and the related inputs from the job? That way I could do the untarring automatically for those job types.

Best,
Markus

Hi @mstabrin ,

In principle it should be possible to automate such untarring with some Python and cryosparc-tools. It’s not exactly trivial, but here’s a possible outline for how it could be done.

First, here are some useful guide pages for context:

You could write a script that does the following:

  • Accepts the project uid, job uid, and project directory as arguments

  • Connects to CryoSPARC master and loads the job

  • Loads all the exposure inputs

  • Checks for presence of .npy files in the ctf/path, rigid_motion/path, etc. fields

  • Untars those if necessary
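
The steps above might look something like this. Everything here is an assumption to adapt: the input name ("exposures"), the field names, the archive naming convention (one npy_backup.tar next to the .npy files, matching however you tarred them), and the connection credentials — check the cryosparc-tools docs for the exact calls.

```python
# Hypothetical sketch of the untar script outlined above.
# Assumes the .npy files in each folder were archived into "npy_backup.tar".
import subprocess
from pathlib import Path

# Fields that may reference .npy files (assumed names -- check your datasets).
NPY_FIELDS = ["ctf/path", "rigid_motion/path", "spline_motion/path"]

def npy_paths(rows, fields=NPY_FIELDS):
    """Collect the relative .npy paths referenced by exposure rows (dicts)."""
    paths = set()
    for row in rows:
        for field in fields:
            p = row.get(field)
            if p and str(p).endswith(".npy"):
                paths.add(str(p))
    return sorted(paths)

def untar_missing(project_dir, rel_paths, archive_name="npy_backup.tar"):
    """For every missing .npy file, extract the assumed archive in its folder."""
    for rel in rel_paths:
        target = Path(project_dir) / rel
        if target.exists():
            continue
        archive = target.parent / archive_name
        if archive.exists():
            subprocess.run(["tar", "-xf", archive.name],
                           cwd=target.parent, check=True)

def run(project_uid, job_uid, project_dir):
    """Connect to the master, load the job's exposure inputs, untar as needed.
    Requires `pip install cryosparc-tools`; host/port/credentials below are
    placeholders for your instance."""
    from cryosparc.tools import CryoSPARC
    cs = CryoSPARC(license="xxxxxxxx", host="localhost", base_port=39000,
                   email="user@example.com", password="xxxxxxxx")
    job = cs.find_job(project_uid, job_uid)
    # The exact input name varies by job type -- inspect the job document.
    dset = job.load_input("exposures")
    rows = [{f: row[f] for f in NPY_FIELDS if f in dset.fields()}
            for row in dset.rows()]
    untar_missing(project_dir, npy_paths(rows))
```

A thin argparse wrapper around `run()` would then accept `--project`, `--job`, and `--project-dir`, matching the command line shown below.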

This script should be in its own conda/mamba environment and invoked from the cluster submission script similarly to this:

conda run -n [insert-env-name-here] python /path/to/untar_npy.py --project {{ project_uid }} --job {{ job_uid }} --project-dir {{ project_dir_abs }}
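
For instance, the cluster submission template could invoke it just before the worker command (the env name and script path below are placeholders):

```shell
#!/usr/bin/env bash
#SBATCH --partition=...   # your usual scheduler directives

# Restore any archived .npy inputs before the CryoSPARC worker starts.
# "cryosparc-tools-env" and the script path are placeholders.
conda run -n cryosparc-tools-env python /path/to/untar_npy.py \
    --project {{ project_uid }} --job {{ job_uid }} --project-dir {{ project_dir_abs }}

{{ run_cmd }}
```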

– Harris

Hello @hsnyder ,

Thank you very much for the context! I think that I got something usable running :slight_smile:

Best,

Markus
