Is there a way for either nonprivileged users, or users who are cryoSPARC admins, but not root users on the master/worker, to reset file locks? And is there a preferred method for root users / the cryosparc UNIX user to do so?
I have some data with spurious file locks that prevent caching on one of our cluster lanes but not another (both on the same SLURM queue, but different node lists). It would be great if we could fix them without disrupting everything else going on in cryoSPARC.
The lock is intended to prevent modification of file sets that are also being written to cache by another job. If you suspect an inconsistency in CryoSPARC’s cache tracking and you know that the cached (or to-be-cached) files are not currently being modified by any other running jobs, various interventions are possible. Unfortunately, the documented interventions assume no jobs are running, whereas you specified that you would like to fix the locks without disrupting everything else going on in cryoSPARC.
Please can you elaborate more on the following:
Do you suspect the reportedly locked files are not legitimately related to another active job?
Are these data being processed on both sets of nodes, but long-term locking only occurs on one set of nodes?
Are cache devices shared between nodes, but not between the sets of nodes?
Maybe I am confused about what’s going on. Are these actually write locks on the destination cache? And is the intended behavior for multiple jobs that they go ahead and write if the destinations are two different workers, and wait if they’re the same?
CryoSPARC cache files are tracked in the database on the master node, so that workers can coordinate writing files using a shared cache space. When a job (on the same or a different worker node) starts writing a file to a shared cache space, other jobs using that same file (with the same file path) will wait for that job to finish writing the file, then access the cached copy. While jobs are waiting in this manner, a message like “SSD cache : requested files are locked for past 26306s, checking again in 5s” could appear in the event log to show that another job is currently writing that file to the SSD cache. Once the file is done being written, the waiting job should continue as normal using that file on the cache.
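For illustration, here is a minimal Python sketch of the coordination pattern described above. This is not CryoSPARC's actual implementation: the `cache_registry` dict is a stand-in for the master database, and all names are hypothetical. One job takes the per-path record and copies the file, while any other job requesting the same path polls every 5 seconds, which is what produces the "requested files are locked for past Ns" message.

```python
# Illustrative sketch only -- NOT CryoSPARC code.
# A shared tracking record (stand-in for the master database) marks which
# cache files are being written; other jobs poll until the writer finishes.
import shutil
import time

POLL_INTERVAL_S = 5  # matches the "checking again in 5s" event-log message

# Hypothetical stand-in for the master's cache tracking:
# maps cache file path -> state ("copying" or "done").
cache_registry = {}

def ensure_cached(src_path: str, cache_path: str) -> str:
    """Return cache_path once the file is present in the cache.

    If another job is already copying the same cache_path, wait for it
    instead of writing a duplicate copy.
    """
    waited = 0
    while cache_registry.get(cache_path) == "copying":
        # Another job holds the record for this path; emulate the
        # "requested files are locked for past Ns" message.
        print(f"SSD cache : requested files are locked for past {waited}s, "
              f"checking again in {POLL_INTERVAL_S}s")
        time.sleep(POLL_INTERVAL_S)
        waited += POLL_INTERVAL_S

    if cache_registry.get(cache_path) != "done":
        cache_registry[cache_path] = "copying"  # take the per-path lock
        shutil.copyfile(src_path, cache_path)   # write to the SSD cache
        cache_registry[cache_path] = "done"     # release the lock
    return cache_path
```

In the normal case this resolves on its own: the waiting job's loop exits as soon as the writing job marks the file done.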
@nwong I was just revisiting this because I have this problem again on multiple projects. (“Build X all classes” is a good way to trigger it).
After rereading your explanation carefully (“on the same or different node with the same file path”), it seems that if there are two nodes with the same cache path (e.g. /scratch is node-local NVMe on all nodes) running jobs on the same particle set, then there might be spurious locks.
You also wrote “when a job starts writing to cache”, but that isn’t what I see. I see e.g. “Build X all classes”, and then queuing the jobs leads to all of those jobs being locked simultaneously (on the same or different nodes). There isn’t one job that writes while the others wait; they all wait for each other indefinitely, hence the extremely long wait times in my locked-file messages. What does work is queuing to a different lane, even though the cache paths are the same.
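To make the suspected failure mode concrete, here is a hypothetical Python sketch. This is not CryoSPARC code, and the path-only keying it assumes is an unconfirmed guess: if lock records were keyed by the cache file path alone rather than by (node, path), a stale or foreign “copying” record for /scratch on one node would also block jobs on every other node whose node-local /scratch shares that path, so all of them wait indefinitely.

```python
# Hypothetical illustration of the suspected failure mode -- not CryoSPARC code.
# Assumption: lock records are keyed by cache path alone, so a record created
# for node-a's /scratch also matches node-b's physically different /scratch.

cache_registry = {
    # stale / foreign lock, keyed by path only
    "/scratch/cache/P1/J42/particles.mrc": {"state": "copying", "node": "node-a"},
}

def can_write(node: str, cache_path: str) -> bool:
    rec = cache_registry.get(cache_path)
    if rec is None or rec["state"] == "done":
        return True
    # Keyed by path only: node-b waits on node-a's record even though
    # node-b's /scratch is a different physical device.
    return False

for node in ("node-a", "node-b", "node-c"):
    print(node, "may write:", can_write(node, "/scratch/cache/P1/J42/particles.mrc"))
# All three print False, mirroring "they all wait for each other indefinitely".
```

Keying locks per node (or per physical cache device) would let each node proceed independently, which would also be consistent with the jobs running fine when queued to a different lane.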