Another job has locked the cache. Waiting for it to unlock... (waited 3361 seconds)

Hello, I think I have locked the cryosparc system by mistake: I launched a couple of jobs assuming that they would be queued, but nothing happened after the first job finished:

Launched job zfMTbEhrrhuNE4k4t

License is valid.

Job Type: superrefine Job ID: zfMTbEhrrhuNE4k4t

Assigned Slots: RAM: [0, 1] GPU: [0]

Starting High-Res Refinement , Experiment ID xHai4r4wQAL8fTeTf

Loading Dataset …
---- Loading File Set. Type: star Path: /home/commoncryosparc/cryosparc/run/bulk/./local_fs/mnt/img-data/commoncryosparc/Ottilie/Extract/job033/particles.star
Loaded meta data for 604938 images.
Found raw data for 604938 images.
Loading image data…
Found references to 1720 unique data files.
Another job has locked the cache. Waiting for it to unlock… (waited 3601 seconds)

Is there a way to see what is happening? The job seems to be running, but does not give any output…

OK, stopping and restarting solved the problem, but I would really like to know what caused it…

Hi @Ottilie,

Thanks for reporting.
Just to be clear, is this what happened:

  1. you launched two jobs one after another that both use the same dataset
  2. both jobs started at the same time (you have two GPUs at least)
  3. one of the jobs ran to completion
  4. the second job got stuck forever with the “Another job has locked the cache” message

The new cache system has been revamped to handle the case where multiple jobs request the same dataset at the same time (which caused a race condition before), but this may be a new kink to iron out.

Ali

Hi Ali,
No, actually I launched the jobs from two different datasets. So was this the mistake?
Ottilie

Someone here had this issue, and they said it persisted through a restart.

But when I cleared the cache and restarted cryosparc, it was working again.

You might also double-check that there are no filesystem errors, that the cache is still mounted read-write, and that the cryosparc user has access to it.
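If it helps, the mount and permission check can be scripted. Below is a minimal sketch in Python, not a cryosparc tool: the function name `check_cache_dir` is made up for illustration, and you have to pass in your own cache path, since the location depends on how your worker was configured. It goes beyond reading the permission bits and actually creates a temporary file, because a filesystem that has been remounted read-only can still look writable from the permissions alone.

```python
import os
import tempfile


def check_cache_dir(path):
    """Report whether `path` is usable as a cache directory by the current user.

    `path` is whatever cache location your installation uses (an assumption
    of this sketch; it is not read from any cryosparc configuration).
    """
    report = {
        "exists": os.path.isdir(path),
        # Read + execute are both needed to list and traverse a directory.
        "readable": os.access(path, os.R_OK | os.X_OK),
        "writable": False,
    }
    if report["exists"]:
        try:
            # Create and immediately delete a real file. This fails on a
            # read-only mount even when the permission bits look fine.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
            report["writable"] = True
        except OSError:
            pass
    return report
```

Run it as the same user that runs cryosparc, for example `print(check_cache_dir("/path/to/your/cache"))`. If `writable` comes back `False` while `exists` is `True`, the mount or the ownership of the directory is the first thing to look at.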