Job directory is not empty error after strange queueing behaviour

Hello all,

We’ve had a strange incident we hope someone might be able to help with. In one of our projects, CryoSPARC appeared to randomly re-queue hundreds of old jobs (years old, untouched) that had already been run, completely overwhelming our system. This seemed to occur shortly after a small workspace clean-up had been run and a new job queued.

To clear the backlog, all the queued jobs were marked as completed in the database, which restored many of them to normal. However, some of the randomly queued jobs had apparently also tried to launch and entered a failure state. After these too were reset to completed, their output data is no longer visible in the UI. Instead, the jobs carry a message such as:

Job directory /mnt/ome/data07/cryosparc/XXX/J3688 is not empty, found: /mnt/ome/data07/cryosparc/XXX/J3688/job.log

The underlying directories still contain the original data and the output tabs still look correct, but the UI seems locked in this state and we can’t see the outputs in the event log window.
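For context, the status reset was along these lines (a rough sketch only: the project and job UIDs below are placeholders, not our real ones, and the commands are echoed as a dry run rather than executed):

```shell
#!/bin/sh
# Dry-run sketch of marking stuck jobs as completed via the
# cryosparcm CLI. Remove the leading "echo" to actually run the
# commands on a live instance. PROJECT and the job list are
# placeholders for illustration.
PROJECT="P12"
for JOB in J3688 J3689 J3690; do
  echo cryosparcm cli "set_job_status('${PROJECT}', '${JOB}', 'completed')"
done
```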

Does anyone have any suggestions for a solution? We have already tried detaching and re-attaching the project.

Best regards,

Charlie

@charliebe2 On the given CryoSPARC instance, are there any automations or other mechanisms in place that manage CryoSPARC jobs and/or data bypassing the web app?

Were these jobs queued to a node or cluster-type scheduler lane?

There are no automations or other mechanisms in place that manage jobs, and no bypassing of the web app. The jobs are queued to a node.