Clear/restart of a number of different failure types is giving me:
Job directory /home/asarnow/Projects/DeltaCoV/P461/J875 is not empty, found: /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs
@DanielAsarnow What is the output of the command
find /home/asarnow/Projects/DeltaCoV/P461/J875/ -ls
Does the error still occur after the job is cleared and queued again? If it does, could you please collect the output of the following command again:
find /home/asarnow/Projects/DeltaCoV/P461/J875/ -ls
Deleting the hyp_opt_trajs dir allows the jobs to requeue normally; it just wasn't deleted by the cryoSPARC clear action. That directory and its contents (.txt and .npy files) are the only extra stuff in the job dir (i.e. besides the job metadata files that are always there).
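For anyone else who hits this, the manual cleanup is just removing that leftover directory before clearing/requeuing, something like:
# remove the stray trajectory directory left behind in the job dir
rm -r /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs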
Now, looking at the full ls output more carefully per your suggestion, it seems like a permissions issue: somehow the directory and its contents are owned by me and don't have group write (which is confusing).
-rw-rw-r-- 1 cryosparc veesler 18 Jul 3 17:57 events.bson
drwxrwsr-x 2 cryosparc veesler 6 Jul 3 17:57 gridfs_data/
drwxr-sr-x 2 asarnow veesler 60K Jul 3 17:03 hyp_opt_trajs/
-rw-rw-r-- 1 cryosparc veesler 158K Jul 3 17:57 job.json
-rw-r--r-- 1 asarnow veesler 60 Jul 3 16:58 0_hyps.txt
-rw-r--r-- 1 asarnow veesler 10K Jul 3 16:58 0_traj.npy
-rw-r--r-- 1 asarnow veesler 59 Jul 3 17:02 1000_hyps.txt
-rw-r--r-- 1 asarnow veesler 6.3K Jul 3 17:02 1000_traj.npy
-rw-r--r-- 1 asarnow veesler 58 Jul 3 17:02 1001_hyps.txt
-rw-r--r-- 1 asarnow veesler 8.8K Jul 3 17:02 1001_traj.npy
-rw-r--r-- 1 asarnow veesler 60 Jul 3 17:02 1002_hyps.txt
-rw-r--r-- 1 asarnow veesler 8.8K Jul 3 17:02 1002_traj.npy
-rw-r--r-- 1 asarnow veesler 60 Jul 3 17:02 1003_hyps.txt
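A quick way to list everything in the job directory that isn't owned by the cryosparc service account (just a standard find sketch; adjust the username if yours differs):
find /home/asarnow/Projects/DeltaCoV/P461/J875/ ! -user cryosparc -ls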
Interesting. Do you recall creating or chowning this directory?
I don’t, and my records (I have a line-by-line eternal history) don’t show it either. OTOH I don’t see how the cryosparc user could have done this.
Can you confirm that there is not a worker node on this CryoSPARC instance that has an "ssh_str" value starting with asarnow@?
cryosparcm icli
# inside the interactive CryoSPARC shell, print the ssh_str of each worker node:
for node in filter(lambda x: x['type'] == 'node', cli.get_scheduler_targets()):
    print(node["ssh_str"])
exit()
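If you'd rather not open the interactive shell, I believe the same information can be pulled in one shot (assuming your cryosparcm version accepts a function-call string via cli); then just look for asarnow@ in the printed ssh_str fields:
cryosparcm cli "get_scheduler_targets()"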
OK - now I see. The worker wrote back over sshfs. I think -o default_permissions is needed as well as -o allow_other.
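For reference, the kind of mount I mean would look roughly like this (hostname and paths are placeholders, not our actual setup; allow_other also requires user_allow_other to be enabled in /etc/fuse.conf when mounting as a non-root user):
# sshfs mount on the worker, with both FUSE options so other users can access the mount and standard permission checks are enforced
sshfs fileserver:/home/asarnow/Projects /home/asarnow/Projects -o allow_other -o default_permissions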
Also - this only causes an error if the job fails; if it succeeds all is good, though the permissions are still not as expected due to sshfs. Anyway, not a cryoSPARC bug… sorry lol.