RBMC failed jobs not cleaned completely

Clearing and restarting jobs after a number of different failure types gives me:

Job directory /home/asarnow/Projects/DeltaCoV/P461/J875 is not empty, found: /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs

@DanielAsarnow What is the output of the command

find /home/asarnow/Projects/DeltaCoV/P461/J875/ -ls

Does the error still occur after the job is cleared and queued again? If it does, could you please collect again the output of

find /home/asarnow/Projects/DeltaCoV/P461/J875/ -ls

Deleting the hyp_opt_trajs dir allows the jobs to requeue normally; it just wasn’t deleted by the cryoSPARC clear action. That directory and its contents (.txt and .npy files) are the only extra stuff in the job dir (i.e. besides the job metadata files that are always there).
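
For anyone hitting the same thing, the manual workaround is just removing the leftover directory yourself, something like (path taken from the error message above):

# remove the leftover hyp_opt_trajs directory by hand;
# after this, clear and re-queue proceed normally
rm -r /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs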

Now, looking at the full ls output more carefully per your suggestion, it seems like a permissions issue: somehow the directory and its contents are owned by me and don’t have group write (which is confusing).

-rw-rw-r-- 1 cryosparc veesler   18 Jul  3 17:57 events.bson
drwxrwsr-x 2 cryosparc veesler    6 Jul  3 17:57 gridfs_data/
drwxr-sr-x 2 asarnow   veesler  60K Jul  3 17:03 hyp_opt_trajs/
-rw-rw-r-- 1 cryosparc veesler 158K Jul  3 17:57 job.json
-rw-r--r-- 1 asarnow veesler   60 Jul  3 16:58 0_hyps.txt
-rw-r--r-- 1 asarnow veesler  10K Jul  3 16:58 0_traj.npy
-rw-r--r-- 1 asarnow veesler   59 Jul  3 17:02 1000_hyps.txt
-rw-r--r-- 1 asarnow veesler 6.3K Jul  3 17:02 1000_traj.npy
-rw-r--r-- 1 asarnow veesler   58 Jul  3 17:02 1001_hyps.txt
-rw-r--r-- 1 asarnow veesler 8.8K Jul  3 17:02 1001_traj.npy
-rw-r--r-- 1 asarnow veesler   60 Jul  3 17:02 1002_hyps.txt
-rw-r--r-- 1 asarnow veesler 8.8K Jul  3 17:02 1002_traj.npy
-rw-r--r-- 1 asarnow veesler   60 Jul  3 17:02 1003_hyps.txt
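
An alternative to deleting the directory outright would presumably be to restore group write on it so the clear action can remove it; this is only a sketch and assumes the cryosparc service account is actually a member of the veesler group, which isn’t confirmed here:

# hypothetical fix, assuming the cryosparc account is in the veesler group:
# make the stray directory and its contents group-writable so the clear
# action can remove them
chmod -R g+w /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs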

Interesting. Do you recall creating or chowning this directory?

I don’t, and my records (I keep a line-by-line eternal shell history) don’t show it either. On the other hand, I don’t see how the cryosparc user could have done this.

Can you confirm that there is not a worker node on this CryoSPARC instance that has an "ssh_str": value starting with asarnow@?

cryosparcm icli
# inside the interactive client, cli is predefined; print the ssh_str of
# every node-type scheduler target
for node in filter(lambda x: x['type'] == 'node', cli.get_scheduler_targets()):
    print(node["ssh_str"])
exit()

OK - now I see. The worker wrote back over sshfs.

I think -o default_permissions is needed as well as -o allow_other.
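
For reference, the mount on the side that writes back over sshfs would then look something like this; the host and paths below are placeholders, the relevant part is just the two options:

# sketch of an sshfs mount with both options (host and paths are placeholders):
#   allow_other         - let users other than the one who ran sshfs use the mount
#   default_permissions - have the kernel enforce the normal mode-bit permission checks
sshfs -o allow_other -o default_permissions user@fileserver:/remote/projects /local/projects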

Also, this only causes an error if the job fails. If it succeeds, all is good, though the permissions are still not as expected due to sshfs. Anyway, not a cryoSPARC bug… sorry lol.

Thanks @DanielAsarnow for posting your finding.