Clear/restart of a number of different failure types is giving me:
Job directory /home/asarnow/Projects/DeltaCoV/P461/J875 is not empty, found: /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs
@DanielAsarnow What is the output of the command
find /home/asarnow/Projects/DeltaCoV/P461/J875/ -ls
Does the error still occur after the job is cleared and queued again? If it does, could you please collect the output of the following command again:
find /home/asarnow/Projects/DeltaCoV/P461/J875/ -ls
Deleting the hyp_opt_trajs dir allows the jobs to requeue normally; it just wasn't deleted by the cryoSPARC clear action. That directory and its contents (.txt and .npy files) are the only extra stuff in the job dir (i.e. besides the job metadata files that are always there).
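For anyone else who hits this, the manual cleanup is just removing that leftover directory before clearing/requeuing, something like:
# remove the stray trajectory directory left behind in the job dir
rm -r /home/asarnow/Projects/DeltaCoV/P461/J875/hyp_opt_trajs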
Now, looking at the full ls output more carefully per your suggestion, it seems like a permissions issue: somehow the directory and its contents are owned by me and don't have group write (which is confusing).
-rw-rw-r-- 1 cryosparc veesler 18 Jul 3 17:57 events.bson
drwxrwsr-x 2 cryosparc veesler 6 Jul 3 17:57 gridfs_data/
drwxr-sr-x 2 asarnow veesler 60K Jul 3 17:03 hyp_opt_trajs/
-rw-rw-r-- 1 cryosparc veesler 158K Jul 3 17:57 job.json
-rw-r--r-- 1 asarnow veesler 60 Jul 3 16:58 0_hyps.txt
-rw-r--r-- 1 asarnow veesler 10K Jul 3 16:58 0_traj.npy
-rw-r--r-- 1 asarnow veesler 59 Jul 3 17:02 1000_hyps.txt
-rw-r--r-- 1 asarnow veesler 6.3K Jul 3 17:02 1000_traj.npy
-rw-r--r-- 1 asarnow veesler 58 Jul 3 17:02 1001_hyps.txt
-rw-r--r-- 1 asarnow veesler 8.8K Jul 3 17:02 1001_traj.npy
-rw-r--r-- 1 asarnow veesler 60 Jul 3 17:02 1002_hyps.txt
-rw-r--r-- 1 asarnow veesler 8.8K Jul 3 17:02 1002_traj.npy
-rw-r--r-- 1 asarnow veesler 60 Jul 3 17:02 1003_hyps.txt
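A quick way to list everything in the job directory that isn't owned by the cryosparc service account (just a standard find sketch; adjust the username if yours differs):
find /home/asarnow/Projects/DeltaCoV/P461/J875/ ! -user cryosparc -ls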
Interesting. Do you recall creating or chowning this directory?
I don’t, and my records (I have a line-by-line eternal history) don’t show it either. OTOH I don’t see how the cryosparc user could have done this.
Can you confirm that there is not a worker node on this CryoSPARC instance that has an "ssh_str" value starting with asarnow@?
cryosparcm icli
# inside the interactive CryoSPARC shell, print the ssh_str of each worker node:
for node in filter(lambda x: x['type'] == 'node', cli.get_scheduler_targets()):
    print(node["ssh_str"])
exit()
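If you'd rather not open the interactive shell, I believe the same information can be pulled in one shot (assuming your cryosparcm version accepts a function-call string via cli); then just look for asarnow@ in the printed ssh_str fields:
cryosparcm cli "get_scheduler_targets()"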
OK - now I see. The worker wrote back over sshfs. I think -o default_permissions is needed as well as -o allow_other.
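For reference, the kind of mount I mean would look roughly like this (hostname and paths are placeholders, not our actual setup; allow_other also requires user_allow_other to be enabled in /etc/fuse.conf when mounting as a non-root user):
# sshfs mount on the worker, with both FUSE options so other users can access the mount and standard permission checks are enforced
sshfs fileserver:/home/asarnow/Projects /home/asarnow/Projects -o allow_other -o default_permissions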
Also - this only causes an error if the job fails; if it succeeds all is good, though the permissions are still not as expected due to sshfs. Anyway, not a cryoSPARC bug… sorry lol.