Clearing intermediate results reports errors due to jobs being in status "building"

Dear cryosparc support,

I am operating a couple of cryosparc instances on our network. On one of them (running v4.2.1) I’ve noticed some odd behavior when trying to clear intermediate results at the project level.

$ cryosparcm log command_core |grep "clear_intermediate_results_run WARNING  | AssertionError: Cannot clear intermediate results for" |grep "2023-09-04 1"
2023-09-04 13:21:04,340 COMMAND.DATA         clear_intermediate_results_run WARNING  | AssertionError: Cannot clear intermediate results for P25 J108 in status building

When checking for those jobs, I do not find them in the GUI. However, I do find them using the cli (I can’t judge whether this is intended or not):

In [1]: cli.get_jobs_by_status(status="building", count_only=True)
Out[1]: 938

I might be missing something, but this number looks wrong to me.

Moreover, the clearing does not lead to any reduction in the total size of the project folder.
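
For reference, a minimal sketch of how I check the folder size (the path is a placeholder; in icli, cli.get_project('P25').get('project_dir') gives the actual location), so a before/after comparison of the clearing step is possible:

import os

def dir_size_bytes(path):
    # Sum the sizes of all regular files below `path`; symlinks are skipped.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

project_dir = "/path/to/P25"  # placeholder; e.g. cli.get_project('P25').get('project_dir') in icli
print(f"{dir_size_bytes(project_dir) / 1e9:.2f} GB")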

Can anyone tell me what I can do to debug this in greater depth?

Do you think that the count of jobs in building status is incorrect? Can you identify any jobs in the output of

list(db.jobs.find({"status": "building"},{"_id": 0, "uid": 1, "project_uid": 1}))

that are not in building status?

Running your command gives me the exact same number of jobs (now changed to 943). It’s the same jobs, and the list includes lots of jobs that no longer exist, i.e. I cannot find them in the GUI, and I suspect they’re also missing on the filesystem.
What other options do I have to debug this?

It just occurred to me that the earlier query would also return jobs that have been marked as deleted. How long is the list returned by the command

list(db.jobs.find({"status": "building", "deleted": False},{"_id: ": 0, "uid": 1, "project_uid": 1,}))

It would be important to know whether a given job in this list is missing either

  • in the GUI
  • or in the filesystem
  • or both

Here’s the requested output:

list(db.jobs.find({"status": "building", "deleted": False},{"_id": 0, "uid": 1, "project_uid": 1})).__len__()
154
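
For a per-project breakdown of those 154 jobs, something like the following sketch (run against the same db handle as above) should work:

from collections import Counter

# Count non-deleted "building" jobs per project.
counts = Counter(
    doc["project_uid"]
    for doc in db.jobs.find(
        {"status": "building", "deleted": False},
        {"_id": 0, "project_uid": 1},
    )
)
for puid, n in counts.most_common():
    print(puid, n)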

This is a different number, but it still seems far too high. I’ve checked a few of those jobs; they appear both on the filesystem and in the GUI.

Looking at one of them (P25 J251) in greater detail reveals that it is in state “building” and the event log only shows “Build”, but at the same time the job already has some output attached, which I would not expect. And of course that job is in fact NOT in building state. I can give you more info about the contents of the job.json if I know what to look for (I’m not allowed to attach this file here).

Would it actually help to change the status of all jobs that are in this wrong state to “completed” or “failed”, depending on what actually matches? I would, however, need to filter out jobs that are in fact still in building state.
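
To be clear, I only mean something read-only as a first step, e.g. a sketch along these lines that lists “building” jobs which already have output results attached (using output_results as a rough heuristic for a mismatched status; nothing is modified in the database):

# List non-deleted "building" jobs that already have output results attached.
# Read-only: no documents are changed.
suspect = []
for doc in db.jobs.find(
    {"status": "building", "deleted": False},
    {"_id": 0, "uid": 1, "project_uid": 1, "output_results": 1},
):
    if doc.get("output_results"):
        suspect.append((doc["project_uid"], doc["uid"]))

print(len(suspect))
print(suspect[:10])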

Maybe once that’s fixed, clearing intermediate results (the initial issue) would work again for that particular project (P25).

I should also note that a couple of projects on that instance have been copied over from an older CryoSPARC instance (both instances were running the same version of CryoSPARC at the time of migration).

Thanks for the help!

Please describe the output and where the output is shown or can be found.

What would be the expected state of the job(s) in question?

I would initially be interested in confirming the status (likely building) and the sequence of timestamps for some of the non-deleted, “building” jobs (via cryosparcm icli):

for (puid, juid) in [('P25', 'J251'),]:
     print(f"{puid}-{juid}: ",{pair[0]: pair[1] for pair in filter(lambda x: x[0].endswith('_at') or x[0] == 'status', cli.get_job(puid, juid).items())})

Please can you also post a screenshot of the cleanup dialog, recreating the options/parameters that were enabled when you encountered the problem (then press Cancel instead of Cleanup)?

Let’s focus on P25 J251 for now:

In [1]: for (puid, juid) in [('P25', 'J251'),]:
   ...:      print(f"{puid}-{juid}: ",{pair[0]: pair[1] for pair in filter(lambda x: x[0].endswith('_at') or x[0] == 'status', cli.get_job(puid, juid).items())})
   ...: 
P25-J251:  {'cluster_job_monitor_last_run_at': None, 'completed_at': None, 'created_at': 'Thu, 07 Sep 2023 13:16:24 GMT', 'failed_at': None, 'heartbeat_at': None, 'killed_at': None, 'last_intermediate_data_cleared_at': None, 'last_scheduled_at': None, 'launched_at': None, 'queued_at': None, 'running_at': None, 'started_at': None, 'status': 'building', 'tokens_acquired_at': None, 'tokens_requested_at': None, 'waiting_at': None}

Output of that job:
[Screenshot deleted]

The job can also be found in the job’s folder on the filesystem we’ve mounted on the cryosparc instance. It contains an events.bson file, a gridfs_data directory and a job.json file.

I can’t tell what the expected state would be, since I’m just the operator and not a user of that system, but since the job has output I suppose it should rather be “completed” than “building”. As noted before, this mismatch might result from an incorrect or incomplete merging of jobs from an older cryosparc instance.

There’s nothing I could cancel. When I press “Yes” as shown in the screenshot below, the dialog disappears and I don’t see any further dialog or other information on the state of my request.

Sorry, I would have liked to post my screenshot here, but since I’m a newbie I’m not allowed to attach more than one file…

Please can you

  • describe the process (including commands used) and the common CryoSPARC version of the two instances at the time of the “merging”
  • post the output of these icli commands (repetition of previously run command intended)
    puid, juid = 'P25', 'J251'
    print(f"{puid}-{juid}: ",{pair[0]: pair[1] for pair in filter(lambda x: x[0].endswith('_at') or x[0] == 'status', cli.get_job(puid, juid).items())})
    [(f.name, os.stat(f)) for f in os.scandir(os.path.join(cli.get_project(puid).get('project_dir'), juid))]
    
  • describe any measures that ensured that at no time more than one active CryoSPARC instance had access to the same project directory. If there may have been concurrent or overlapping access by two or more CryoSPARC instances, please describe the circumstances

This limitation should eventually be lifted based on your continued participation in the forum. Please can you try attaching that screenshot again.

Here’s my (vacation-delayed) response:

  • merging of jobs from the predecessor cryosparc instance:

    • process: The migration was performed following the detach/attach procedure outlined here. My only active involvement was making the job folders from the old instance available to the new instance by mounting them as an additional NFS volume.
    • CryoSPARC versions during initial migration:
      • old instance: v3.3
      • new instance: v3.3
    • CryoSPARC versions when copying over remaining jobs from old instance:
      • old instance: v4.2.1
      • new instance: v4.2.1
  • icli output:

In [1]: puid, juid = 'P25', 'J251'
In [2]: print(f"{puid}-{juid}: ",{pair[0]: pair[1] for pair in filter(lambda x: x[0].endswith('_at') or x[0] == 'status', cli.get_job(puid, juid).items())})
P25-J251:  {'cluster_job_monitor_last_run_at': None, 'completed_at': None, 'created_at': 'Thu, 07 Sep 2023 13:16:24 GMT', 'failed_at': None, 'heartbeat_at': None, 'killed_at': None, 'last_intermediate_data_cleared_at': None, 'last_scheduled_at': None, 'launched_at': None, 'queued_at': None, 'running_at': None, 'started_at': None, 'status': 'building', 'tokens_acquired_at': None, 'tokens_requested_at': None, 'waiting_at': None}
In [3]: [(f.name, os.stat(f)) for f in os.scandir(os.path.join(cli.get_project(puid).get('project_dir'), juid))]
Out[3]: 
[('gridfs_data',
  os.stat_result(st_mode=17917, st_ino=9264451663874637589, st_dev=58, st_nlink=2, st_uid=579467, st_gid=480361, st_size=4096, st_atime=1695183167, st_mtime=1694092584, st_ctime=1694092584)),
 ('events.bson',
  os.stat_result(st_mode=33204, st_ino=9264451668169878324, st_dev=58, st_nlink=1, st_uid=579467, st_gid=480361, st_size=18, st_atime=1695272912, st_mtime=1694509622, st_ctime=1694509622)),
 ('job.json',
  os.stat_result(st_mode=33204, st_ino=9264451646694415989, st_dev=58, st_nlink=1, st_uid=579467, st_gid=480361, st_size=95761, st_atime=1694853378, st_mtime=1694509622, st_ctime=1694509622))]
  • measures that ensured that at no time more than one active CryoSPARC instance had access to the same project directory:

    • the project directory used by the new instance is on a separate volume, which is inaccessible to the old instance, so the data has never been accessed by two instances simultaneously. Likewise, the old project directory is inaccessible to the new instance. Project/job directories have been rsynced via a separate NFS-mounted NAS share that is accessible to both instances.
  • screenshot of the job output of P25-J251 (top part only):

  • number of jobs in building status (again; note that the number of building jobs that are not flagged as deleted has remained the same):

In [10]: list(db.jobs.find({"status": "building"},{"_id": 0, "uid": 1, "project_uid": 1})).__len__()
Out[10]: 951
In [11]: list(db.jobs.find({"status": "building", "deleted": False},{"_id": 0, "uid": 1, "project_uid": 1})).__len__()
Out[11]: 154
  • screenshot when triggering “clear intermediate results”
    I still cannot attach more than one screenshot at a time, but I can describe it:
    After clicking on “Actions → Clear Intermediate Results” at the project level of P25, there’s a pop-up window asking: “Clear intermediate results for this project? [YES] [No]”. Right after clicking on “Yes”, the dialog disappears, thus not allowing me to cancel it.

I hope this info is enough to further narrow down the problem.

Thanks @richardw for posting additional details.
An Outputs tab with 0 particles and 0 volumes, and the presence of job.json, events.bson and gridfs_data/ inside the job directory, are all consistent with a job in building mode that has not yet run.
The transfer of project data between instances may have failed for various reasons, including, but not limited to,

  • errors during project import or attachment. Such errors may leave a message inside the command_core log. If you find such errors or warnings in the command_core log and require clarification, please post them on the forum.
  • disallowed (see below) interleaving of project detachment, data copying and project import/attachment

A few notes:

  1. The “handover” of project directories differs between CryoSPARC v3 and v4.
  2. rsync or cp between project directories must not occur after the destination copy has been imported into or attached to the destination instance. An exception to this rule is copying a directory created by a job or output group export into a subdirectory like imports/ inside the destination project directory, but outside any job directory within that project directory.

Unfortunately, the command_core logs no longer cover the period when the imports were processed.
Concerning your other point: I can’t completely guarantee that we were always fully compliant with point 2, although I remember that we and our customers adhered to the standard procedure:

  1. detach project on old instance
  2. rsync data to the appropriate directory on the new cryosparc instance
  3. re-attach project on new instance

In any case, I take from your replies that there’s no way to manually adjust the job statuses (via some Python script) in order to fix the inconsistencies and make clearing intermediate results work smoothly again?

cryosparc_master/run/ may contain additional rotated logs like command_core.log.1, command_core.log.2, etc.
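
If it helps, a small sketch along these lines can scan the live and rotated logs for import/attach-related warnings (the run directory path is an assumption; adjust it to the actual installation):

import glob
import os

# Placeholder path to the master's run directory; adjust to your installation.
run_dir = "/path/to/cryosparc_master/run"

for log_path in sorted(glob.glob(os.path.join(run_dir, "command_core.log*"))):
    with open(log_path, errors="replace") as fh:
        for line in fh:
            if "WARNING" in line and ("import" in line or "attach" in line):
                print(os.path.basename(log_path), line.rstrip())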

To me at least, there is no obvious way. But there is a caveat:

This job did not seem to have any intermediate results to clear. How about

for (puid, juid) in [('P25', 'J108'),]:
     print(f"{puid}-{juid}: ",{pair[0]: pair[1] for pair in filter(lambda x: x[0].endswith('_at') or x[0] == 'status', cli.get_job(puid, juid).items())})
     print([result.get('uid', '') for result in cli.get_job(puid, juid).get('output_results', {'uid': 'no result'})])

Thanks for pointing me to where the logs are stored as files; that makes things easier, and it indeed gives me access to the full timeline.

For P25-J108, however, I only find the following messages indicating my initial error:

2023-08-21 14:35:11,385 COMMAND.DATA         clear_intermediate_results_run WARNING  | AssertionError: Cannot clear intermediate results for P25 J108 in status building
2023-08-21 14:35:11,385 COMMAND.DATA         clear_intermediate_results_run WARNING  | Failed to clear intermediate results for P25, J108

There are, however, a number of errors/warnings matching import or attach, some of which look suspicious (but none matching P25-J108), e.g.:

2023-03-23 14:34:44,452 COMMAND.DATA         import_jobs          WARNING  | Unable to locate exported job document for P27 J75

This was printed several times (without any other messages in between), and then followed by this:

2023-03-23 14:34:44,657 COMMAND.DATA         import_project_run   WARNING  | Failed laying out tree in P27: 'J97'
2023-03-23 14:34:44,659 COMMAND.DATA         import_project_run   WARNING  | Imported project from /cluster/project/locher/projects/to-be-attached/P136 as P27 in 161.11s with errors.

This may indicate some problem with the import of P27, but it seems unrelated to the project/job I was looking at. I also detected a couple of still-present permission errors on a directory which was apparently intended to hold all datasets to be imported:

2023-07-10 16:12:07,673 COMMAND.DATA         clear_intermediate_results_run WARNING  | PermissionError: [Errno 13] Permission denied: '/cluster/project/locher/projects/to-be-attached/P2/J98/gridfs_data/gridfsdata_0'

Unfortunately, here too I can’t tell exactly what the users were doing or trying to achieve. However, since this directory now contains several files that aren’t writable by the user/group under which the application is running, a potential measure would be to fix those permission issues.

The output of the two lines you’ve proposed for P25-J108:

In [1]: for (puid, juid) in [('P25', 'J108'),]:
   ...:      print(f"{puid}-{juid}: ",{pair[0]: pair[1] for pair in filter(lambda x: x[0].endswith('_at')
   ...: or x[0] == 'status', cli.get_job(puid, juid).items())})
   ...:      print([result.get('uid', '') for result in cli.get_job(puid, juid).get('output_results', {'ui
   ...: d': 'no result'})])
   ...: 
P25-J108:  {'cluster_job_monitor_last_run_at': None, 'completed_at': None, 'created_at': 'Wed, 22 Mar 2023 17:33:24 GMT', 'failed_at': None, 'heartbeat_at': None, 'killed_at': None, 'last_intermediate_data_cleared_at': None, 'last_scheduled_at': None, 'launched_at': None, 'queued_at': None, 'running_at': None, 'started_at': None, 'status': 'building', 'tokens_acquired_at': None, 'tokens_requested_at': None, 'waiting_at': None}
['J108-R0', 'J108-R1', 'J108-R2', 'J108-R3', 'J108-R4', 'J108-R5', 'J108-R6', 'J108-R7', 'J108-R8', 'J108-R9', 'J108-R10']

Thanks again for the assistance!

The files were likely originally owned by the Linux account that runs/ran the source CryoSPARC instance. If you can confirm this, it may be best to adjust the ownership, not necessarily the permissions, of the copy’s files to match the destination CryoSPARC instance. But this decision depends on the specific local circumstances.
If the copy’s source has problems (like corrupt project or job directories with missing job documents), or inconsistencies were introduced during the copy process, even correcting file ownership and permissions of the faulty copy would not resolve the problem, of course.
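
As a starting point for that check, a sketch along these lines (the directory path and expected account name are placeholders) would report files whose owner differs from the account running the destination instance:

import os
import pwd

# Placeholders: the copied project directory and the Linux account that runs
# the destination CryoSPARC instance.
project_dir = "/path/to/copied/project"
expected_uid = pwd.getpwnam("cryosparcuser").pw_uid

for root, dirs, files in os.walk(project_dir):
    for name in dirs + files:
        path = os.path.join(root, name)
        st = os.lstat(path)
        if st.st_uid != expected_uid:
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                owner = str(st.st_uid)
            print(owner, path)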

We have a setup where access to cryosparc is controlled via a group that all users of the instance belong to, so I only have to adjust group permissions (g+rw) for the files/folders within the affected directory. I’m running this now and will then try again and let you know.
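
For completeness, the adjustment amounts to the Python equivalent of chmod -R g+rwX on the affected directory, roughly along these lines (the path is a placeholder; giving directories group execute as well is my assumption so that they stay traversable):

import os
import stat

# Placeholder: the directory whose group permissions need fixing.
top = "/path/to/affected/directory"

for root, dirs, files in os.walk(top):
    for d in dirs:
        p = os.path.join(root, d)
        mode = stat.S_IMODE(os.stat(p).st_mode)
        # Directories also get group execute so they remain traversable.
        os.chmod(p, mode | stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP)
    for f in files:
        p = os.path.join(root, f)
        mode = stat.S_IMODE(os.stat(p).st_mode)
        os.chmod(p, mode | stat.S_IRGRP | stat.S_IWGRP)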