Database corruption while attaching project

Our cluster has recently had bad disks, and the administrator has been dealing with them, so the distributed storage frequently becomes unmounted.

If the cluster storage becomes unmounted while I am attaching a project, the database exits. After restarting CryoSPARC, many jobs are missing from the project, and I don’t know how to re-import them.

@luisshulk Is this question related to Missing jobs after attaching project - #2 by luisshulk?

Yes, the missing jobs in some projects were caused by the bad disks. I want to find my jobs quickly, so I started a new topic.
Some log messages are in that question:
Missing jobs after attaching project

Thanks a lot.

@luisshulk Please can you provide more details on the relationship(s) between

  • the cluster
  • the faulty disks
  • database storage (also: what type of filesystem and redundancy)
  • project directory storage (also: what type of filesystem and redundancy)

Please ensure all storage issues that could affect project directories or the database have been identified and resolved before attempting recovery.

Thanks for your reply.

Actually, the faulty disk problem was solved yesterday, and the CryoSPARC database has not exited since. So I think it is time to recover the jobs that were not imported.

Sorry about the problem I described earlier: our cluster sometimes returned “stale file handle” errors because of the faulty disks and corrupted data, which caused the CryoSPARC database to exit unexpectedly. That problem has since been solved by our cluster administrator.

The only remaining question is how to recover the jobs that went missing when the database exited unexpectedly during project import.

Thanks a lot.

I am still not sure whether the disk and fileserver problems led to corruption of the project directory, the database, or both.
Please can you confirm whether the project directory whose attachment failed is the only copy available, or whether a backup exists?
If it is the only copy, to help determine the extent of possible corruption, you may want to try the steps described in this post. Warning: These steps are not generally recommended as they would disrupt the normal data management workflow of a functioning CryoSPARC instance. Their use here is justified only because the storage malfunction may have corrupted the project directory, the database, or both.

  1. mark as detached the project whose attachment/import has failed when there were storage disruptions,
    cryosparcm icli # enter the interactive CryoSPARC cli
    puid = 'P9999' # replace with actual project UID
    db.projects.update_one({'uid': puid},{"$set": {"detached": True}})
    exit()
    
    then
  2. rename the cs.lock file inside the project directory
    cd /path/to/project/dir/
    mv cs.lock cs.lock_old20240118
    
  3. reattempt the project attachment.
  4. observe: do the previously missing jobs show up?
  5. check the command core log for errors:
    cryosparcm filterlog command_core -l ERROR
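
If it helps, before renaming cs.lock you can also confirm the flag change (and see how many job documents the database currently holds for that project) from the same interactive shell. This is only a sketch: it assumes the icli db handle behaves like a standard PyMongo database and that job documents live in a jobs collection keyed by project_uid, consistent with the update in step 1.

cryosparcm icli # enter the interactive CryoSPARC cli
puid = 'P9999' # replace with actual project UID
db.projects.find_one({'uid': puid}, {'detached': 1, 'import_status': 1}) # should now report detached: True
db.jobs.count_documents({'project_uid': puid}) # job documents currently recorded for this project
exit()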
    

Thanks for your solution.

I have 6 project directories whose attachment failed. 3 of them were recovered by your solution, but the other projects could not be recovered. Here is one of the **cryosparcm filterlog command_core -l ERROR** results; another failed project produces the same error log.

2024-01-19 17:34:55,066 wrapper              ERROR    | JSONRPC ERROR at set_user_viewed_project
2024-01-19 17:34:55,066 wrapper              ERROR    | Traceback (most recent call last):
2024-01-19 17:34:55,066 wrapper              ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/commandcommon.py", line 195, in wrapper
2024-01-19 17:34:55,066 wrapper              ERROR    |     res = func(*args, **kwargs)
2024-01-19 17:34:55,066 wrapper              ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1196, in set_user_viewed_project
2024-01-19 17:34:55,066 wrapper              ERROR    |     update_project(project_uid, {'last_accessed' : {'name' : get_username_by_id(user_id), 'accessed_at' : datetime.datetime.utcnow()}}, operation='$set', export=False)
2024-01-19 17:34:55,066 wrapper              ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/commandcommon.py", line 186, in wrapper
2024-01-19 17:34:55,066 wrapper              ERROR    |     return func(*args, **kwargs)
2024-01-19 17:34:55,066 wrapper              ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/commandcommon.py", line 261, in wrapper
2024-01-19 17:34:55,066 wrapper              ERROR    |     assert not project['detached'], f"validation error: project {project_uid} is detached"
2024-01-19 17:34:55,066 wrapper              ERROR    | AssertionError: validation error: project P19 is detached

And here is a screenshot of a message that disappeared quickly:

[screenshot]

P33 is the re-attachment of P19.

And the last failed project got this error log:

cryosparcm filterlog command_core -l ERROR

2024-01-19 19:52:33,552 import_project_run   ERROR    | Unable to import project from /work/caolab/yu.cao/SYVN1
2024-01-19 19:52:33,552 import_project_run   ERROR    | Traceback (most recent call last):
2024-01-19 19:52:33,552 import_project_run   ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/command_core/__init__.py", line 4479, in import_project_run
2024-01-19 19:52:33,552 import_project_run   ERROR    |     warning = import_jobs(jobs_manifest, abs_path_export_project_dir, new_project_uid, owner_user_id, notification_id) or warning
2024-01-19 19:52:33,552 import_project_run   ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/command_core/__init__.py", line 4722, in import_jobs
2024-01-19 19:52:33,552 import_project_run   ERROR    |     job_doc_data = json.load(openfile, object_hook=json_util.object_hook)
2024-01-19 19:52:33,552 import_project_run   ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/__init__.py", line 293, in load
2024-01-19 19:52:33,552 import_project_run   ERROR    |     return loads(fp.read(),
2024-01-19 19:52:33,552 import_project_run   ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/__init__.py", line 370, in loads
2024-01-19 19:52:33,552 import_project_run   ERROR    |     return cls(**kw).decode(s)
2024-01-19 19:52:33,552 import_project_run   ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/decoder.py", line 337, in decode
2024-01-19 19:52:33,552 import_project_run   ERROR    |     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2024-01-19 19:52:33,552 import_project_run   ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/decoder.py", line 355, in raw_decode
2024-01-19 19:52:33,552 import_project_run   ERROR    |     raise JSONDecodeError("Expecting value", s, err.value) from None
2024-01-19 19:52:33,552 import_project_run   ERROR    | json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2024-01-19 19:52:33,626 run                  ERROR    | POST-RESPONSE-THREAD ERROR at import_project_run
2024-01-19 19:52:33,626 run                  ERROR    | Traceback (most recent call last):
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/commandcommon.py", line 72, in run
2024-01-19 19:52:33,626 run                  ERROR    |     self.target(*self.args)
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/command_core/__init__.py", line 4479, in import_project_run
2024-01-19 19:52:33,626 run                  ERROR    |     warning = import_jobs(jobs_manifest, abs_path_export_project_dir, new_project_uid, owner_user_id, notification_id) or warning
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/cryosparc_command/command_core/__init__.py", line 4722, in import_jobs
2024-01-19 19:52:33,626 run                  ERROR    |     job_doc_data = json.load(openfile, object_hook=json_util.object_hook)
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/__init__.py", line 293, in load
2024-01-19 19:52:33,626 run                  ERROR    |     return loads(fp.read(),
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/__init__.py", line 370, in loads
2024-01-19 19:52:33,626 run                  ERROR    |     return cls(**kw).decode(s)
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/decoder.py", line 337, in decode
2024-01-19 19:52:33,626 run                  ERROR    |     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2024-01-19 19:52:33,626 run                  ERROR    |   File "/cm/shared/apps/cryosparc/cylab/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/json/decoder.py", line 355, in raw_decode
2024-01-19 19:52:33,626 run                  ERROR    |     raise JSONDecodeError("Expecting value", s, err.value) from None
2024-01-19 19:52:33,626 run                  ERROR    | json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

And also a screenshot:

[screenshot]

By the way, can I delete the old detached projects from the database?

Please can you post the output of the command

cat /work/caolab/yu.cao/SYVN1/project.json

Please see this section of the guide.

This is the result of

cat /work/caolab/yu.cao/SYVN1/project.json
{
    "uid": "P35",
    "uid_num": 35,
    "title": "SYVN1",
    "description": "Enter a description.",
    "project_dir": "/work/caolab/yu.cao/SYVN1",
    "project_params_pdef": {},
    "owner_user_id": "658012f352c2b931070919c6",
    "created_at": {
        "$date": 1696649347734
    },
    "status": "completed",
    "queue_paused": false,
    "deleted": false,
    "users_with_access": [
        "658012f352c2b931070919c6"
    ],
    "size": 1166289964295,
    "last_accessed": {
        "name": "yu.cao",
        "accessed_at": {
            "$date": 1705822458372
        }
    },
    "archived": false,
    "detached": false,
    "hidden": true,
    "project_stats": {
        "workspace_count": 9,
        "session_count": 0,
        "job_count": 116,
        "job_types": {
            "import_particles": 21,
            "hetero_refine": 21,
            "homo_refine": 14,
            "class_2D": 12,
            "select_2D": 11,
            "homo_abinit": 11,
            "nonuniform_refine_new": 9,
            "import_volumes": 8,
            "nonuniform_refine": 5,
            "particle_sets": 3,
            "homo_refine_new": 1
        },
        "job_sections": {
            "import": 29,
            "particle_curation": 23,
            "reconstruction": 11,
            "refinement": 50,
            "utilities": 3
        },
        "job_status": {
            "completed": 112,
            "killed": 2,
            "building": 2
        },
        "updated_at": {
            "$date": 1705825573690
        }
    },
    "generate_intermediate_results_settings": {
        "class_2D": false,
        "class_3D": false,
        "var_3D_disp": false
    },
    "created_at_version": null,
    "import_status": "failed",
    "imported": true,
    "imported_at": {
        "$date": 1705664824026
    },
    "intermediate_results_size_bytes": 91808810634,
    "intermediate_results_size_last_updated": {
        "$date": 1705855697680
    },
    "size_last_updated": {
        "$date": 1705855697680
    }
}

There may be an empty or corrupt job document inside the project directory.
What is the output of the command

find /work/caolab/yu.cao/SYVN1 -name job.json -empty

?
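
If a job.json file is corrupt rather than empty, the find command above will not report it. A quick, generic check like the following (just an illustrative shell snippet, not a CryoSPARC command) would list any job.json in the project directory that fails to parse as JSON; note that empty files fail too, so the files found above will also be listed:

find /work/caolab/yu.cao/SYVN1 -name job.json | while read -r f; do
    python3 -c 'import json, sys; json.load(open(sys.argv[1]))' "$f" 2>/dev/null || echo "unparsable: $f"
done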

The output is:

/work/caolab/yu.cao/SYVN1/J23/job.json
/work/caolab/yu.cao/SYVN1/J229/job.json

And how about this one? It looks different from the other failed projects.

Are the files

/work/caolab/yu.cao/SYVN1/J23/job.log
/work/caolab/yu.cao/SYVN1/J229/job.log

available? What project ID do the contents of each job.log file refer to?
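
If they exist, one quick (if crude) way to check, assuming the project UID appears somewhere in the log text in the usual Pxx form, is to print the first such occurrence in each file:

grep -m 1 -o -E 'P[0-9]+' /work/caolab/yu.cao/SYVN1/J23/job.log
grep -m 1 -o -E 'P[0-9]+' /work/caolab/yu.cao/SYVN1/J229/job.log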

Please inspect the command_core log for related messages. If the current

/cm/shared/apps/cryosparc/cylab/cryosparc_master/run/command_core.log

file’s earliest entry is newer than the P33 attachment attempt, you may have to check older versions of the file:
command_core.log.1, command_core.log.2, etc.
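
For example, assuming the rotated logs are plain text files in the same run directory, you could search all of them at once for entries that mention the re-attached project:

grep -l 'P33' /cm/shared/apps/cryosparc/cylab/cryosparc_master/run/command_core.log*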