Followed your instructions and sent you the log file by email. J70 still waited in the queue. After turning off the logging and running cryosparcm restart, J70 started running.
Thanks for sending the logs. Could you also please post the outputs of the following commands:
cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d956481a656dde2f4a856', 'completed_at': 'Fri, 19 Jan 2024 21:54:13 GMT', 'created_at': 'Tue, 09 Jan 2024 18:50:12 GMT', 'job_type': 'restack_particles', 'launched_at': 'Fri, 19 Jan 2024 21:49:33 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:31 GMT', 'started_at': 'Fri, 19 Jan 2024 21:50:10 GMT', 'uid': 'J69'}
cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d95ca81a656dde2f4e484', 'completed_at': None, 'created_at': 'Tue, 09 Jan 2024 18:51:54 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Fri, 19 Jan 2024 21:59:22 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:37 GMT', 'started_at': 'Fri, 19 Jan 2024 22:01:23 GMT', 'uid': 'J70'}
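As a quick sanity check (this is not part of CryoSPARC, just a sketch), the queue-to-launch delay for J70 can be computed from the RFC 2822 timestamps in the get_job output above:

```python
# Compute how long J70 sat between queueing and launch, using the
# timestamps returned by `cryosparcm cli "get_job(...)"` above.
from email.utils import parsedate_to_datetime

j70 = {
    'queued_at': 'Fri, 19 Jan 2024 21:49:37 GMT',
    'launched_at': 'Fri, 19 Jan 2024 21:59:22 GMT',
}

delay = (parsedate_to_datetime(j70['launched_at'])
         - parsedate_to_datetime(j70['queued_at']))
print(delay.total_seconds())  # ~585 s between queueing and launch
```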
Interesting. Please can you also run this command:
cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '6596bf229906ac8299b73c58', 'completed_at': 'Tue, 02 Jan 2024 03:58:58 GMT', 'created_at': 'Tue, 02 Jan 2024 03:12:59 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Tue, 02 Jan 2024 03:13:53 GMT', 'project_uid': 'P16', 'queued_at': 'Tue, 02 Jan 2024 03:13:52 GMT', 'started_at': 'Tue, 02 Jan 2024 03:14:06 GMT', 'uid': 'J53'}
@jhzhu We appreciate your efforts in gathering debugging information. Unfortunately, we could not identify the cause of the problem from the logs. It is possible that some additional job(s) whose lower-level inputs were required did not complete. We suggest starting processing from scratch in a new project.
OK. I started a new project to test, using just the “Extensive Validation” workflow. I still have the same problem.
@jhzhu We unfortunately do not know what is causing this problem. I understand that currently, the cryosparc_master services run under slurm, and CryoSPARC in turn manages its own job queue, which, taken together, constitutes two “layers” of workload management. I wonder whether simplifying workload management would help in either diagnosing or circumventing the problem.
You could try:

Alternative 1: running cryosparc_master processes outside slurm, so that the cryosparc_master processes would not be interrupted by slurm.

Alternative 2: running cryosparc_master processes as a slurm job, so that the cryosparc_master processes are running on a GPU node, with CRYOSPARC_MASTER_HOSTNAME, CRYOSPARC_HOSTNAME_CHECK and the cryosparcw connect --worker parameter all set to localhost. (These settings are incompatible for a CryoSPARC instance with worker nodes in addition to the cryosparc_master node.)
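For illustration only, Alternative 2 might look roughly like the sketch below; the file paths and the port number are assumptions, so check them against your own installation before changing anything:

```shell
# Sketch of Alternative 2 (single-node master/worker on a GPU node).
# In cryosparc_master/config.sh, point the master at localhost:
export CRYOSPARC_MASTER_HOSTNAME="localhost"
export CRYOSPARC_HOSTNAME_CHECK="localhost"

# Then connect the same-node worker to the master, also via localhost
# (port 39000 is the common default; adjust to your instance):
# cryosparc_worker/bin/cryosparcw connect --worker localhost --master localhost --port 39000
```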