Stuck in queue forever

I followed your instructions and sent you the log file by email. J70 still waited in the queue.

After turning off the logging and running cryosparcm restart, J70 started running.

Thanks for sending the logs. Can you please also post

  1. outputs of the commands
    cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
    cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
    
  2. the expanded Inputs section of job J70.
cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d956481a656dde2f4a856', 'completed_at': 'Fri, 19 Jan 2024 21:54:13 GMT', 'created_at': 'Tue, 09 Jan 2024 18:50:12 GMT', 'job_type': 'restack_particles', 'launched_at': 'Fri, 19 Jan 2024 21:49:33 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:31 GMT', 'started_at': 'Fri, 19 Jan 2024 21:50:10 GMT', 'uid': 'J69'}
cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d95ca81a656dde2f4e484', 'completed_at': None, 'created_at': 'Tue, 09 Jan 2024 18:51:54 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Fri, 19 Jan 2024 21:59:22 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:37 GMT', 'started_at': 'Fri, 19 Jan 2024 22:01:23 GMT', 'uid': 'J70'}

Interesting. Please can you also run this command:

cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"

cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '6596bf229906ac8299b73c58', 'completed_at': 'Tue, 02 Jan 2024 03:58:58 GMT', 'created_at': 'Tue, 02 Jan 2024 03:12:59 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Tue, 02 Jan 2024 03:13:53 GMT', 'project_uid': 'P16', 'queued_at': 'Tue, 02 Jan 2024 03:13:52 GMT', 'started_at': 'Tue, 02 Jan 2024 03:14:06 GMT', 'uid': 'J53'}
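For reference, the difference between queued_at and launched_at in the outputs above shows how long each job sat in the queue. A quick Python sketch (timestamps copied verbatim from the outputs posted in this thread) makes the contrast explicit: J53 and J69 launched within seconds, while J70 waited close to ten minutes.

```python
from datetime import datetime

# Format used by the get_job outputs above (GMT timestamps).
FMT = "%a, %d %b %Y %H:%M:%S %Z"

# (queued_at, launched_at) pairs copied from the posted outputs.
jobs = {
    "J53": ("Tue, 02 Jan 2024 03:13:52 GMT", "Tue, 02 Jan 2024 03:13:53 GMT"),
    "J69": ("Fri, 19 Jan 2024 21:49:31 GMT", "Fri, 19 Jan 2024 21:49:33 GMT"),
    "J70": ("Fri, 19 Jan 2024 21:49:37 GMT", "Fri, 19 Jan 2024 21:59:22 GMT"),
}

for uid, (queued, launched) in jobs.items():
    wait = datetime.strptime(launched, FMT) - datetime.strptime(queued, FMT)
    print(f"{uid}: waited {wait.total_seconds():.0f} s in the queue")
```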

@jhzhu We appreciate your efforts in gathering debugging information. Unfortunately, we could not identify the cause of the problem from the logs. It is possible that some additional job(s) whose lower-level inputs were required did not complete. We suggest starting processing from scratch in a new project.

OK. I started a new project to test, using just the “Extensive Validation” workflow. I still have the same problem.

@jhzhu We unfortunately do not know what is causing this problem. I understand that currently, cryosparc_master services

  1. run within their own slurm job allocation
  2. launch additional slurm jobs,

which, taken together, constitute two “layers” of workload management. I wonder whether simplifying workload management would help in either diagnosing or circumventing the problem.

You could try:
Alternative 1. running cryosparc_master processes outside slurm

  • CryoSPARC jobs would be submitted to slurm

Alternative 2. running cryosparc_master processes as a slurm job

  • would need to ensure that cryosparc_master processes would not be interrupted by slurm
  • would need to ensure that cryosparc_master processes are running on a GPU node
  • would need to configure CryoSPARC in single workstation mode
  • would run jobs on the same host as cryosparc_master processes
  • could simplify setup by setting CRYOSPARC_MASTER_HOSTNAME, CRYOSPARC_HOSTNAME_CHECK, and the
    cryosparcw connect --worker parameter all to localhost. (These settings are incompatible with a CryoSPARC instance that has worker nodes in addition to the cryosparc_master node.)
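A minimal sketch of what an Alternative 2 submission script could look like. This is an illustration, not a tested recipe: the partition name "gpu", the install paths under $HOME, and the time limit are all assumptions you would need to adapt to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=cryosparc_master
#SBATCH --partition=gpu          # assumption: your GPU partition may be named differently
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --time=7-00:00:00        # long enough that slurm does not interrupt the master
#SBATCH --no-requeue             # avoid slurm requeueing/restarting the master mid-run

# Single workstation mode: master and worker on the same host.
export CRYOSPARC_MASTER_HOSTNAME=localhost
export CRYOSPARC_HOSTNAME_CHECK=localhost

# Assumed install path; adjust to your installation.
~/cryosparc_master/bin/cryosparcm start

# One-time worker connection on this host (normally done once during setup):
# ~/cryosparc_worker/bin/cryosparcw connect --worker localhost --master localhost

# Keep the allocation alive while the master services run.
sleep infinity
```

Note that if the allocation ends (time limit, node failure), the master services stop with it, which is why the time limit and --no-requeue settings matter here.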