Automating Job Resubmission After Failure Due to OOM in CryoSPARC

Hi,
In CryoSPARC, when a job fails due to an OUT_OF_MEMORY (OOM) error, its status is recorded as “failed” in the CryoSPARC database. When the failure is caused by memory limitations, we need a way to automatically resubmit the same job to a higher-memory node while keeping the same job ID for tracking.

I am specifically looking for a CryoSPARC CLI command or built-in functionality that can achieve this. If such a CLI command exists, we can automate the process using a shell script that would:

  1. Identify failed jobs in the CryoSPARC database.
  2. Extract job details, such as job ID and required resources.
  3. Resubmit the job with increased memory allocation.
  4. Assign it to a node with sufficient memory.
  5. Run periodically via cron to automate this process.

Does CryoSPARC provide any built-in commands or API options to achieve this?

Any guidance or suggestions would be highly appreciated!

System Info:

  • CryoSPARC Version: 4.6.2
  • Scheduler: Slurm
  • OS: alinux2

In addition, we get this error quite frequently, and a resubmission without any changes works without issue. So a configurable number of automatic resubmissions to the same node, without any parameter changes, could be useful as well. It is pretty annoying when long queued workflows are stalled by these failures.

Here is an outline of a possible implementation. The code snippets and CLI functions should work in CryoSPARC v4.6, but may not work with future releases. Other details, like the path of the cluster submission script, may also change in a future release of CryoSPARC.

  1. Include the custom_ram_gb variable
    #SBATCH --mem={{ custom_ram_gb | default(ram_gb) | int }}G
    
    in the Slurm script template of the hypothetical scheduler lane cryosparc_lane. Once identified, OOM jobs can be submitted to this lane with an increased RAM request supplied via custom_ram_gb.
  2. Find jobs that failed within a defined TIME_WINDOW ending at the current time.
    from cryosparc_compute import database_management
    import datetime
    
    # Only consider jobs that failed within this window, ending at the current time
    TIME_WINDOW = datetime.timedelta(days=1)
    
    # Connect to the CryoSPARC ("meteor") MongoDB database
    db = database_management.get_pymongo_client('meteor')['meteor']
    
    # Print the project and job UIDs of recently failed jobs, one pair per line
    for job in db.jobs.find({'status': 'failed',
                             'failed_at': {'$gt': datetime.datetime.utcnow() - TIME_WINDOW}},
                            {'project_uid': 1, 'uid': 1}):
        print(f"{job['project_uid']} {job['uid']}")
    
    One could run this script with the command
    cryosparcm call python /path/to/recently_failed.py
  3. Refer to the Slurm accounting database of jobs and/or the queue_sub_script.sh scripts in the respective job directories and/or the Slurm stdout/stderr files to select, among the failed jobs, those that failed with OOM errors (see the sketch after this list).
  4. Clear and resubmit (with an increased --mem= specification) the jobs from the list of OOM jobs. The clear_job, set_cluster_job_custom_vars and enqueue_job CLI functions may be useful in this context, as in the sketch below.
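
To tie steps 3 and 4 together, here is a minimal sketch of a wrapper that could be run from cron (item 5 of the wish list above). It is only an illustration under several assumptions: it shells out to cryosparcm cli rather than using a Python client, it assumes the Slurm/CryoSPARC log files that record the OOM condition end up in each job's directory (this depends on your cluster script template), and PROJECT_DIR, LANE, CUSTOM_RAM_GB, the log glob and the exact CLI argument names are placeholders that should be verified against your installation and CryoSPARC version.

    # resubmit_oom.py -- illustrative sketch of steps 3 and 4; adjust all paths and names
    import pathlib
    import re
    import subprocess
    
    PROJECT_DIR = pathlib.Path("/path/to/projects")  # root of the CryoSPARC project directories (assumption)
    LANE = "cryosparc_lane"                          # hypothetical high-memory lane from step 1
    CUSTOM_RAM_GB = 256                              # increased request passed to custom_ram_gb (assumption)
    OOM_PATTERN = re.compile(r"oom-kill|out[ _-]?of[ _-]?memory", re.IGNORECASE)
    
    def cli(call: str) -> None:
        """Invoke a CryoSPARC CLI function, e.g. cli("clear_job('P1', 'J2')")."""
        subprocess.run(["cryosparcm", "cli", call], check=True)
    
    def had_oom(project_uid: str, job_uid: str) -> bool:
        """Heuristic OOM check: scan log files in the job directory for OOM messages.
        Assumes the cluster script template writes Slurm stdout/stderr (or the job log)
        into the job directory; adjust the glob to match your template."""
        job_dir = PROJECT_DIR / project_uid / job_uid
        return any(OOM_PATTERN.search(p.read_text(errors="ignore"))
                   for p in job_dir.glob("*.log"))
    
    def resubmit(project_uid: str, job_uid: str) -> None:
        """Clear the job, raise its RAM request, and re-queue it on the high-memory lane.
        The function names come from step 4; check the exact argument names for your
        CryoSPARC version before relying on them."""
        cli(f"clear_job('{project_uid}', '{job_uid}')")
        cli(f"set_cluster_job_custom_vars('{project_uid}', '{job_uid}', "
            f"{{'custom_ram_gb': {CUSTOM_RAM_GB}}})")
        cli(f"enqueue_job('{project_uid}', '{job_uid}', lane='{LANE}')")
    
    if __name__ == "__main__":
        # Re-use the step 2 script, which prints one "P... J..." pair per line
        failed = subprocess.run(
            ["cryosparcm", "call", "python", "/path/to/recently_failed.py"],
            check=True, capture_output=True, text=True).stdout
        for line in failed.splitlines():
            project_uid, job_uid = line.split()
            if had_oom(project_uid, job_uid):
                resubmit(project_uid, job_uid)

A crontab entry along the lines of

    0 * * * * /usr/bin/python3 /path/to/resubmit_oom.py

would then run the check hourly, covering item 5 of the original list.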