Automating Job Resubmission After Failure Due to OOM in CryoSPARC

Here is an outline of a possible implementation. The code snippet and cli functions below should work in CryoSPARC v4.6, but may not work with future releases of CryoSPARC. Other details, such as the path of the cluster submission script, may also change in a future release of CryoSPARC.

  1. Include the custom_ram_gb variable
    #SBATCH --mem={{ custom_ram_gb | default(ram_gb) | int }}G
    
    in the slurm script template of the hypothetical scheduler lane cryosparc_lane. The default(ram_gb) filter keeps the standard RAM request for jobs without an override. Once identified, OOM jobs can then be resubmitted to this lane with an increased RAM request supplied through custom_ram_gb.
  2. Find jobs that failed within a defined TIME_WINDOW ending at the current time.
    import datetime
    
    from cryosparc_compute import database_management
    
    # Only consider failures within the most recent TIME_WINDOW.
    TIME_WINDOW = datetime.timedelta(days=1)
    
    # Connect to CryoSPARC's MongoDB database (named 'meteor').
    db = database_management.get_pymongo_client('meteor')['meteor']
    
    # Report the project and job UIDs of recently failed jobs.
    for job in db.jobs.find({'status': 'failed',
                             'failed_at': {'$gt': datetime.datetime.utcnow() - TIME_WINDOW}},
                            {'project_uid': 1, 'uid': 1}):
        print(f"{job['project_uid']} {job['uid']}")
    
    One could run this script with the command
    cryosparcm call python /path/to/recently_failed.py
  3. Consult the slurm accounting database, the queue_sub_script.sh scripts in the respective job directories, and/or the slurm stdout/stderr files to select, from among the failed jobs, those that encountered OOM errors. A sketch of one possible check follows this list.
  4. Clear and resubmit (with an increased --mem= specification) the jobs identified as OOM failures. The clear_job, set_cluster_job_custom_vars and enqueue_job cli functions may be useful in this context; the second sketch after this list illustrates the idea.
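
Below is a minimal sketch for step 3, assuming slurm accounting (sacct) is available and that the cluster script template retains a job name of the form cryosparc_<project_uid>_<job_uid>, as in the example slurm template. The helper name was_oom and the project/job UIDs in the last line are hypothetical and would come from the output of step 2.

    import datetime
    import subprocess
    
    def was_oom(project_uid, job_uid, window=datetime.timedelta(days=1)):
        """Return True if slurm accounting reports an OUT_OF_MEMORY state for
        the cluster job that ran this (hypothetical) CryoSPARC job."""
        start = (datetime.datetime.now() - window).strftime('%Y-%m-%dT%H:%M:%S')
        result = subprocess.run(
            ['sacct', '--noheader', '--parsable2',
             '--starttime', start,
             '--name', f'cryosparc_{project_uid}_{job_uid}',
             '--format', 'JobID,State'],
            capture_output=True, text=True, check=True)
        # slurm marks OOM-killed jobs and job steps with the OUT_OF_MEMORY state.
        return any('OUT_OF_MEMORY' in line for line in result.stdout.splitlines())
    
    print(was_oom('P12', 'J345'))  # hypothetical UIDs from step 2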
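
And a minimal sketch for step 4, intended to be run with cryosparcm call python like the script above so that the CRYOSPARC_MASTER_HOSTNAME and CRYOSPARC_BASE_PORT environment variables are defined. The project/job UIDs, lane name and RAM value are hypothetical placeholders, and the exact parameters accepted by clear_job, set_cluster_job_custom_vars and enqueue_job should be confirmed against the cli reference for the installed CryoSPARC version.

    import os
    
    from cryosparc_compute import client
    
    # Hypothetical placeholders: an OOM job identified in step 3, the lane from
    # step 1 and an increased RAM request in GB.
    project_uid, job_uid = 'P12', 'J345'
    lane = 'cryosparc_lane'
    new_ram_gb = 512
    
    # command_core conventionally listens at the base port + 2.
    cli = client.CommandClient(
        host=os.environ['CRYOSPARC_MASTER_HOSTNAME'],
        port=int(os.environ['CRYOSPARC_BASE_PORT']) + 2)
    
    cli.clear_job(project_uid, job_uid)  # reset the failed job
    cli.set_cluster_job_custom_vars(project_uid, job_uid,
                                    {'custom_ram_gb': new_ram_gb})
    cli.enqueue_job(project_uid, job_uid, lane=lane)  # resubmit with more RAM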