Here is an outline of a possible implementation. The code snippet and cli functions should work in CryoSPARC v4.6, but may not work with future releases of CryoSPARC. Other details, such as the path of the cluster submission script, may also change in a future release.
- Include the `custom_ram_gb` variable in the slurm script template of the hypothetical scheduler lane `cryosparc_lane` (a template sketch follows this list):

  ```
  #SBATCH --mem={{ custom_ram_gb | default(ram_gb) | int }}G
  ```

  Once identified, OOM jobs can be submitted to this lane with an augmented RAM request for `custom_ram_gb`.
- Find jobs that failed within a defined `TIME_WINDOW` ending now:

  ```python
  from cryosparc_compute import database_management
  import datetime

  TIME_WINDOW = datetime.timedelta(days=1)

  # connect to the CryoSPARC ("meteor") MongoDB database
  db = database_management.get_pymongo_client('meteor')['meteor']

  # print project and job UIDs of jobs that failed within TIME_WINDOW of now
  for job in db.jobs.find(
          {'status': 'failed',
           'failed_at': {'$gt': datetime.datetime.utcnow() - TIME_WINDOW}},
          {'project_uid': 1, 'uid': 1}):
      print(f"{job['project_uid']} {job['uid']}")
  ```

  One could run this script with the command

  ```
  cryosparcm call python /path/to/recently_failed.py
  ```
- Refer to the slurm database of jobs and/or the `queue_sub_script.sh` scripts in the respective job directories and/or the slurm stdout/stderr files to select, among the failed jobs, those that failed with OOM errors (see the `sacct` sketch after this list).
- Clear and resubmit (with an increased `--mem=` specification) the jobs on the OOM list. The `clear_job`, `set_cluster_job_custom_vars` and `enqueue_job` cli functions may be useful in this context (see the resubmission sketch after this list).
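For the first step, the memory request line goes into the lane's `cluster_script.sh`. The sketch below is an illustration only: apart from the `--mem` line, the directives, template variables and `--job-name` pattern should be taken from the template your site already uses, and the lane name `cryosparc_lane` is hypothetical.

```bash
#!/usr/bin/env bash
## cluster_script.sh (sketch) for the hypothetical lane "cryosparc_lane".
## Only the --mem line is specific to this workaround; keep all other
## directives and template variables from your existing site template.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ custom_ram_gb | default(ram_gb) | int }}G
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

{{ run_cmd }}
```

The edited template can be re-registered with `cryosparcm cluster connect`, run from a directory that contains the lane's `cluster_info.json` and `cluster_script.sh`.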
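To pick out the OOM failures among the recently failed jobs, one option is the slurm accounting database. A minimal sketch, assuming slurm accounting is enabled, a one-day window, and a lane template that names slurm jobs `cryosparc_<project_uid>_<job_uid>` (all of these are site-specific assumptions):

```bash
# list slurm jobs that ended in the OUT_OF_MEMORY state during the last day,
# keeping only those whose name marks them as CryoSPARC jobs
sacct --starttime now-1days --state=OOM --noheader --parsable2 \
      --format=JobID,JobName,State,ReqMem,MaxRSS \
  | grep 'cryosparc_'
```

Alternatively, the slurm stderr files whose paths appear in each job directory's `queue_sub_script.sh` can be grepped for oom-kill messages.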
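For the last step, the cli functions can be called through `cryosparcm cli`. A minimal sketch for a single job; the UIDs `P3`/`J42`, the 512 GB figure and the exact argument lists are assumptions that should be checked against the cli reference for your CryoSPARC version:

```bash
PROJECT=P3    # example project UID of an OOM job
JOB=J42       # example job UID of an OOM job

# reset the failed job so that it can be queued again
cryosparcm cli "clear_job('$PROJECT', '$JOB')"

# request more RAM via the custom_ram_gb variable defined in the lane template
cryosparcm cli "set_cluster_job_custom_vars('$PROJECT', '$JOB', {'custom_ram_gb': 512})"

# re-queue the job on the lane whose template understands custom_ram_gb
cryosparcm cli "enqueue_job('$PROJECT', '$JOB', lane='cryosparc_lane')"
```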