Jobs not submitting to slurm scheduler

Hi!

I’m posting here because I started having issues submitting jobs to a SLURM scheduler shortly after updating to v3.3. In the last couple of days I have been getting the error below when I try to submit jobs, which indicates a problem allocating memory:

ServerError: Traceback (most recent call last):
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2309, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1861, in scheduler_run
    scheduler_run_core(do_run)
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2079, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 157, in wrapper
    raise ServerError(s.getvalue(), code=400) from e
flask_jsonrpc.exceptions.ServerError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 5121, in enqueue_job
    scheduler_run()
  File "/usr/local/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 157, in wrapper
    raise ServerError(s.getvalue(), code=400) from e
flask_jsonrpc.exceptions.ServerError

I’m trying to decipher what might be going on and why CS jobs are not getting submitted so I can work with the HPC manager to get jobs working again.

Thank you,
Russell McFarland

Hi Russell,

Am I correct in assuming that you are trying to submit cryoSPARC jobs to a slurm cluster, as opposed to running a cryoSPARC instance as a slurm job?
A few questions and suggestions:

What was the cryoSPARC version prior to the update?
Were slurm submissions working prior to the update?
Did the update complete without error?
What was the job type?
If, at the time of the error, a slurm submission script queue_sub_script.sh has already been created inside the job directory, you may try (as the Linux user “owning” the cryoSPARC installation, on the cryoSPARC master node):
sbatch -vv /job/directory/queue_sub_script.sh
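
For example (the job directory path below is only a placeholder; substitute the actual project/job directory of the failed job):

cd /path/to/projects/P1/J123      # hypothetical job directory
sbatch -vv queue_sub_script.sh    # submit the script cryoSPARC generated
squeue -u $USER                   # confirm the job actually reached the slurm queue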

Thanks for the quick response! I was trying to do a little legwork before reaching out to the cluster admins.

Before updating, I believe we were running v3.3.1. The job I was trying to submit was a homogeneous refinement on the slurm cluster; I already have an instance of the master running.
Submissions and the submission script itself were working before the update, and I believe the update went through successfully. I have been able to submit a couple of 3DVA jobs since the update; I only mention the update because it was recent and is the last major change on the cluster that I know of.

Best,
Russell

How much DRAM is available on the cryoSPARC master node?
free -g
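
If memory looks tight there, it may also be worth checking whether the master has been running out of memory when it forks the sbatch submission process. These are general Linux checks rather than anything cryoSPARC-specific, so treat them as assumptions about a typical setup:

dmesg -T | grep -i -e "out of memory" -e oom   # recent OOM-killer activity on the master node
cat /proc/sys/vm/overcommit_memory             # overcommit policy; strict settings can make fork() fail with ENOMEM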


The master node is hosted on a VM with 16 GB of RAM.

I was in contact with the cluster admins earlier today about this problem, and apparently my lab group ran into a problem with our inode quota that caused the master node to get tied up. They restarted the cryosparcm service and jobs could be queued again.
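
In case it helps anyone who runs into the same thing, these are roughly the checks involved. The exact quota tooling depends on the filesystem, so treat the first command as an assumption and the path as a placeholder:

df -i /path/to/cryosparc/storage   # inode usage on the affected filesystem (hypothetical path)
cryosparcm restart                 # restart the cryoSPARC master services
cryosparcm status                  # confirm the services came back up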

Thanks for your help!
