Memory usage SLURM

Dear cryoSPARC team,

Yet another SLURM-related post from me. Sorry! :wink:

CryoSPARC seems to have predefined settings for different job types; e.g. “Homogeneous Refinement (NEW!)” always sets “{{ ram_gb }}” to 24 GB.
Depending on the job’s box size, pixel size and final resolution, this is not always enough, which becomes a problem when you have a SLURM setup like mine with cgroups enabled. cgroups acts as a resource “jail” where a job cannot go beyond its allocated resources. Super smart, and it makes all the resources on the processing nodes much more modular.
Thus, a job submitted via cryoSPARC that requires more RAM than was allocated upon submission will crash when it reaches the maximum allowed RAM usage.
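
For context, the value lands in the cluster submission script roughly like this (an illustrative excerpt in the spirit of the stock cluster_script.sh template; the exact lines will differ per install):

#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ ram_gb|int }}G
{{ run_cmd }}

With cgroups enforcement, whatever ram_gb happens to be (24 GB for Homogeneous Refinement) becomes the hard ceiling for the job.
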
Therefore I have made some extra lanes which allocate an additional 8 or 16 GB of RAM to jobs.
It is a workable workaround, but not ideal, as users submit many jobs that crash due to lack of RAM.
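
For reference, such a lane can be as simple as a copy of the default lane with a fixed offset added to ram_gb in the submission script, e.g.:

#SBATCH --mem={{ (ram_gb + 16)|int }}G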

Could it be possible to make a more accurate and dynamic estimate of RAM usage?
I would guess so, since it should be possible to calculate the usage from the pixel size, box size and Nyquist frequency of each job.
It would be a big relief if one could submit cryoSPARC jobs where the allocated resources were calculated more precisely before submission.

//Jesper

+1 for this.

I ended up having to double the memory requested for jobs like Local Refinement to overcome this.
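
In practice that is just a small template tweak, something along these lines (a sketch, not the exact line from our script):

#SBATCH --mem={{ (ram_gb|int) * 2 }}G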

Hi @jelka,

Thanks for reporting this. This is on our radar, and we’ll hopefully be able to re-profile our jobs soon. For the time being, the best way to get around this is to allocate additional memory manually. Sorry for the inconvenience!

@stephan, do you have any more details or a time frame for this type of enhancement?

We have a user trying to do a large 3D helical refinement. The job type seems to default to a ram_gb of 48 GB, which we triple in our SLURM job submission template (to 144 GB), but the job ultimately used 383 GB before crashing on a numpy/FFTW memory allocation failure. We don’t really want to increase the SLURM memory multiplier for all jobs, as that doesn’t seem necessary, but being able to do so for some job types and not others would be nice.
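
If the cluster script could see the job type, per-type scaling would be straightforward. As a sketch, assuming a job_type template variable (which may or may not exist in your cryoSPARC version) and purely illustrative job type strings:

{%- if job_type in ["helix_refine", "flex_highres"] %}
#SBATCH --mem={{ (ram_gb|int) * 3 }}G
{%- else %}
#SBATCH --mem={{ ram_gb|int }}G
{%- endif %}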

Thanks,
-Andrew

I would also like to inquire about this. I have some particles that I would like to downsample, but it appears that their box size is too large and the downsampling job crashes once it runs out of memory (the default is 16 GB for this job type).

I would also like to see a more accurate memory calculation. I also want to say that large-box reconstructions require much more RAM than I would expect: a 640 px box Non-Uniform Refinement reconstruction needs at least 160 GB of RAM in my hands, whereas RELION does not require nearly that much for a reconstruction of the same particle set. 160 GB of RAM for one GPU is quite restrictive, since it leaves little room to use the other three GPUs on the node. It would be beneficial if jobs could use less RAM where possible.

This is still an issue in 4.1.1 with 3D Flex now.

I see cryoSPARC requesting 65 GB for a 3D Flex reconstruction job in the script, but the job got OOM-killed.

We ran it on a separate worker, and it seems to use around 2 GB more than it requests.

Hi,

Until particle caching to SSD is worked in, the memory template for flex jobs is very much a placeholder. I believe the developers have even said so.

FWIW, in our hands, memory requirements seem to adhere loosely to the following templates as a lower limit:

flex_train: (<size_of_training_data> * 1.6) + 20 GB
flex_highres: (<size_of_full_res_data> * 1.6)

Where dataset size (GB) = <num_particles> * <box_size>^2 * 4 / 1024^3
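
As a rough worked example of those templates (numbers purely illustrative: 300,000 particles, a 128 px training box and a 360 px full-resolution box):

awk 'BEGIN {
  n = 300000
  train = n * 128^2 * 4 / 1024^3      # ~18.3 GB of training data
  full  = n * 360^2 * 4 / 1024^3      # ~144.8 GB of full-res data
  printf "flex_train   >= %.0f GB\n", train * 1.6 + 20   # ~49 GB
  printf "flex_highres >= %.0f GB\n", full  * 1.6        # ~232 GB
}'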

I would recommend making use of the new custom script variable feature for ease of job submission.

Cheers,
Yang

Thank you for that! I just updated to 4.1.1 and had missed the custom script variables. Will give it a go.

Edit: worked great. Now even using it to select GPU types. Thanks @leetleyang!
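
For anyone else doing the same, GPU type selection works nicely with another custom variable, e.g. one named custom_gpu_type (the name is arbitrary) feeding SLURM’s gres type field:

{%- if custom_gpu_type %}
#SBATCH --gres=gpu:{{ custom_gpu_type }}:{{ num_gpu }}
{%- else %}
#SBATCH --gres=gpu:{{ num_gpu }}
{%- endif %}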

No problem. If you wish to build out a single, monolithic submission template, note that custom variables can be incorporated into existing if-else conditional branching structures as well.

For instance, in the arbitrary example below, if custom_mem is assigned a value, it supersedes the logic that follows.

...
{%- if custom_mem %}
{#- a manually supplied custom_mem (in GB) always wins -#}
#SBATCH --mem={{ custom_mem }}G
{%- else %}
{#- otherwise scale with GPU count; cast to int in case num_gpu renders as a string -#}
{%- if num_gpu|int == 0 %}
#SBATCH --mem=32G
{%- else %}
#SBATCH --mem={{ (num_gpu|int) * 64 }}G
{%- endif %}
{%- endif %}
...

Just ensure that the statements are free of indentation; they aren’t parsed correctly otherwise.

Cheers,
Yang
