RAM management for local refinement jobs

Hi!

I have a question regarding how cryoSPARC manages RAM resources overall or specifically for Local Refinement jobs.

For now, I am still using v3.3.2+220518.
I have recently realised that when two Local Refinement jobs run at the same time (on different GPUs of the same worker), they proceed very, very slowly (more than an order of magnitude slower) compared to running them one at a time.
Even though Local Refinement uses quite a lot of CPU at certain points of the process, my assessment is that such a slowdown cannot be due to a CPU bottleneck. I therefore suspect RAM; however, with both jobs running I have never seen RAM usage go above 75% (for any job type, if I recall correctly). This holds whether swap is on or off.
I do not see cryoSPARC using more than 80% of RAM on our machines at any given point. If swap is available, much data is stored there as RAM starts to fill up, and in such a scenario I expect a job to slow down. With swap off, on the other hand, I would expect RAM to be used fully, up to the point where a job might fail from running out of memory. Instead, what seems to happen is that memory usage is kept capped (somewhere below 80%) and the jobs run slowly.

As I observe similar behaviour on more than one machine, it seems to me a cryoSPARC RAM-management issue that I do not understand. @team, can you shed some light? :slight_smile:

Or is it a known behaviour that cryoSPARC or Ubuntu tries to leave some memory free for other software? 20% of RAM never being used on beefy machines seems to me a huge waste of resources.

Have a great weekend!
André

The Linux kernel has a tunable parameter for how swap is utilized, called “swappiness”. If you set swappiness to zero, nothing will get written to swap unless you hit an out-of-memory condition. If you set it to 100, the system will basically act as if it prefers swap over RAM.

You can see your current swappiness value by running:

cat /proc/sys/vm/swappiness

I believe most systems, including Ubuntu, default to 60. Therefore, when RAM usage hits 40%, the kernel will start swapping. If you want it to start swapping at 90%, set this value to 10.

I can’t remember how Ubuntu sets up this value. Traditionally it lived in /etc/sysctl.conf, but on a lot of distros those files are now autogenerated by combining the system defaults with the contents of /etc/sysctl.d/*.conf, so if you want to make a change you might need to place a file in /etc/sysctl.d/, although this manpage doesn’t say anything about it:

https://manpages.ubuntu.com/manpages/bionic/man8/sysctl.8.html

Maybe you can just add or edit vm.swappiness=10 in your /etc/sysctl.conf as root? Then run sudo sysctl -p to apply it.
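
A minimal sketch of the above (the value 10 is just an example, the drop-in file name is an arbitrary choice, and the write commands need root, so they are shown as comments):

```shell
# Read the current swappiness value (often 60 by default)
cat /proc/sys/vm/swappiness

# Apply a new value for the running session (needs root):
#   sudo sysctl vm.swappiness=10
# Persist it across reboots with a drop-in file (name is arbitrary):
#   echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf
#   sudo sysctl --system   # reloads sysctl settings from all config files
```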

Anyway, I doubt that swapping is the issue. I don’t think the cryoSPARC processes are being swapped, because of how the kernel prioritizes what to swap. I wonder if you are seeing a slowdown because you are hitting the I/O limit of whatever drives you are reading from.

EDIT: It looks like I had slightly out-of-date information: swappiness now ranges from 0–200, with 100 assuming equal I/O cost for swapping vs RAM. Therefore, 60 would be slightly biased toward RAM.

1 Like

Hi @ccgauvin94

Your answer is very much appreciated! I am quite sure your instructions will help many.

However, I have to dismiss swappiness as the thing to fix. Early on, when setting up our workstations, I played with the swappiness value, as I suspected it would have an influence, and I found that swappiness values under 10 are beneficial for EM workstations running cryoSPARC.

As you later concluded, a swappiness value of 10 does not simply mean that swapping will start when RAM is at 90% capacity. In fact, the swappiness values set on our workstations are 5 or 8. I even lowered it to 2 recently, just to see if the behaviour I described in my post above would change. It did not.

I also tried to rule out I/O as the problem. It is not the cause, even when the system reads and writes to PCIe 4.0 NVMe drives. Besides, I/O would not explain why I constantly see a hard cap on RAM usage, right? It literally never moves past 80% :sweat_smile:

I started reading about zRAM as a different swap method, and I wonder if it could help. At the moment I do not have many opportunities to experiment with it. Does anyone have experience with zRAM?

@team, do you have anything to say on this topic overall?

Thanks once again @ccgauvin94

Cheers,
André

You can safely delete your swap partition entirely on most systems. The system should run fine except in the event that you do run out of memory, in which case you may wind up with an unresponsive system for a few minutes until the kernel OOM killer, or systemd-oomd (depending on what you’re using) kicks in. That should prevent the computer from doing any swapping at all.
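For what it’s worth, a sketch of checking and disabling swap (the read-only command below is safe to run; the actual disabling needs root, so those commands are shown as comments):

```shell
# List active swap areas (only the header line means swap is off);
# `swapon --show` gives the same information in a nicer table.
cat /proc/swaps

# Disable all swap for the running session (needs root):
#   sudo swapoff -a
# To make it permanent, also comment out or delete the swap line(s)
# in /etc/fstab so swap is not re-enabled at the next boot.
```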

Obviously, how cryoSPARC reserves memory could be an issue here, but have you checked whether you have any active cgroups that could be limiting memory usage? I think Ubuntu has an lscgroup command that should tell you which cgroups are active.
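
A read-only way to check whether a cgroup memory limit applies to your shell; the paths differ between cgroup v1 and v2, hence the fallbacks (“max”, or a very large number, means effectively unlimited):

```shell
# Which cgroup is this shell in?
cat /proc/self/cgroup

# cgroup v2: memory limit of that cgroup ("max" = unlimited)
cat "/sys/fs/cgroup$(cut -d: -f3 /proc/self/cgroup | head -n1)/memory.max" 2>/dev/null || true

# cgroup v1 fallback:
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || true

# lscgroup (from the cgroup-tools package) lists every active cgroup:
#   lscgroup | less
```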

1 Like

Indeed. I think I failed to make clear a very important detail here: even with swap off, the system never surpasses 80% RAM usage when running cryoSPARC.

Great tip! I had not come across cgroups and did not consider them. I will have a look. I have just listed them; now I need to work out how to dig into each of them (it is a long list) regarding memory usage.

Thank you @ccgauvin94 :slight_smile:

CryoSPARC does not dynamically constrain its RAM use based on the amount of RAM free or available on the worker once a job has been allocated to that worker.
It is possible that RAM constraints are imposed by the OS or cluster resource manager.

1 Like

Hi!

I keep seeing the behaviour I described, and while it happens with cryoSPARC, it does not happen with at least one other piece of software that performs large memory operations.

Has anyone detected the same?

I have cryoSPARC updated to the latest release, so this behaviour has not changed since I detected it in v3.3.2.

And I have realised that it is not only Local Refinement jobs… probably any job that requires a lot of RAM. I have detected it during Non-uniform Refinement jobs as well.

Cheers!
André

It might be related to the amount of data being cached in memory. By default, Linux uses up to 40% of system memory for file caching. Say one job uses 90% of that space; then, with two jobs active, around half of each particle set will have to be read from the SSD over and over. That could result in much longer wall times without there being a bottleneck on the SSDs.
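
You can watch how much memory the page cache occupies with standard tools, for example:

```shell
# The "buff/cache" column is memory used by the page cache;
# "available" estimates what could be reclaimed for new allocations.
free -h

# The same figures straight from the kernel:
grep -E '^(MemTotal|MemFree|MemAvailable|Cached):' /proc/meminfo
```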

Hi @DanielAsarnow

Thanks for your input!
I think I understand what you mean.
Do you know how to circumvent that? How to change that 40% limit, for instance?

Also, this would only be the case if two jobs are running, right?
I have to confirm, but I think the problem happens just the same with only one job running.

You can increase the % available for caching, or you can add more RAM.

Memory is much faster than SSDs, so if the particles fit, you see this caching benefit. (And in that case, caching on the SSD is almost irrelevant, because the files will be held in memory the whole time anyway.) It doesn’t matter whether multiple jobs are running or one; it depends only on the size of the data relative to the memory cache. (And on access patterns: if 80% of the data is in memory but for some reason we always choose to look at the 20% that is not, we get no benefit.)

As to your previous question about only “80%” of system memory being used: measuring memory usage on Linux is actually quite challenging. The operating system doesn’t waste time clearing memory unless it’s needed, and a lot of apparent memory use is double (or triple, or N-times) counted, because only one copy of a shared library is loaded for all of the programs that use it, yet it contributes to each of their memory totals. The “80%” you are seeing is likely much less in reality, which is why it isn’t increasing.
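That double counting of shared pages is visible directly in /proc: RSS charges every shared page fully to each process, while PSS (proportional set size) divides each shared page among the processes sharing it (a sketch; /proc/&lt;pid&gt;/smaps_rollup needs kernel 4.14 or newer):

```shell
# For the current shell: Pss <= Rss, because Pss splits shared pages
# among their users. Summing Rss over all processes over-counts;
# summing Pss does not.
grep -E '^(Rss|Pss):' /proc/self/smaps_rollup
```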

Thanks @DanielAsarnow

I get this is a complex topic, but I wish to learn more.
How do I increase the percentage available for caching? I am at the maximum RAM the current motherboard supports: 256 GB. If 20% of RAM is never used, that seems to me a waste of 50 GB.

I’m not sure why I thought it was only 40%. I think all free memory can be used for the buffer/page cache, and this memory is still listed as “free.” Thus your unused memory is likely actually in use by the page cache. My other point, that when it says 80% used it is actually probably much less, still stands.

I was confused about the 40%, thank you for looking more into it.

Well… it always feels so bad to know that I have 50 GB of ‘free memory’ (perhaps not so free, as you suggest) while everything slows down. When swap is on, I see extra data cached there, something between 30 and 60 GB. Sometimes that could perhaps fit in the free memory, yet Non-uniform Refinement jobs take 3 days to finish. Even though I have 3 × 3090 GPUs in the system, I end up using only one at this step of the workflow because of this RAM issue. If I have other jobs running in parallel (2D classification or 3D refinements), then finishing any of those jobs takes a week.

I thought you had already eliminated swap as an issue by unmounting the swap partition?

Yes, I did. I have been keeping swap off, and still no more than 80% of RAM is used.

Then I think the memory is a non sequitur. You’re not going into swap (or else turning it off would result in OOM errors and the processes would terminate), and the amount of memory used is less than 80%. You have more memory than you need, and something else is limiting the speed of your jobs.