This is very useful, thanks @hsnyder!
@hsnyder: Many thanks for this insight and your recommendation! This sounds very helpful! I will give it a try as soon as I update to v4.6.x again! (At the moment I want to calm the situation down first.)
Best regards,
Dirk
Hi everyone,
CryoSPARC v4.6.1, released today, contains a change which we believe will fix the non-transparent-hugepage-related stalls on cluster nodes. We were not able to reproduce the problem ourselves so we cannot be 100% certain, but with the help of forum users we discovered a possible stall scenario and fixed it. We would greatly appreciate it if anyone previously experiencing this issue could update to v4.6.1 and confirm that the problem is resolved.
v4.6.1 also reconfigures Python’s numerical library (numpy) to not request huge pages from the operating system. We have found that this change resolves stalls related to transparent huge pages and it is therefore no longer necessary to turn off THP at the system level (leaving the setting at the default “madvise” should no longer cause problems). In v4.6.1, jobs will also emit a warning if the OS is set to “always” enable THP. If you have already changed your OS configuration to disable THP, it is possible (though not necessary) to revert the OS configuration change after upgrade to v4.6.1.
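For anyone who wants to verify their system's THP mode after upgrading, a quick check (assuming a typical modern Linux kernel with the standard sysfs path) is:

```shell
# Show the current system-wide THP mode; the bracketed value is active,
# e.g. "always [madvise] never" means madvise is in effect.
# (Guarded so it is a no-op on systems without the sysfs entry.)
thp=/sys/kernel/mm/transparent_hugepage/enabled
[ -r "$thp" ] && cat "$thp"

# To revert a previous "never" setting to the usual kernel default
# (madvise), run as root:
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```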
–Harris
Hi @hsnyder ,
Today I have upgraded from 4.5.3 to 4.6.1.
Our cluster nodes have the following configuration:
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
The job does emit the warning:
[CPU: 254.7 MB Avail: 297.94 GB]
Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs.
Is this just a warning and nothing to be worried about, since numpy is no longer requesting huge pages from the operating system? I don't think I can ask for this to be changed cluster-wide, so we may have to live with it. Would be great if you could clarify.
Good question, my previous message probably could have been clearer about this. Numpy by default will request THPs from the OS using the madvise system call. That's what the madvise setting is about. The change we made in v4.6.1 is to prevent numpy from making that request. That change will only have an effect if the system-wide setting is madvise. If it's always, then the kernel will always try to use THPs, whether an application requests them or not, and likewise never means the OS will never try to use THPs. There's nothing we can really do about a system that is set to always use THP, which is why we issue the warning. That said, some users don't experience these problems; it possibly depends on the Linux kernel version. The warning is just to bring to your attention the fact that CryoSPARC itself can't do anything about the system trying to use THP, and if you experience jobs stalling or becoming egregiously slow, turning that system-wide setting off could be indicated.
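If you want to see whether the kernel actually backed a given process with transparent huge pages, you can inspect its smaps_rollup (available on kernels 4.14 and newer); a nonzero AnonHugePages value means THP is in use for that process. This is just a diagnostic sketch, not something CryoSPARC does itself:

```shell
# Using the current shell as a stand-in; substitute the PID of a
# running CryoSPARC job process to check a real worker.
pid=$$
# AnonHugePages is reported in kB; 0 kB means no THPs were granted.
[ -r /proc/$pid/smaps_rollup ] && grep AnonHugePages /proc/$pid/smaps_rollup
```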
Harris
Thanks for the clarification @hsnyder
would adding the following to config.sh in worker help ?
export NUMPY_MADVISE_HUGEPAGE=0
No need; that's exactly what we do in v4.6.1. We don't do it via config.sh, but it's exactly the same mechanism, and it works the way I described previously.
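For illustration only (CryoSPARC v4.6.1 already sets this internally): the variable takes effect per-process and must be present in the environment before numpy is imported, e.g.:

```shell
# Launch a Python process with the variable set. If numpy is imported
# in this process, it reads NUMPY_MADVISE_HUGEPAGE at import time and
# skips the madvise(MADV_HUGEPAGE) request when the value is 0.
# Here we just confirm the variable is visible to the child process.
NUMPY_MADVISE_HUGEPAGE=0 python3 -c 'import os; print(os.environ["NUMPY_MADVISE_HUGEPAGE"])'
```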
Harris
Today I updated CryoSPARC from v4.6.0 to v4.6.1 on my workstation. Now I am running Local Refinement and NU Refinement, and in both cases I got these two warnings, which I have never seen before:
- Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs
- WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade
Should I worry about these warnings? Thanks!
Hi @donghuachen,
You don't need to worry about them, but they do indicate potential problems. It seems you have transparent huge pages set to [always]. This is only a problem if it results in job stalls on your particular system, so it may or may not be something you should change. Also, CryoSPARC disabled io_uring due to lack of kernel support, which suggests you might be using a very old Linux distribution, like CentOS 7? I recommend upgrading for many reasons, but this is just a performance issue, not a correctness problem.
Harris
Hi @hsnyder ,
Thanks for your reply.
I just checked my Linux version and found the following:
cat /etc/centos-release
CentOS Stream release 8