AVX-512 Code Execution Issues, Kernel Panics

jr10 · May 24, 2021, 7:32pm

Hello,

We recently got a new workstation from a vendor, and after extensive testing we have found an issue similar to the one in the forum thread "Kernel panic - linux box with 2 GPUS".

Our Linux machine (CentOS 7) runs a 10980XE processor on an ASUS X299 SAGE 10GbE board with 128 GB RAM and two RTX 3090 cards. The system issues appear only in specific cryoSPARC and Relion jobs (2D and 3D classification and refinement), and only when running a single job on both GPUs at once or running one or more of the offending job types on each GPU. The jobs run for 3-4 iterations before crashing the entire system and forcing a reboot. No logs (cryoSPARC or syslog) show any errors beyond the normal errors for watchdog processes or jobs terminating abnormally.

Our current theory is that BIOS issues are causing instability when running AVX-512 code and managing jobs over 2 GPUs, leading to a shutdown by the CPU (possibly due to voltage; heat shouldn’t be an issue, as the CPU is water cooled). After changing the BIOS settings to reduce overclocking and use Intel stock core settings, the issues are reduced (2D jobs run to 15-20 iterations) but not entirely removed.

What jobs in cryoSPARC use AVX-512 code? For example, we are able to use 2 GPUs for particle extraction and CTF estimation jobs, but not for 2D classification jobs. Additionally, if anyone has worked through these issues on a workstation before and has helpful BIOS settings or recommended AVX-512 benchmarking programs, that would be incredibly helpful. Thank you for any help!
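For anyone wanting to confirm which AVX-512 extensions the kernel reports for the CPU, here is a minimal Python sketch. It assumes a Linux system that exposes `/proc/cpuinfo`, and the function name `avx512_flags` is made up for illustration. It only reports CPU capability; it says nothing about which cryoSPARC or Relion jobs actually execute AVX-512 instructions.

```python
from pathlib import Path


def avx512_flags(cpuinfo_text):
    """Return the sorted avx512* feature flags found in /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux the feature list sits on lines that start with "flags".
        if line.startswith("flags"):
            _, _, values = line.partition(":")
            flags.update(f for f in values.split() if f.startswith("avx512"))
    return sorted(flags)


if __name__ == "__main__":
    # Assumption: Linux with /proc mounted; elsewhere this prints nothing.
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        found = avx512_flags(cpuinfo.read_text())
        print("AVX-512 flags:", ", ".join(found) if found else "none reported")
```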

Best,
Justas

olibclarke · May 25, 2021, 2:36am

We have a very similarly configured system (CentOS 7, 2x3090, 256GB RAM) with very similar symptoms. Single refinements run fine, but running one refinement on each GPU often leads to shutdown/reboot (GPU temps look fine in our case also). Following in case anyone has useful suggestions!

alburse · May 25, 2021, 7:11am

It sounds like a power supply issue. NVIDIA GPUs are notorious for pulling their maximum rated power or even more.

Your CPU can draw close to 350 W and each of your GPUs can draw upwards of 400 W. You should have a power supply of at least 1500 W, or even more, to adequately power your PC. Do you have a 1200 W power supply?

The Relion jobs you listed are the most GPU-heavy jobs, so you would see the maximum load on the GPUs and even the CPU. The power draw is probably peaking at some point and shutting down your PC.
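The arithmetic behind that estimate can be sketched in Python. The 350 W and 400 W figures are the ones quoted above. The 30% headroom factor and the function name `recommended_psu_watts` are assumptions for illustration, not a vendor recommendation, and the sum ignores the motherboard, drives, and fans.

```python
def recommended_psu_watts(cpu_watts, gpu_watts, gpu_count, headroom=1.3):
    """Return (peak component draw, peak draw scaled by a headroom factor)."""
    # Only CPU and GPUs are counted; other components are left out here.
    peak_draw = cpu_watts + gpu_watts * gpu_count
    return peak_draw, peak_draw * headroom


if __name__ == "__main__":
    peak, suggested = recommended_psu_watts(cpu_watts=350, gpu_watts=400, gpu_count=2)
    print(f"Estimated peak draw: {peak} W")
    print(f"Suggested PSU with assumed 30% headroom: {suggested:.0f} W")
```

With those figures the CPU and two GPUs add up to 1150 W, and the assumed headroom brings the total to roughly 1500 W.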

https://www.notebookcheck.net/The-NVIDIA-GeForce-RTX-3090-may-have-a-350-W-TDP-but-it-can-consume-nearly-60-more.494757.0.html#:~:text=The%20NVIDIA%20GeForce%20RTX%203090%20already%20has%20an%20insanely%20high,all%20senses%20of%20the%20word.

Geoffrey · May 25, 2021, 12:32pm

As I posted in "Kernel panic - linux box with 2 GPUS", I just did the steps in that thread and the problem was solved. To this day there has never been an issue.

I looked through my old emails and didn’t find anything extra to help… only a brief correspondence with Vitor Balasco Serrao from Jeff Lee’s lab who had the same exact computer as us and experienced the issue before we did, and solved it in the same way.

olibclarke · May 25, 2021, 1:11pm

In our case I am no longer sure that it is the same issue, though the symptoms are similar: we have a Threadripper CPU, not Intel.

jr10 · May 25, 2021, 4:11pm

Hi @alburse,

The system we have has a 2000W power supply, so I don’t believe that is the issue unless the PSU is bad. That said, we did run burn-in tests using mprime and an NVIDIA test, and both ran very well. I guess I could try running both together, but we also see the crashes under minimal load (one 2D job run across both GPUs in parallel, drawing ~100W per GPU according to nvidia-smi).
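Since the crashes leave nothing in the logs, one option is to record the nvidia-smi readings to disk continuously so the last samples before a reboot survive. The Python sketch below is one way to do that, not a tested diagnostic. It assumes `nvidia-smi` is on the PATH and that the card reports numeric values for `power.draw` and `temperature.gpu`. The function names and the log file name are hypothetical.

```python
import os
import subprocess
import time

# Query fields and format flags are standard nvidia-smi options.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,index,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(csv_text):
    """Parse nvidia-smi CSV output into (timestamp, gpu_index, watts, temp_c) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        timestamp, index, watts, temp = [field.strip() for field in line.split(",")]
        # Assumption: values are numeric; unsupported fields would raise here.
        rows.append((timestamp, int(index), float(watts), int(temp)))
    return rows


def log_forever(path="gpu_power.log", interval_s=1.0):
    """Append one line per GPU every interval_s seconds, syncing to disk each time."""
    with open(path, "a") as log:
        while True:
            result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
            for row in parse_sample(result.stdout):
                log.write(",".join(map(str, row)) + "\n")
            # Flush and fsync so the latest samples survive a hard crash.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` in a separate terminal before starting the jobs would leave a per-second record of power and temperature up to the moment of the crash.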

Best,
Justas

alburse · May 25, 2021, 4:32pm

2000 W should be good. @Geoffrey’s solution with the BIOS seems promising. I hope you solve this quickly.

priiteek · June 21, 2021, 6:22pm

We are also having similar issues on our workstation: CentOS 8, AMD Ryzen Threadripper 3970X, ASRock TRX40 board, 2x RTX 3090. It crashes quite reproducibly while running Relion Refine3D on both GPUs, but sometimes also when running two cryoSPARC refinements simultaneously.

@olibclarke, have you found a cure to this issue?

jr10 · June 21, 2021, 6:39pm

Hi @priiteek, a quick question: did you buy the workstation prebuilt, and if so, which builder did you use? @olibclarke and I have found that we both used the same builder, Exxact, and are still not sure whether that has something to do with it.

Best,
Justas

olibclarke · June 21, 2021, 6:52pm

In our case it seems the issue turned out to be GPU card placement: the cards were restricting airflow to the motherboard. After moving the cards down from slots 1&3 to slots 2&4, we have yet to have another crash.

priiteek · June 21, 2021, 7:10pm

Hi @jr10,
We also have an Exxact machine. They took our system in as RMA and replaced some parts, but we are still having issues. Before that, general stress tests were also able to crash the system. After getting the system back that’s no longer the case, but Relion still crashes it. I wonder if it’s some kind of a system configuration incompatibility. It also seems that the physical case might not be rigid enough and things can bend and scratch during transportation, which can cause bad connections.

@olibclarke, that sounds interesting. I will try that out, since another round of shuffling the GPUs is in order anyway.

Best,
Priit

jr10 · May 3, 2022, 5:13pm

Hi @priiteek and others,

I wanted to update this thread as during our troubleshooting we noticed that this error, while rare, was a persistent issue for other users. From our testing, the multi-GPU execution issue seemed to be related to the use of CentOS 7 (kernel 3.10), and switching to Ubuntu 22.04 LTS was a nearly instant fix.
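A quick way to check whether a machine is still on the 3.10 kernel series is sketched below in Python. The helper name `kernel_major_minor` is hypothetical, and the check only identifies the kernel series; it does not test for the crash itself.

```python
import platform
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a kernel release string, or None if unparsable."""
    match = re.match(r"(\d+)\.(\d+)", release)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    # platform.release() returns the running kernel's release string on Linux.
    release = platform.release()
    if kernel_major_minor(release) == (3, 10):
        print(f"Kernel {release}: 3.10 series, the one CentOS 7 ships")
    else:
        print(f"Kernel {release}: not the 3.10 series")
```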

As Exxact ships CentOS 7 as their default OS, this is something to be aware of for any Exxact workstation users. We have been in contact with them regarding this issue and have directed them to cryoSPARC’s notice that CentOS 7 and kernel 3.10 cause issues with the software. Hopefully, this will not be such a difficult issue for others to track down.

Best,
Justas