I recently started using CryoSPARC on a new machine, and it has repeatedly crashed the system. The system becomes completely unresponsive and cannot be reached via SSH or the web GUI.
It seems to happen during 2D classification jobs (usually with large numbers of classes and particles). On a previous machine, 2D classification jobs would fail if the number of classes was too much for the GPU’s memory, but they never crashed the entire system.
Has anyone else experienced hard crashes? Is there any way to make sure that jobs won’t crash the system (even if the job itself fails instead)?
Yes, I am having that same issue: a new Linux box and a hard crash of the entire machine while I was doing blob picking (so not using either the CPU or the GPU at any appreciable level).
I managed to get through the motion correction and CTF estimation steps without issue.
Unfortunately, I haven’t had any response so far and I have yet to figure out a solution.
Best regards, Tom
If you notice any patterns, let me know. I haven’t detected any consistent cause yet, and there isn’t a single job that I can’t do.
My best guess so far is that it’s a glitch whereby the job should fail due to some resource limit (or perhaps someone else SSHing into the box and using some RAM, etc.), but there isn’t an adequate safeguard to prevent it confusing the kernel.
I have a few CUDA synchronisation-type errors too, so it might not even be the same problem.
As my new box is at home, I’m pretty sure nobody else is trying to use it at the same time.
I found it strange, as the CPU and GPU were barely being used during blob picking when my jobs consistently failed. They never failed on the same image/movie, though, so I’m pretty sure it wasn’t a ‘bad data’ issue.
Still trying to figure out any pattern that might be present…
Best regards, Tom
Maybe a similar issue here: one of our researchers starts running a job, and after somewhere between 3 and 30 seconds, Ubuntu just hard resets. No warning, straight back to BIOS, nothing notable in the logs. Would love to know if someone else is still experiencing this or if there’s a decent troubleshooting route to look into.
Some more details would help: full system resets like this can be hard to pin down. I’ve had it happen myself (although not with CryoSPARC) where compiling across all threads would make the system hard reboot. It turned out to be the motherboard overcurrent protection kicking in, which is triggered at the hardware level, so the system went down before the OS could even log it. So information on the mobo, CPU, PSU and the circuit the system is connected to would be useful for troubleshooting.
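Since a hardware-level trip leaves nothing in the OS logs, one option is to record GPU readings yourself and force each sample to disk, so the last line written before the reset survives. Below is a minimal sketch of that idea, assuming an NVIDIA card with `nvidia-smi` on the PATH. The file name `gpu_trace.csv`, the 0.5 s interval and the function names are my own placeholders, not anything CryoSPARC provides.

```python
import os
import subprocess
import time

# Standard nvidia-smi --query-gpu properties; adjust to taste.
FIELDS = ["timestamp", "power.draw", "temperature.gpu",
          "utilization.gpu", "memory.used"]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_gpu_sample():
    """Query GPU 0 once and return the raw CSV line (no header, no units)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits",
         "--id=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def log_forever(path="gpu_trace.csv", interval_s=0.5):
    """Append one sample per interval, syncing to disk after every write."""
    with open(path, "a") as fh:
        while True:
            fh.write(read_gpu_sample() + "\n")
            fh.flush()
            # fsync so the sample is on disk even if the machine resets next.
            os.fsync(fh.fileno())
            time.sleep(interval_s)
```

Call `log_forever()` in a separate terminal before launching the job, then look at the last few lines of the file after the reboot. If power draw or temperature climbs right before the trace stops, that points towards power delivery or cooling rather than software. A fast spike can still fall between samples, so a clean trace does not rule hardware out.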
If you’re still interested, here are the machine specs as far as I know:
Custom-built gaming PC/workstation
ASRock B550M/ac motherboard
64GB DDR4 RAM
AMD Ryzen 7 3700X CPU
Nvidia GeForce RTX 3080 GPU
I’m not sure what power supply it came with at the moment, I suppose that would be worth checking. I have a feeling they put something decently beefy in it, and 800W or more should be fine for this hardware, but maybe they went small or cheap. It’s plugged into a reliable 120V circuit, as far as I know.
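For what it’s worth, here is the rough arithmetic behind that 800W guess. The CPU and GPU numbers are the published ratings (65 W TDP for the 3700X, 320 W for the 10 GB RTX 3080, assuming that is the model fitted). The 75 W allowance for everything else and the 2x transient multiplier are purely my assumptions for illustration, not measured values.

```python
CPU_TDP_W = 65          # AMD Ryzen 7 3700X rated TDP
GPU_POWER_W = 320       # RTX 3080 (10 GB) rated graphics card power
OTHER_W = 75            # ASSUMED allowance: motherboard, RAM, drives, fans
TRANSIENT_FACTOR = 2.0  # ASSUMED short GPU spike multiplier, illustrative only


def headroom(psu_watts):
    """Return sustained and spike load estimates plus the margin left on the PSU."""
    sustained = CPU_TDP_W + GPU_POWER_W + OTHER_W
    spike = CPU_TDP_W + GPU_POWER_W * TRANSIENT_FACTOR + OTHER_W
    return {
        "sustained_w": sustained,
        "sustained_margin_w": psu_watts - sustained,
        "spike_w": spike,
        "spike_margin_w": psu_watts - spike,
    }


for psu in (650, 800):
    print(psu, headroom(psu))
```

Under those assumptions an 800W unit has plenty of sustained headroom and only a small margin on a brief spike, while a 650W unit would go negative on the spike estimate. That is why the actual PSU model is worth confirming.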
Thanks. Nothing raises any red flags, provided the PSU is up to spec, which it probably will be.
It would still be worth checking in the UEFI whether any overcurrent protection (OCP) trip values are set too low (not sure where ASRock keep them, sorry), since instant hard reboots without OS logs are, as I said, a symptom of that.