Workstation crashing during 2D class

Hi

I recently started using CryoSPARC on a new machine, and it has repeatedly crashed the system. The system becomes completely unresponsive and cannot be reached via SSH or the web GUI.
It seems to happen during 2D class jobs (usually with large numbers of classes and particles)… On a previous machine, 2D class jobs would fail if the number of classes was too much for the GPU’s memory, but they never crashed the entire system.

Has anyone else experienced hard crashes? Is there any way to make sure that jobs won’t crash the system (even if they fail instead)?

Thanks

Matthew.

Yes, I am having the same issue: a new Linux box and a hard crash of the entire machine while I was doing blob picking (so not using either the CPU or the GPU at any appreciable level).
I managed to get through the motion correction and CTF estimation steps without issue.
Unfortunately, I haven’t had any response so far and I have yet to figure out a solution.
Best regards, Tom

If you notice any patterns, let me know. I haven’t detected any consistent cause yet, and there isn’t a single job that I can’t run.
My best guess so far is that it’s a glitch whereby the job should fail due to some resource limit (or perhaps someone else SSHing into the box and using some RAM, etc.), but there isn’t an adequate safeguard to prevent it from confusing the kernel.
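
One way to test the resource-limit guess is to record how much RAM is actually available while a job runs. Below is a minimal Python sketch, assuming a Linux machine that exposes /proc/meminfo; the function names are made up for illustration and are not part of CryoSPARC.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total RAM that the kernel reports as available."""
    return info["MemAvailable"] / info["MemTotal"]


def read_available_fraction(path="/proc/meminfo"):
    """Read the live value from the running system (Linux only)."""
    with open(path) as f:
        return available_fraction(parse_meminfo(f.read()))
```

Calling read_available_fraction() every few seconds during a job and writing the result to a file would show whether available memory drops sharply before a crash. It does not prove RAM is the cause, only whether it can be ruled out.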

I have a few CUDA synchronisation-type errors too, so it might not even be the same problem.

As my new box is at home, I’m pretty sure nobody else is trying to use it at the same time.
I found it strange, as the CPU and GPU were barely being used during the blob picking process when my jobs consistently failed. But they never failed on the same image/movie, so I’m pretty sure it wasn’t a ‘bad data’ issue.
Still trying to figure out any pattern that might be present…
Best regards, Tom

Maybe a similar issue here: one of our researchers starts running a job, and after 3 to 30 seconds, Ubuntu just hard resets: no warning, straight back to BIOS, nothing notable in the logs. Would love to know if someone else is still experiencing this or if there’s a decent troubleshooting route to look into.

Some more details would help: full system resets like this can be hard to pin down. I’ve had it happen myself (although not with CryoSPARC), where compiling across all threads would make the system hard reboot - it turned out to be the motherboard overcurrent protection kicking in, which is triggered at the hardware level, so the system went down before the OS could even log it. So information on the motherboard, CPU, PSU, and the electrical circuit the system is connected to would be useful for troubleshooting.
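
Since a hardware-level reset leaves nothing in the OS logs, it can help to keep your own trace of GPU power draw and temperature, forced to disk after every sample, so the last readings before the reset have a chance of surviving. Here is a rough Python sketch, assuming NVIDIA GPUs with nvidia-smi on the PATH; the function names and the log file name are hypothetical, and fsync reduces but does not eliminate the chance of losing the final lines.

```python
import os
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu`.
QUERY = "timestamp,index,power.draw,temperature.gpu,memory.used"


def sample_gpus():
    """Return one CSV line per GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def log_samples(path, sampler, count, interval=1.0):
    """Append `count` batches of samples to `path`, syncing after each batch.

    `sampler` is any callable returning a list of text lines, so the
    logger can be exercised without a GPU present.
    """
    with open(path, "a") as f:
        for i in range(count):
            for line in sampler():
                f.write(line + "\n")
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to write it to disk now
            if i + 1 < count:
                time.sleep(interval)
```

Running log_samples("gpu_trace.csv", sample_gpus, count=3600) in a second terminal before starting the job would give about an hour of one-second samples. If the last lines before a reset show power draw spiking, that points towards the PSU or overcurrent protection rather than CryoSPARC itself.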