One of our datasets, which has 2 million particles and a fairly large box size of 350 Å, runs into problems in 2D classification. The jobs run on a standalone workstation. At random points, the 2D jobs report “Job is unresponsive - no heartbeat received in 30 seconds” and the whole workstation then goes through a reboot. Is this often caused by insufficient RAM? Or could it be due to something odd in the data? Is there a RAM limit for 2D jobs? I'd appreciate any suggestions. Thanks!


@zjr, does your issue look similar to this:

Kernel panic - linux box with 2 GPUS

Hi @apunjani, my issue seems milder than that: the computer doesn't reboot that frequently, and only certain 2D jobs crash. Are you saying this could be related to the workstation's hard drive?


@zjr,

Generally speaking, cryoSPARC should never cause the machine to reboot itself unless there is an underlying hardware issue as in the other thread.
What can happen, depending on how much CPU RAM you have, is that the job uses so much CPU RAM that the system starts to swap (i.e., temporarily moves RAM contents to much slower disk). This can make the system appear to hang (but not reboot), and that often causes the “job unresponsive” problem. How much CPU RAM do you have?
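One way to check whether a machine is actually swapping while a job runs is to read the kernel's memory counters. The sketch below is Linux-specific and assumes the standard `/proc/meminfo` file, whose `MemAvailable`, `SwapTotal` and `SwapFree` fields are reported in kB; it is a quick diagnostic, not a cryoSPARC tool.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            # First token is the number; the optional second token is the "kB" unit.
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(info: dict[str, int]) -> int:
    """Swap currently in use, in kB (0 if the system has no swap)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    # 1 GiB = 2**20 kB, since /proc/meminfo "kB" are actually KiB.
    print(f"MemAvailable: {info.get('MemAvailable', 0) / 2**20:.1f} GiB")
    print(f"Swap in use:  {swap_used_kb(info) / 2**20:.1f} GiB")
```

Running it repeatedly while a 2D job is active (or simply watching `free -h`) shows whether swap usage climbs as available memory drops.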

@apunjani,

I have 81 logical CPUs and 256 GB of RAM, which should be more than enough. However, I guess some especially demanding jobs may still overwhelm the RAM or CPU, and then the computer crashes.
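A back-of-the-envelope check of how a 2-million-particle stack compares with 256 GB of RAM. This assumes the box is 350 pixels (the post gives 350 Å, so the real pixel count depends on the pixel size) and that images are stored as uncompressed float32. It computes the size of the full stack, which only matters if a job holds every particle in memory at once; whether cryoSPARC 2D classification does so is not stated in this thread.

```python
def stack_size_bytes(n_particles: int, box_px: int, bytes_per_pixel: int = 4) -> int:
    """Size of an uncompressed stack of square images (float32 = 4 bytes/pixel)."""
    return n_particles * box_px * box_px * bytes_per_pixel


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n_particles = 2_000_000  # from the original post
    box_px = 350             # ASSUMPTION: 350 px; the post gives 350 Angstrom
    ram_gib = 256            # RAM reported in the thread
    needed = to_gib(stack_size_bytes(n_particles, box_px))
    print(f"Full stack as float32: {needed:.0f} GiB vs {ram_gib} GiB of RAM")
```

Substituting the actual box size in pixels gives a quick sense of whether the full stack could fit in RAM at all.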


@zjr,

What GPUs are running in your system? Is it possible that the Power Supply Unit isn’t able to keep up with the load? How many watts is your power supply rated for?

@stephan,

We are running 1080Tis. Yes, I think you may be right as well: your suggestion reminds me that every time we get a crash, the cooling fans are running very hard. Sorry, I don't know the exact wattage of our power supply, but since everything runs smoothly most of the time, the power should be fine. I can look into it, though, if you think it's critical.


@zjr,

Yes, definitely. Each 1080Ti draws about 250-270 W, so running 4 of them at once will exceed the limit of most consumer PSUs. Together with your beefy CPU, that adds up to quite a lot of power. I think it's definitely worth looking into.
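To make the power estimate concrete, here is a rough budget. Only the per-GPU figure (up to 270 W) and the GPU count (4) come from the reply above; the CPU and "other components" wattages, the 1600 W PSU rating, and the 80% sustained-load margin are hypothetical placeholders to replace with the real numbers from the workstation's spec sheets.

```python
def total_draw_watts(n_gpus: int, gpu_watts: float, cpu_watts: float, other_watts: float) -> float:
    """Estimated peak draw of the whole system."""
    return n_gpus * gpu_watts + cpu_watts + other_watts


def within_budget(draw_watts: float, psu_rating_watts: float, max_load_fraction: float = 0.8) -> bool:
    """True if the draw stays under the chosen fraction of the PSU rating.

    The 0.8 default is a hypothetical safety margin, not a vendor specification.
    """
    return draw_watts <= psu_rating_watts * max_load_fraction


if __name__ == "__main__":
    draw = total_draw_watts(
        n_gpus=4,          # from the thread
        gpu_watts=270,     # upper end of the 250-270 W quoted above
        cpu_watts=300,     # HYPOTHETICAL: replace with the CPU(s)' rated TDP
        other_watts=100,   # HYPOTHETICAL: drives, fans, motherboard, RAM
    )
    psu = 1600             # HYPOTHETICAL PSU rating
    print(f"Estimated draw: {draw:.0f} W; within budget for a {psu} W PSU: {within_budget(draw, psu)}")
```

With these placeholder numbers the estimated draw exceeds 80% of the PSU rating, which would be consistent with crashes that appear only under full GPU load.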