Panic Dump and Reboot on GPU Servers

Our CryoSparc GPU compute nodes are panic dumping and rebooting mid-job.

Environment: HPC cluster running Slurm 20.11.7. The compute nodes are CentOS 8 VMs on VMware ESXi 7. There are 11 CryoSparc GPU compute nodes, each with 4 GPUs assigned; a node has either 4x Titan Xp or 4x RTX 2080 (never mixed on one node). We have been running CryoSparc through OS, CUDA, and CryoSparc upgrades for quite a while. During a July maintenance window we applied the latest CentOS 8 patches and upgraded CUDA to 11.4. Since then, some jobs panic the compute VM and reboot it. Not all jobs trigger this; many still run to completion.
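Before changing more versions, it may help to capture what the kernel actually says when it panics. A minimal sketch, assuming persistent journald logs and a default kdump setup on CentOS 8 (adjust `CRASH_DIR` if your `/etc/kdump.conf` points elsewhere):

```shell
# Hedged sketch: first places to look for panic evidence on CentOS 8.
CRASH_DIR=/var/crash                                 # default kdump vmcore location
command -v journalctl >/dev/null 2>&1 \
  && journalctl -k -b -1 --no-pager | tail -n 50 || true   # kernel log, previous boot
[ -d "$CRASH_DIR" ] && ls -lt "$CRASH_DIR" | head || true  # newest vmcore first
```

If no vmcore is being written, enabling kdump before the next crash would tell you whether the panic originates in the NVIDIA driver or elsewhere.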

We’ve tried:

  1. Updated just the NVIDIA driver, leaving CUDA 11.4 in place. Tried 2 different driver versions without success.
  2. Upgraded to CUDA 11.4.1 when it was released. No success.
  3. Upgraded CryoSparc 3.1.0 to 3.2.0. No success.
  4. Downgraded to CUDA 11.3. No success.
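After that many upgrade/downgrade cycles, it is worth confirming that the driver and toolkit on each node actually match. A minimal check, assuming `nvidia-smi` and `nvcc` are on the PATH (run on every compute node, e.g. via pdsh); the 470.42.01 minimum driver for CUDA 11.4 is from NVIDIA's Linux release notes:

```shell
# Hedged sketch: confirm the driver/toolkit pairing each node ended up with.
# A stale 460-series driver left behind by a partial update would explain
# instability under CUDA 11.4, which needs >= 470.42.01 on Linux.
MIN_DRIVER_114="470.42.01"                           # minimum Linux driver for CUDA 11.4 GA
command -v nvidia-smi >/dev/null 2>&1 \
  && nvidia-smi --query-gpu=name,driver_version --format=csv || true
command -v nvcc >/dev/null 2>&1 && nvcc --version | grep release || true
```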

The researchers say they are mainly running Blob Picker and Heterogeneous Classification when this occurs. They have also noticed that jobs with memory-intensive startup phases, such as heterogeneous refinement, are the ones failing. One researcher commented: “If a heterogeneous refinement job is already running and I start another one, it keeps going. But if the node is idle and I start a job, there is a higher chance it fails.”
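The idle-versus-busy observation could be probed with GPU persistence mode, which keeps the driver initialized while the node is idle instead of tearing it down between jobs. A hedged sketch; treat this as a hypothesis to test, not a diagnosis:

```shell
# Hedged sketch: check whether persistence mode is currently enabled per GPU.
PM_STATE_QUERY="persistence_mode"   # nvidia-smi query field for the current state
command -v nvidia-smi >/dev/null 2>&1 \
  && nvidia-smi --query-gpu="$PM_STATE_QUERY" --format=csv,noheader || true
# To enable on all GPUs (as root): nvidia-smi -pm 1
```

If the failures stop once persistence mode is on, that narrows the problem to driver init on an idle node rather than CryoSparc itself.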

I know that CUDA 11.2 is the supported version. Have you considered downgrading to it? That helped me before…
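If you go that route, each CryoSparc worker has to be re-pointed at the 11.2 toolkit. A sketch, with assumed install paths (the toolkit prefix and worker directory below are illustrative; `cryosparcw newcuda` is the documented way to switch toolkits in CryoSparc 3.x, run as the service user):

```shell
# Hedged sketch: re-point a CryoSparc worker at a downgraded CUDA toolkit.
CUDA_112=/usr/local/cuda-11.2            # assumed toolkit install prefix
WORKER_DIR="$HOME/cryosparc_worker"      # assumed worker install path
[ -x "$WORKER_DIR/bin/cryosparcw" ] \
  && "$WORKER_DIR/bin/cryosparcw" newcuda "$CUDA_112" || true
```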
