Instance information:
Single workstation
Current cryoSPARC version: v4.4.1
$ uname -a && free -g
Linux nereus 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
               total        used        free      shared  buff/cache   available
Mem:              62           9          27           0          25          52
Swap:             31           0          31
Worker environment:
CRYOSPARC_PATH=/home/cryosparcuser/cryosparc/cryosparc_worker/bin
PYTHONPATH=/home/cryosparcuser/cryosparc/cryosparc_worker
NUMBA_CUDA_INCLUDE_PATH=/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=
PATH=/home/cryosparcuser/cryosparc/cryosparc_worker/bin:/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/condabin:/home/cryosparcuser/cryosparc/cryosparc_master/bin:/usr/local/IMOD/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/IMOD/pythonLink
libicudata.so.70 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.70
libicudata.so.70 (ELF) => /lib/i386-linux-gnu/libicudata.so.70
libicudata.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so
libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6) => /lib/i386-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /lib/i386-linux-gnu/libcuda.so
Linux nereus 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
               total        used        free      shared  buff/cache   available
Mem:              62           9          27           0          25          52
Swap:             31           0          31
Wed Apr 24 11:58:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:09:00.0  On |                  N/A |
|  0%   40C    P8              15W / 370W |    388MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1333      G   /usr/lib/xorg/Xorg                          333MiB |
|    0   N/A  N/A      2483      G   xfwm4                                         4MiB |
|    0   N/A  N/A      2714      G   ...irefox/4090/usr/lib/firefox/firefox       38MiB |
+---------------------------------------------------------------------------------------+
Issue:
One of our researchers will start a job, it will run for anywhere from 5 to 30 seconds, and then the system resets without warning. Nothing strange shows up in the logs – the screen just goes blank and the machine reboots to the BIOS.
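Since the usual logs come up empty after the reset, one capture I could run alongside a job is a GPU trace that is forced to disk on every sample, so the last reading before the reset survives. What follows is only a rough sketch of that idea. The function names, the `gpu_trace.csv` file name, and the half-second interval are my own placeholders, not anything from cryoSPARC, and it only looks at the first GPU that `nvidia-smi` reports.

```python
import os
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
QUERY_FIELDS = ["timestamp", "power.draw", "temperature.gpu",
                "utilization.gpu", "memory.used"]


def parse_sample(line):
    """Split one CSV line of nvidia-smi output into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(
            f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    return dict(zip(QUERY_FIELDS, values))


def read_gpu_sample():
    """Run nvidia-smi once and return the CSV line for the first GPU."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.splitlines()[0]


def log_until_reset(path="gpu_trace.csv", interval_s=0.5):
    """Append samples forever, syncing each one to disk before sleeping."""
    with open(path, "a") as f:
        while True:
            line = read_gpu_sample()
            parse_sample(line)      # raises if the output looks malformed
            f.write(line + "\n")
            f.flush()               # push Python's buffer to the OS
            os.fsync(f.fileno())    # ask the OS to write it to the disk
            time.sleep(interval_s)
```

The plan would be to call `log_until_reset()` in a separate terminal just before the researcher queues the job, then read the tail of `gpu_trace.csv` after the machine comes back up. A half-second poll may well miss a very short power spike, so a clean trace would not rule the GPU or power supply out.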
I’m happy to provide any logs that folks feel would be relevant. It just seems like such a strange problem that I’m hoping there’s some conventional wisdom about which piece of system hardware is the most likely culprit. We did already run a full multi-day memory test to confirm that the new RAM we installed is not the problem. Thank you in advance!