In short:
Local refinement without “low-memory” mode gets killed for running out of system RAM (not GPU memory) on one particular dataset, and the kill shows up as a heartbeat failure. Turning on “low-memory” mode makes the jobs succeed.
Longer:
One of my users ran into an issue on CryoSPARC 4.4.0 (it might have existed before that version): running one particular input set through Local refinement with all default settings consistently failed at iteration 20 with a heartbeat failure.
This persisted after updating to 4.4.1 and setting “CRYOSPARC_HEARTBEAT_SECONDS=300”.
What makes it even weirder is that the job runs on a single workstation, so heartbeat failures are not really expected anyway.
Looking through the system logs I noticed:
Jan 12 04:14:16 [machine] kernel: Out of memory: Kill process 10047 (python) score 486 or sacrifice child
Jan 12 04:14:16 [machine] kernel: Killed process 10047 (python), UID 1000, total-vm:39146585928kB, anon-rss:55944324kB, file-rss:100832kB, shmem-rss:3068272kB
Jan 12 04:14:16 [machine] kernel: Cannot map memory with base addr 0x710730000000 and size of 0x15352 pages
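For anyone checking their own logs for the same thing, here is a minimal sketch that pulls the resident memory figures out of a "Killed process" line like the one above and converts them to GiB. The function name and the regular expression are my own and only cover this particular log format; other kernel versions may word the line differently.

```python
import re

# Matches the "Killed process" line written by the Linux OOM killer,
# in the format shown in the log excerpt above (assumption: other
# kernel versions may use a different layout).
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon>\d+)kB"
    r".*?file-rss:(?P<file>\d+)kB"
    r".*?shmem-rss:(?P<shmem>\d+)kB"
)

KIB_PER_GIB = 1024 ** 2  # the kernel reports these values in units of 1024 bytes


def parse_oom_kill(line):
    """Return pid, name and resident memory in GiB, or None if the line does not match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    anon = int(m["anon"])
    total = anon + int(m["file"]) + int(m["shmem"])
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "anon_gib": anon / KIB_PER_GIB,
        "rss_gib": total / KIB_PER_GIB,
    }


line = (
    "Jan 12 04:14:16 [machine] kernel: Killed process 10047 (python), UID 1000, "
    "total-vm:39146585928kB, anon-rss:55944324kB, file-rss:100832kB, shmem-rss:3068272kB"
)
info = parse_oom_kill(line)
print(f"pid {info['pid']} ({info['name']}): "
      f"anon {info['anon_gib']:.1f} GiB, total resident {info['rss_gib']:.1f} GiB")
```

For the line above this works out to roughly 53 GiB of anonymous memory and roughly 56 GiB resident in total, on a machine that reports 62 GiB of RAM.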
As a final resort I asked them to run it with “low-memory” mode turned on, which surprisingly solved the issue (surprising because the hover-over text only mentions GPU memory). This topic is mainly meant to make you aware that this behavior exists and to provide a possible solution for anyone who runs into similar issues.
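To compare host RAM use with and without “low-memory” mode, one option is to read the peak resident set size of the job's python process from /proc while it runs. This is only a sketch under a few assumptions: it is Linux-only, the helper names are mine, and the pid has to be looked up by hand (for example from the nvidia-smi process list). The sample text is made up for illustration and is not taken from the failing job.

```python
import re


def parse_status_kib(status_text, field):
    """Return the value (in kiB) of a field such as VmRSS or VmHWM, or None if absent."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB$"
    m = re.search(pattern, status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def peak_rss_gib(pid):
    """Read /proc/<pid>/status and return the peak resident set size (VmHWM) in GiB."""
    with open(f"/proc/{pid}/status") as fh:
        kib = parse_status_kib(fh.read(), "VmHWM")
    return None if kib is None else kib / 1024 ** 2


# Illustrative sample only: made-up values in the layout of /proc/<pid>/status.
sample = "Name:\tpython\nVmHWM:\t 2097152 kB\nVmRSS:\t 1048576 kB\n"
print(parse_status_kib(sample, "VmHWM") / 1024 ** 2, "GiB peak")
print(parse_status_kib(sample, "VmRSS") / 1024 ** 2, "GiB current")
```

Calling `peak_rss_gib(pid)` every minute or so during a refinement would show whether host memory climbs towards the installed RAM before the job disappears.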
Other info:
The setup runs on a workstation with 64 GB RAM and one GTX 1080 assigned to CryoSPARC (8 GB VRAM; I know this is below the minimum spec, but that makes the error behavior even weirder in my opinion).
CryoSPARC instance information
Type: single workstation
$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/sander/cryosparc_test/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------
CryoSPARC process status:
app                              RUNNING   pid 5524, uptime 3 days, 23:42:29
app_api                          RUNNING   pid 5550, uptime 3 days, 23:42:27
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 5391, uptime 3 days, 23:42:41
command_rtp                      RUNNING   pid 5483, uptime 3 days, 23:42:32
command_vis                      RUNNING   pid 5457, uptime 3 days, 23:42:33
database                         RUNNING   pid 5174, uptime 3 days, 23:42:44
----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------
global config variables:
export CRYOSPARC_LICENSE_ID=[redacted]
export CRYOSPARC_MASTER_HOSTNAME=[redacted]
export CRYOSPARC_DB_PATH="/data/sander/cryosparc_test_data"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_HEARTBEAT_SECONDS=300
uname -a && free -g
Linux [redacted] 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:             62          50           2           0           9          10
Swap:            63           4          59
CryoSPARC worker environment
env | grep PATH
NUMBA_CUDA_INCLUDE_PATH=/home/sander/cryosparc_test/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=
PATH=/home/sander/cryosparc_test/cryosparc_worker/bin:/home/sander/cryosparc_test/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/sander/cryosparc_test/cryosparc_worker/deps/anaconda/condabin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/home/sander/cryosparc_test/cryosparc_master/bin:/opt/apps/imod/IMOD/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/apps/imod/IMOD/pythonLink:/var/lib/snapd/snap/bin:/home/sander/.local/bin:/home/sander/bin
MODULEPATH=/etc/modulefiles
CRYOSPARC_PATH=/home/sander/cryosparc_test/cryosparc_worker/bin
PYTHONPATH=/home/sander/cryosparc_test/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.8
/sbin/ldconfig -p | grep -i cuda
libpcsamplingutil.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libpcsamplingutil.so
libnvrtc.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12
libnvrtc.so.11.2 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc.so.11.2
libnvrtc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so
libnvrtc.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc.so
libnvrtc-builtins.so.12.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so.12.0
libnvrtc-builtins.so.11.8 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc-builtins.so.11.8
libnvrtc-builtins.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so
libnvrtc-builtins.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvrtc-builtins.so
libnvperf_target.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvperf_target.so
libnvperf_host.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvperf_host.so
libnvjpeg.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvjpeg.so.12
libnvjpeg.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvjpeg.so.11
libnvjpeg.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvjpeg.so
libnvjpeg.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvjpeg.so
libnvblas.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so.12
libnvblas.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvblas.so.11
libnvblas.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so
libnvblas.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvblas.so
libnvToolsExt.so.1 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvToolsExt.so.1
libnvToolsExt.so.1 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvToolsExt.so.1
libnvToolsExt.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvToolsExt.so
libnvToolsExt.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvToolsExt.so
libnvJitLink.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvJitLink.so.12
libnvJitLink.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvJitLink.so
libnpps.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnpps.so.12
libnpps.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnpps.so.11
libnpps.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnpps.so
libnpps.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnpps.so
libnppitc.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppitc.so.12
libnppitc.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppitc.so.11
libnppitc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppitc.so
libnppitc.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppitc.so
libnppisu.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppisu.so.12
libnppisu.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppisu.so.11
libnppisu.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppisu.so
libnppisu.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppisu.so
libnppist.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppist.so.12
libnppist.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppist.so.11
libnppist.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppist.so
libnppist.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppist.so
libnppim.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppim.so.12
libnppim.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppim.so.11
libnppim.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppim.so
libnppim.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppim.so
libnppig.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppig.so.12
libnppig.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppig.so.11
libnppig.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppig.so
libnppig.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppig.so
libnppif.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppif.so.12
libnppif.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppif.so.11
libnppif.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppif.so
libnppif.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppif.so
libnppidei.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.12
libnppidei.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppidei.so.11
libnppidei.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so
libnppidei.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppidei.so
libnppicc.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppicc.so.12
libnppicc.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppicc.so.11
libnppicc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppicc.so
libnppicc.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppicc.so
libnppial.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppial.so.12
libnppial.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppial.so.11
libnppial.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppial.so
libnppial.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppial.so
libnppc.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppc.so.12
libnppc.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppc.so.11
libnppc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppc.so
libnppc.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libnppc.so
libicudata.so.50 (libc6,x86-64) => /lib64/libicudata.so.50
libicudata.so (libc6,x86-64) => /lib64/libicudata.so
libcusparse.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so.12
libcusparse.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcusparse.so.11
libcusparse.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so
libcusparse.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcusparse.so
libcusolverMg.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolverMg.so.11
libcusolverMg.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcusolverMg.so.11
libcusolverMg.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolverMg.so
libcusolverMg.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcusolverMg.so
libcusolver.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolver.so.11
libcusolver.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcusolver.so.11
libcusolver.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolver.so
libcusolver.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcusolver.so
libcurand.so.10 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so.10 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcurand.so
libcurand.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so
libcupti.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcupti.so.12
libcupti.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcupti.so
libcuinj64.so.12.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcuinj64.so.12.0
libcuinj64.so.11.8 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcuinj64.so.11.8
libcuinj64.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcuinj64.so
libcuinj64.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcuinj64.so
libcufile_rdma.so.1 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile_rdma.so.1
libcufile_rdma.so.1 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufile_rdma.so.1
libcufile_rdma.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile_rdma.so
libcufile_rdma.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufile_rdma.so
libcufile.so.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile.so.0
libcufile.so.0 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufile.so.0
libcufile.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile.so
libcufile.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufile.so
libcufftw.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufftw.so.11
libcufftw.so.10 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufftw.so.10
libcufftw.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufftw.so
libcufftw.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufftw.so
libcufft.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.11
libcufft.so.10 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufft.so.10
libcufft.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so
libcufft.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufft.so
libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.0
libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
libcudart.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,x86-64) => /lib64/libcudadebugger.so.1
libcuda.so.1 (libc6,x86-64) => /lib64/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib64/libcuda.so
libcublasLt.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12
libcublasLt.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublasLt.so.11
libcublasLt.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so
libcublasLt.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublasLt.so
libcublas.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12
libcublas.so.11 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublas.so.11
libcublas.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so
libcublas.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublas.so
libcheckpoint.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcheckpoint.so
libaccinj64.so.12.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libaccinj64.so.12.0
libaccinj64.so.11.8 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libaccinj64.so.11.8
libaccinj64.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libaccinj64.so
libaccinj64.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libaccinj64.so
libOpenCL.so.1 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1
libOpenCL.so.1 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libOpenCL.so.1
libOpenCL.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so
libOpenCL.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libOpenCL.so
uname -a
Linux [redacted] 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
free -g
              total        used        free      shared  buff/cache   available
Mem:             62          50           2           0           9          10
Swap:            63           4          59
nvidia-smi
Mon Jan 15 13:18:06 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 27%   29C    P8    10W / 180W |    364MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 37%   58C    P2    53W / 180W |   2662MiB /  8192MiB |     53%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1227      G   /usr/bin/X                         18MiB |
|    0   N/A  N/A      2611      G   /usr/bin/gnome-shell               61MiB |
|    0   N/A  N/A      4360      G   /usr/bin/X                         34MiB |
|    0   N/A  N/A      4953      G   /usr/bin/gnome-shell               63MiB |
|    0   N/A  N/A     24507      G   /usr/bin/X                         89MiB |
|    0   N/A  N/A     25187      G   /usr/bin/gnome-shell               89MiB |
|    1   N/A  N/A     24955      C   python                           2658MiB |
+-----------------------------------------------------------------------------+