Hi everyone,
I’m trying to run a 2D classification job on ~1.5M of 400-px particles and it fails at random points (no heartbeat). I use the most recent v4.5.1 but had similar issues on 4.4. Increasing the number of classes to 200 and the Batchsize per class to 200 causes the job to fail earlier. Nevertheless, it always fails at online-EM iterations, never getting to full iterations.
I checked the job logfile and it is full of the following warnings:
cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2294: UserWarning: Kernel function check_error_for_nans called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray
Could this be related?
I’m running a single workstation with an RTX 4090 (24 GB), 512 GB of RAM and 64-thread CPU
Here is the output ov the evaluation script:
CRYOSPARC_PATH=/home/milosz/software/cryosparc/cryosparc_worker/bin
WINDOWPATH=2
PYTHONPATH=/home/milosz/software/cryosparc/cryosparc_worker
NUMBA_CUDA_INCLUDE_PATH=/home/milosz/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LIBTBX_OPATH=
LD_LIBRARY_PATH=
PATH=/home/milosz/software/cryosparc/cryosparc_worker/bin:/home/milosz/software/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/milosz/software/cryosparc/cryosparc_worker/deps/anaconda/condabin:/home/milosz/software/miniconda3/condabin:/home/milosz/software/cryosparc/cryosparc_master/bin:/home/milosz/software/amber/amber22/bin:/home/milosz/software/ccp4-8.0/etc:/home/milosz/software/ccp4-8.0/bin:/home/milosz/software/ccpem-1.6.0/bin:/home/milosz/software/phenix/phenix-1.21rc1-5058/build/bin:/home/milosz/software/MGLTools-1.5.7/bin:/home/milosz/software/vina:/home/milosz/software/xds:/home/milosz/.local/bin:/home/milosz/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/thinlinc/bin:/opt/thinlinc/sbin
libpcsamplingutil.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libpcsamplingutil.so
libnvrtc.so.11.2 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.11.2
libnvrtc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so
libnvrtc-builtins.so.11.8 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so.11.8
libnvrtc-builtins.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so
libnvperf_target.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvperf_target.so
libnvperf_host.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvperf_host.so
libnvjpeg.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvjpeg.so.11
libnvjpeg.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvjpeg.so
libnvblas.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so.11
libnvblas.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so
libnvToolsExt.so.1 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvToolsExt.so.1
libnvToolsExt.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvToolsExt.so
libnpps.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnpps.so.11
libnpps.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnpps.so
libnppitc.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppitc.so.11
libnppitc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppitc.so
libnppisu.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppisu.so.11
libnppisu.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppisu.so
libnppist.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppist.so.11
libnppist.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppist.so
libnppim.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppim.so.11
libnppim.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppim.so
libnppig.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppig.so.11
libnppig.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppig.so
libnppif.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppif.so.11
libnppif.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppif.so
libnppidei.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.11
libnppidei.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so
libnppicc.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppicc.so.11
libnppicc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppicc.so
libnppial.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppial.so.11
libnppial.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppial.so
libnppc.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppc.so.11
libnppc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnppc.so
libicudata.so.70 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.70
libicudata.so.70 (ELF) => /lib/i386-linux-gnu/libicudata.so.70
libcusparse.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so.11
libcusparse.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so
libcusolverMg.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolverMg.so.11
libcusolverMg.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolverMg.so
libcusolver.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolver.so.11
libcusolver.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcusolver.so
libcurand.so.10 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcurand.so
libcupti.so.11.8 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcupti.so.11.8
libcupti.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcupti.so
libcuinj64.so.11.8 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcuinj64.so.11.8
libcuinj64.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcuinj64.so
libcufile_rdma.so.1 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile_rdma.so.1
libcufile_rdma.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile_rdma.so
libcufile.so.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile.so.0
libcufile.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufile.so
libcufftw.so.10 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufftw.so.10
libcufftw.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufftw.so
libcufft.so.10 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10
libcufft.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so
libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6) => /lib/i386-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /lib/i386-linux-gnu/libcuda.so
libcublasLt.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.11
libcublasLt.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so
libcublas.so.11 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.11
libcublas.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so
libcheckpoint.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcheckpoint.so
libaccinj64.so.11.8 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libaccinj64.so.11.8
libaccinj64.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libaccinj64.so
libOpenCL.so.1 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1
libOpenCL.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so
Linux proline 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
total used free shared buff/cache available
Mem: 503 7 13 0 483 492
Swap: 0 0 0
Sat May 25 10:23:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 On | Off |
| 0% 49C P8 28W / 450W | 776MiB / 24564MiB | 18% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 4007 G /usr/lib/xorg/Xorg 430MiB |
| 0 N/A N/A 4165 G /usr/bin/gnome-shell 65MiB |
| 0 N/A N/A 4270 G /opt/teamviewer/tv_bin/TeamViewer 19MiB |
| 0 N/A N/A 34321 G ...irefox/4259/usr/lib/firefox/firefox 236MiB |
+---------------------------------------------------------------------------------------+
Many thanks in advance for any help.
Milosz