Almost regular occurrence of "database EXITED"

It is possible that the particle stack has been removed from cache during routine cache operation, unrelated to the job’s I/O error.

If you encounter the I/O error again when running a reconstruction, refinement or classification job with cached particles, please run the ls -lHs command for the affected file immediately after observing the error, on a node where the shared cache folder is mounted.
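A minimal sketch of that check follows. It wraps ls -lHs in a small helper that also distinguishes a dangling symlink from a path that does not exist at all. The function name check_cached_file and its return codes are assumptions made for illustration; they are not part of CryoSPARC.

```shell
#!/usr/bin/env bash
# check_cached_file PATH
# Hypothetical helper (not part of CryoSPARC): inspect a cached file
# right after a job reports an I/O error for it.
check_cached_file() {
    local path="$1"
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        # The symlink itself exists, but its target is gone.
        echo "DANGLING: $path -> $(readlink "$path")"
        return 2
    elif [ ! -e "$path" ]; then
        # Neither a file nor a symlink exists at this path.
        echo "MISSING: $path"
        return 1
    fi
    # -l long listing, -H follow a symlink given on the command line,
    # -s also print the allocated size in blocks.
    ls -lHs "$path"
}

# Run the check only when a path is passed to the script.
if [ "$#" -gt 0 ]; then
    check_cached_file "$1"
fi
```

The argument would be the filename reported under "I/O request details" in the job's error message.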

Hi,

In the last 24 hours we had two failed Non-uniform Refinement jobs with SSD cache turned on:

Traceback (most recent call last):
  File "/opt/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2306, in run_with_except_hook
    run_old(*args, **kw)
  File "/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2730, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2763, in cryosparc_master.cryosparc_compute.engine.newengine.process.work
  File "/opt/cryosparc_worker/cryosparc_compute/ioengine/cmdbuf.py", line 87, in wait
    raise IOError('\n\n'.join(errs))
OSError: I/O error, iosys_process_cached_fd_reads line 2311: Interrupted system call
The file is probably corrupt. If this is a movie, try deleting it and re-importing the movie set. If this is a particle stack, try the 'check for corrupt particles' job (if corrupt particles are found, they will be excluded from the job's output).

I/O request details:
	filename:  /ceph/hpc/scratch/user/cryosparc/instance/links/P103-J242-1779855707/9e3aec69c578c2ff8858ae80d44a1ea66db570b4.mrc
	data type: 0x10
	frames:    [28:29]
	eer upsample factor: 2
	eer number of fractions: 40
...
...
(and it continues like this for a while)

I tried running ls -lHs on both one of the cluster nodes and the master, and got the same response as described above in Almost regular occurrence of "database EXITED" - #20 by eMKiso.

The folder /ceph/hpc/scratch/user/cryosparc/instance/links was empty, so ls had nothing to list anyway. Is it expected that the links folder is empty?

It will be challenging to catch the moment the job fails because it happens randomly. Of the two NU refine jobs, one failed after 1.5 hours and the other after 5 hours.
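One way around the timing problem is to record the state of the links folder at regular intervals while a job runs, so that a listing from shortly before the failure is available afterwards. The following is only a sketch under that assumption; snapshot_links and watch_links are hypothetical helper names, and the interval and count are arbitrary.

```shell
#!/usr/bin/env bash
# snapshot_links DIR LOG
# Hypothetical helper: append one timestamped listing of DIR to LOG.
snapshot_links() {
    local dir="$1" log="$2"
    {
        echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) $(hostname) $dir"
        # -L dereferences the symlinks inside DIR so the sizes describe
        # the cached files themselves. A failing ls is logged, not fatal.
        ls -lLs "$dir" 2>&1 || echo "ls failed with status $?"
    } >> "$log"
}

# watch_links DIR LOG INTERVAL_SECONDS COUNT
# Take COUNT snapshots, sleeping INTERVAL_SECONDS after each one.
watch_links() {
    local dir="$1" log="$2" interval="$3" count="$4"
    local i=0
    while [ "$i" -lt "$count" ]; do
        snapshot_links "$dir" "$log"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, `watch_links /ceph/hpc/scratch/user/cryosparc/instance/links links.log 60 600` would take one snapshot per minute for about ten hours, which covers both failure times mentioned above.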

Hi,

It was suggested that I run this command on one of the nodes where jobs failed:

singularity exec --nv /pathtosoftware/cryosparc_worker/cryosparc-worker.sif /opt/cryosparc_worker/bin/cryosparcw call env | grep -v LICENSE_ID

And here is the output:

singularity exec --nv /path/cryosparc-worker.sif /opt/cryosparc_worker/bin/cryosparcw call env | grep -v LICENSE_ID
INFO:    Setting 'NVIDIA_VISIBLE_DEVICES=all' to emulate legacy GPU binding.
INFO:    Setting --writable-tmpfs (required by nvidia-container-cli)
SHELL=/bin/bash
NV_LIBCUBLAS_VERSION=11.11.3.6-1
NVIDIA_VISIBLE_DEVICES=all
NV_NVML_DEV_VERSION=11.8.86-1
HISTCONTROL=ignoredups
CRYOSPARC_MASTER_HOSTNAME=REDACTED
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
CONDA_EXE=/opt/cryosparc_worker/deps/anaconda/bin/conda
_CE_M=
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
PYTHONNOUSERSITE=true
HOSTNAME=gn26
HISTSIZE=20000
SINGULARITY_NAME=cryosparc-worker.sif
CRYOSPARC_USE_GPU=true
FPATH=/usr/local/share/zsh/site-functions:/usr/share/zsh/site-functions:/usr/share/zsh/5.5.1/functions:/usr/share/lmod/lmod/init/ksh_funcs
NUMEXPR_NUM_THREADS=1
LC_ADDRESS=sl_SI.UTF-8
NVIDIA_REQUIRE_CUDA=cuda>=11.8 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471
_ModuleTable002_=fSwKc3lzdGVtQmFzZU1QQVRIID0gIi9ldGMvbW9kdWxlZmlsZXM6L3Vzci9zaGFyZS9tb2R1bGVmaWxlczovdXNyL3NoYXJlL21vZHVsZWZpbGVzL0xpbnV4Oi91c3Ivc2hhcmUvbW9kdWxlZmlsZXMvQ29yZTovdXNyL3NoYXJlL2xtb2QvbG1vZC9tb2R1bGVmaWxlcy9Db3JlIiwKfQo=
LC_NAME=sl_SI.UTF-8
NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-11-8=11.11.3.6-1
CRYOSPARC_PATH=/opt/cryosparc_worker/bin
NV_NVTX_VERSION=11.8.86-1
__LMOD_REF_COUNT_MODULEPATH=/path/Modules/modulefiles:2;/path/software/modulefiles:1;/path/modules/el7/modules/all:1;/etc/modulefiles:1;/usr/share/modulefiles:1;/usr/share/modulefiles/Linux:1;/usr/share/modulefiles/Core:1;/usr/share/lmod/lmod/modulefiles/Core:1
CRYOSPARC_INSECURE=false
NV_CUDA_CUDART_DEV_VERSION=11.8.89-1
NV_LIBCUSPARSE_VERSION=11.7.5.86-1
NV_LIBNPP_VERSION=11.8.0.86-1
LC_MONETARY=sl_SI.UTF-8
SINGULARITY_ENVIRONMENT=/.singularity.d/env/91-environment.sh
NCCL_VERSION=2.16.2-1
LMOD_DIR=/usr/share/lmod/lmod/libexec
PWD=/opt/cryosparc_worker
LOGNAME=cryosparc
CONDA_PREFIX=/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env
MODULESHOME=/usr/share/lmod/lmod
NVIDIA_DRIVER_CAPABILITIES=compute,utility
MANPATH=/usr/share/lmod/lmod/share/man:
NV_NVPROF_DEV_PACKAGE=cuda-nvprof-11-8=11.8.87-1
NV_LIBNPP_PACKAGE=libnpp-11-8=11.8.0.86-1
NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
EASYBUILD_MODULES_TOOL=Lmod
NV_LIBCUBLAS_DEV_VERSION=11.11.3.6-1
NVIDIA_PRODUCT_NAME=CUDA
USER_PATH=/path/software/cryosparc_master/bin/:/path/home/cryosparc/.local/bin:/path/home/cryosparc/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin
NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-11-8
APPTAINER_ENVIRONMENT=/.singularity.d/env/91-environment.sh
LD_PRELOAD=/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libpython3.10.so
NV_CUDA_CUDART_VERSION=11.8.89-1
APPTAINER_APPNAME=
HOME=/path/home/cryosparc
_ModuleTable_Sz_=2
LC_PAPER=sl_SI.UTF-8
LANG=en_GB.UTF-8
CRYOSPARC_ROOT_DIR=/opt/cryosparc_worker
LS_COLORS=rs=0:di=38;5;33:ln=38;5;51:mh=00:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=01;05;37;41:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;40:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.zst=38;5;9:*.tzst=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.wim=38;5;9:*.swm=38;5;9:*.dwm=38;5;9:*.esd=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.mjpg=38;5;13:*.mjpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.m4a=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.oga=38;5;45:*.opus=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:
APPTAINER_COMMAND=exec
CRYOSPARC_CACHE_NUM_THREADS=4
CUDA_VERSION=11.8.0
SINGULARITY_CONTAINER=/path/software/cryosparc_worker/cryosparc-worker.sif
NV_LIBCUBLAS_PACKAGE=libcublas-11-8=11.11.3.6-1
LMOD_SETTARG_FULL_SUPPORT=no
CRYOSPARC_IMPROVED_SSD_CACHE=true
CRYOSPARC_DB_PATH=/path/home/cryosparc
NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-11-8=11.8.0-1
CONDA_PROMPT_MODIFIER=(cryosparc_worker_env) 
PROMPT_COMMAND=PS1="Apptainer> "; unset PROMPT_COMMAND
LMOD_VERSION=8.7.48
SSH_CONNECTION=REDACTED
NV_LIBNPP_DEV_PACKAGE=libnpp-dev-11-8=11.8.0.86-1
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-11-8
NV_LIBNPP_DEV_VERSION=11.8.0.86-1
MODULEPATH_ROOT=/usr/share/modulefiles
LMOD_PKG=/usr/share/lmod/lmod
APPTAINER_CONTAINER=/path/software/cryosparc_worker/cryosparc-worker.sif
CRYOSPARC_BASE_PORT=39000
PYTHONPATH=/opt/cryosparc_worker
TERM=xterm-256color
LC_IDENTIFICATION=sl_SI.UTF-8
NV_LIBCUSPARSE_DEV_VERSION=11.7.5.86-1
_CE_CONDA=
LESSOPEN=||/usr/bin/lesspipe.sh %s
USER=cryosparc
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
EASYBUILD_PREFIX=/path/modules/el7
CONDA_SHLVL=1
CRYOSPARC_DEVELOP=false
NUMBA_CUDA_INCLUDE_PATH=/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LMOD_ROOT=/usr/share/lmod
SHLVL=1
BASH_ENV=/usr/share/lmod/lmod/init/bash
NV_CUDA_LIB_VERSION=11.8.0-1
NVARCH=x86_64
LMOD_sys=Linux
LC_TELEPHONE=sl_SI.UTF-8
LC_MEASUREMENT=sl_SI.UTF-8
APPTAINER_NAME=cryosparc-worker.sif
SINGULARITY_BIND=
NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-8
APPTAINER_BIND=
_ModuleTable001_=X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IGZhbHNlLApjX3Nob3J0VGltZSA9IGZhbHNlLApkZXB0aFQgPSB7fSwKZmFtaWx5ID0ge30sCm1UID0ge30sCm1wYXRoQSA9IHsKIi91c3Ivc2hhcmUvTW9kdWxlcy9tb2R1bGVmaWxlcyIsICIvY2VwaC9ocGMvc29mdHdhcmUvbW9kdWxlZmlsZXMiLCAiL2N2bWZzL3NsaW5nLnNpL21vZHVsZXMvZWw3L21vZHVsZXMvYWxsIiwgIi9ldGMvbW9kdWxlZmlsZXMiLCAiL3Vzci9zaGFyZS9tb2R1bGVmaWxlcyIsICIvdXNyL3NoYXJlL21vZHVsZWZpbGVzL0xpbnV4IiwgIi91c3Ivc2hhcmUvbW9kdWxlZmlsZXMvQ29yZSIsICIvdXNyL3NoYXJlL2xtb2QvbG1vZC9tb2R1bGVmaWxlcy9Db3JlIiwK
CONDA_PYTHON_EXE=/opt/cryosparc_worker/deps/anaconda/bin/python
NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/include:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
SLURM_JOB_ID=36572339
S_COLORS=auto
SSH_CLIENT=REDACTED
CONDA_DEFAULT_ENV=cryosparc_worker_env
NUMPY_MADVISE_HUGEPAGE=0
NV_CUDA_NSIGHT_COMPUTE_VERSION=11.8.0-1
LC_TIME=sl_SI.UTF-8
OMP_NUM_THREADS=1
NUMBA_CUDA_USE_NVIDIA_BINDING=1
which_declare=declare -f
NV_NVPROF_VERSION=11.8.87-1
LC_ALL=C
CUDA_HOME=/usr/local/cuda
__Init_Default_Modules=1
PATH=/opt/cryosparc_worker/bin:/opt/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/opt/cryosparc_worker/deps/anaconda/condabin:/usr/local/cuda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
MODULEPATH=/usr/share/Modules/modulefiles:/path/software/modulefiles:/path/modules/el7/modules/all:/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core
NV_LIBNCCL_PACKAGE_NAME=libnccl2
NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod
MKL_NUM_THREADS=1
NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0
MAIL=/var/spool/mail/cryosparc
SSH_TTY=/dev/pts/0
CRYOSPARC_CONDA_ENV=cryosparc_worker_env
LC_NUMERIC=sl_SI.UTF-8
OLDPWD=/opt/cryosparc_worker
BASH_FUNC_ml%%=() {  eval "$($LMOD_DIR/ml_cmd "$@")"
}
BASH_FUNC_which%%=() {  ( alias;
 eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
BASH_FUNC_module%%=() {  if [ -z "${LMOD_SH_DBG_ON+x}" ]; then
 case "$-" in 
 *v*x*)
 __lmod_sh_dbg='vx'
 ;;
 *v*)
 __lmod_sh_dbg='v'
 ;;
 *x*)
 __lmod_sh_dbg='x'
 ;;
 esac;
 fi;
 if [ -n "${__lmod_sh_dbg:-}" ]; then
 set +$__lmod_sh_dbg;
 echo "Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output" 1>&2;
 fi;
 eval "$($LMOD_CMD shell "$@")" && eval "$(${LMOD_SETTARG_CMD:-:} -s sh)";
 __lmod_my_status=$?;
 if [ -n "${__lmod_sh_dbg:-}" ]; then
 echo "Shell debugging restarted" 1>&2;
 set -$__lmod_sh_dbg;
 fi;
 unset __lmod_sh_dbg;
 return $__lmod_my_status
}
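Since most of the dump comes from the module system and the CUDA container image, a small filter can narrow it to the CryoSPARC settings. This is a sketch only; filter_cryosparc_env is a hypothetical helper name, and it assumes the relevant variables all start with CRYOSPARC_.

```shell
#!/usr/bin/env bash
# filter_cryosparc_env
# Hypothetical helper: read `env`-style output on stdin and keep only
# the CRYOSPARC_* variables, dropping the license ID and sorting the rest.
filter_cryosparc_env() {
    grep '^CRYOSPARC_' | grep -v 'LICENSE_ID' | sort
}
```

It would be appended to the same pipeline, for example `singularity exec --nv /pathtosoftware/cryosparc_worker/cryosparc-worker.sif /opt/cryosparc_worker/bin/cryosparcw call env | filter_cryosparc_env`. Applied to the output above, this leaves lines such as CRYOSPARC_CACHE_NUM_THREADS=4 and CRYOSPARC_IMPROVED_SSD_CACHE=true.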

Best regards!

Thanks @eMKiso. I may have missed it, but have you already tried including

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"

inside cryosparc_worker/config.sh?
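A quick way to confirm whether the line is already present is to search config.sh for an uncommented export of that variable. The sketch below assumes the exact variable name and value quoted above; has_lock_strategy is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# has_lock_strategy CONFIG_FILE
# Hypothetical helper: succeed if CONFIG_FILE contains an uncommented
# line exporting CRYOSPARC_CACHE_LOCK_STRATEGY as master, with or
# without double quotes around the value.
has_lock_strategy() {
    local config="$1"
    grep -Eq '^[[:space:]]*export[[:space:]]+CRYOSPARC_CACHE_LOCK_STRATEGY="?master"?[[:space:]]*$' "$config"
}
```

For example, `has_lock_strategy cryosparc_worker/config.sh && echo "lock strategy set"`. If the cryosparcw call env command shown earlier in the thread picks up config.sh, as it appears to, the variable should also show up in its output once the line has been added.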

Hi, we didn’t have that enabled.

I enabled it now to see if it makes a difference. Thanks!

Please update this thread with your findings.