KeyError: <pycuda._driver.Context object at 0x7f36df70dc30>

alexandre.durand · March 14, 2022, 12:28pm

Dear all,

We are currently encountering the following error message on all our jobs:

This appeared quite suddenly, without any apparent changes on the server.
We are running cryoSPARC 3.3.1.

Any idea what could be the problem?
Cheers,
Alexandre

wtempel · March 16, 2022, 8:49pm

@alexandre.durand There are various possible causes.
Could there be a hardware failure? As a precaution, I recommend backing up the cryoSPARC database. (Consider regular database backups on production cryoSPARC instances.)
There also could be a hardware or software incompatibility. Output from the following commands may help figure this out:

uname -a
gcc -v
grep CUDA <path-to-cryosparc>/cryosparc_worker/config.sh
driver version and GPU model(s) from nvidia-smi
grep -m 1 "model name" /proc/cpuinfo
free -g

eval $(<path-to-cryosparc>/cryosparc_worker/bin/cryosparcw env)
python -c "import pycuda.driver; print(pycuda.driver.get_version())"

df -h /usr
df -h /boot

Are OS updates applied automatically on the worker? If so, could one of them have introduced an incompatibility?
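
For anyone who wants to gather the generic checks above in one pass, here is a minimal Python sketch. The helper names (`run_diagnostic`, `collect`, `format_report`) are made up for illustration and are not part of cryoSPARC. The two cryoSPARC-specific commands are left out because they need the real installation path, and `nvidia-smi` is assumed to be run without arguments.

```python
import shlex
import subprocess

# Generic commands from the list above. nvidia-smi is run without arguments
# (an assumption); the cryoSPARC-specific commands are omitted because they
# depend on the local installation path.
COMMANDS = [
    "uname -a",
    "gcc -v",
    "nvidia-smi",
    'grep -m 1 "model name" /proc/cpuinfo',
    "free -g",
    "df -h /usr",
    "df -h /boot",
]


def run_diagnostic(command, timeout=30):
    """Run one command and return (command, exit_code, combined_output)."""
    try:
        proc = subprocess.run(
            shlex.split(command),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # gcc -v writes its report to stderr
            text=True,
            timeout=timeout,
        )
        return command, proc.returncode, proc.stdout.strip()
    except FileNotFoundError:
        # The program is not installed or not on PATH.
        return command, None, "command not found"
    except subprocess.TimeoutExpired:
        return command, None, "timed out"


def collect(commands=COMMANDS):
    """Run every command in order and return the list of result tuples."""
    return [run_diagnostic(c) for c in commands]


def format_report(results):
    """Turn result tuples into a plain-text report suitable for pasting."""
    lines = []
    for command, code, output in results:
        lines.append(f"$ {command}  (exit code: {code})")
        lines.append(output)
        lines.append("")
    return "\n".join(lines)
```

Running `print(format_report(collect()))` on the worker node would produce a single block of text to paste into a reply. A missing program is reported as "command not found" instead of stopping the run.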

balletn · March 21, 2022, 9:51am

Hi @wtempel, I’m @alexandre.durand’s colleague. Here are the outputs you asked for:

uname -a:

Linux cbi-cryosparc-02 5.4.0-104-generic #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

gcc -v:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-yTrUTS/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04)

grep CUDA <path-to-cryosparc>/cryosparc_worker/config.sh:

export CRYOSPARC_CUDA_PATH="/usr/local/cuda"

nvidia-smi versions and models:

NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4
3x NVIDIA A100-PCI...

grep -m 1 "model name" /proc/cpuinfo:

model name	: AMD EPYC 7543 32-Core Processor

free -gh:

              total        used        free      shared  buff/cache   available
Mem:          288Gi       4.3Gi       137Gi       1.0Mi       145Gi       281Gi
Swap:         2.0Gi       8.0Mi       2.0Gi

[...] print(pycuda.driver.get_version()):

(11, 4, 0)

df -h /usr:

Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  250G  116G  124G  49% /

df -h /boot:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2       976M  450M  459M  50% /boot

And no, updates are applied manually when we are not using the server.
No updates have been applied between when it worked and when the bug appeared.

Let me know if you need anything else
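
The version figures reported above can be cross-checked mechanically. The sketch below is a rough check under stated assumptions, not a full compatibility test: it assumes the "CUDA Version" field in the `nvidia-smi` header is the highest CUDA version the driver supports, and that `pycuda.driver.get_version()` returns the CUDA version PyCUDA was compiled against. The function names are hypothetical.

```python
import re


def parse_smi_cuda_version(header_line):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header_line)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def toolkit_supported(driver_cuda, toolkit_cuda):
    """Return True if the driver's CUDA version is at least the toolkit's (major, minor)."""
    return tuple(driver_cuda[:2]) >= tuple(toolkit_cuda[:2])


# Values copied from the outputs in this post.
header = "NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4"
pycuda_version = (11, 4, 0)

driver_cuda = parse_smi_cuda_version(header)
print(driver_cuda, toolkit_supported(driver_cuda, pycuda_version))  # (11, 4) True
```

For the values in this thread the check passes, so the driver and the CUDA version PyCUDA was built against do not show an obvious mismatch.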

balletn · March 24, 2022, 7:18am

I reinstalled build-essential and CUDA, and that fixed the issue, although I don’t understand why.
Let me know if you have an explanation.

wtempel · March 24, 2022, 4:28pm

@balletn After running your observation by our team, I would still not rule out a hardware (disk?) malfunction.