Heterogeneous Refinement fails in v4.5.1

I'm getting repeated reports of Heterogeneous Refinement jobs failing after upgrading to 4.5.1. Rebooting the worker nodes seems to resolve the issue temporarily, but it recurs after a day or two.

Here is the error message:

Traceback (most recent call last):
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1134, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 348, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
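The failure happens in numba's pinned (page-locked) host allocation, not in a GPU-side allocation. For reference, a minimal standalone sketch of the same allocation path, assuming the worker's bundled Python/numba and that cryosparcw env prints the worker environment variables (the 64 MiB buffer size is arbitrary):

# Sketch: exercise the same pinned-memory path (cuMemHostAlloc) outside a CryoSPARC job.
# Assumes the worker install shown in the traceback; the buffer size is illustrative.
eval $(/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw env)
python -c '
import numpy as np
from numba import cuda
# cuda.pinned_array() allocates page-locked host memory via memhostalloc -> cuMemHostAlloc,
# the call that raises CUDA_ERROR_INVALID_VALUE in the traceback above.
buf = cuda.pinned_array((4096, 4096), dtype=np.float32)
print("pinned allocation OK:", buf.nbytes // (1024 * 1024), "MiB")
'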


A bit more info from master:

uname -a && free -g
Linux  5.4.0-170-generic #188-Ubuntu SMP Wed Jan 10 09:51:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:             93          22           0           0          70          70
Swap:           152           1         151

From the worker:

env | grep PATH
CRYOSPARC_PATH=/var/home/cryosparc_user/cryosparc_worker/bin
MANPATH=:/opt/puppetlabs/puppet/share/man
PYTHONPATH=/var/home/cryosparc_user/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda
NUMBA_CUDA_INCLUDE_PATH=/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=
PATH=/var/home/cryosparc_user/cryosparc_worker/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/Supermicro/SuperDoctor5

/sbin/ldconfig -p | grep -i cuda
	libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
	libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
	libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so



uname -a
Linux  5.4.0-181-generic #201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

free -g
              total        used        free      shared  buff/cache   available
Mem:            250          24         199           0          26         223
Swap:            95           0          95


nvidia-smi
Fri May 24 11:40:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:1B:00.0 Off |                  Off |
| 30%   47C    P2             131W / 230W |   4005MiB / 24564MiB |     64%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               Off | 00000000:1C:00.0 Off |                  Off |
| 30%   27C    P8              22W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000               Off | 00000000:1D:00.0 Off |                  Off |
| 30%   28C    P8              19W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000               Off | 00000000:1E:00.0 Off |                  Off |
| 30%   28C    P8              19W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000               Off | 00000000:B2:00.0 Off |                  Off |
| 30%   27C    P8              20W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A5000               Off | 00000000:B3:00.0 Off |                  Off |
| 30%   29C    P8              19W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A5000               Off | 00000000:B4:00.0 Off |                  Off |
| 30%   28C    P8              18W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A5000               Off | 00000000:B5:00.0 Off |                  Off |
| 30%   28C    P8              20W / 230W |     12MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A      9302      C   python                                     3976MiB |
|    1   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    4   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    5   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    6   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
|    7   N/A  N/A      2681      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Thanks @abrilot for posting these details. Could you please add these lines

export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1

to your worker config file
/var/home/cryosparc_user/cryosparc_worker/config.sh
and email us the job log (job.log inside the job directory, or Metadata | Log in the GUI) the next time you encounter this error? I will send you a private message with the email address. A sketch of the config change is below.
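For clarity, one way to make that change, assuming the config path above (the heredoc append is just an example):

# Append the numba CUDA driver-logging variables to the worker config.
cat >> /var/home/cryosparc_user/cryosparc_worker/config.sh <<'EOF'
export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1
EOF
# Jobs queued after this change should include the extra numba driver-call logging in job.log.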

Thanks @abrilot for sending us the job.log. Based on our inspection of the log:

  1. May we ask that you email us the corresponding job report?
  2. May we suggest that you define
    export CRYOSPARC_NO_PAGELOCK=true

    inside the file
    /var/home/cryosparc_user/cryosparc_worker/config.sh

    (guide) and see whether this setting has an effect on the occurrence of CUDA_ERROR_INVALID_VALUE? A sketch of this change follows after this list.
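For reference, a minimal sketch of that change using the same config path (the verification step assumes cryosparcw env prints the worker environment variables):

# Ask the worker engine to avoid page-locked (pinned) host buffers, per the suggestion above.
echo 'export CRYOSPARC_NO_PAGELOCK=true' >> /var/home/cryosparc_user/cryosparc_worker/config.sh
# Verify that newly started worker processes pick up the setting:
eval $(/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw env)
env | grep CRYOSPARC_NO_PAGELOCK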