Heterogeneous Refinement fails in v4.5.1

I haven’t had reports of this error in a while, but here’s the output from a job I found that exhibited it:

cryosparcm cli "get_job('P825', 'J46', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
{'_id': '6658eb9a68790124e2702a6a', 'errors_run': [{'message': '[CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '218.91GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1a:00'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1b:00'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3d:00'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3e:00'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:88:00'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:89:00'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b1:00'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b2:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 48, 'platform_architecture': 'x86_64', 'platform_node': 'hydra', 'platform_release': '5.4.0-181-generic', 'platform_version': '#201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024', 'total_memory': '251.54GB', 'used_memory': '30.19GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 100}, 'compute_num_gpus': {'value': 1}, 'compute_use_ssd': {'value': False}}, 'project_uid': 'P825', 'status': 'failed', 'uid': 'J46', 'version': 'v4.5.1'}

@abrilot Please can you post the output of the command

cryosparcm joblog P825 J46 | tail -n 40

== CUDA [822814] INFO -- add pending dealloc: cuMemFree 39200000 bytes
== CUDA [822814] INFO -- dealloc: cuMemFree 39200000 bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045562413056>)
== CUDA [822814] INFO -- add pending dealloc: cuMemFreeHost ? bytes
== CUDA [822814] INFO -- dealloc: cuMemFreeHost ? bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFreeHost(140045461749760)
Traceback (most recent call last):
== CUDA [822820] INFO -- add pending dealloc: cuMemFree 32400000 bytes
File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
== CUDA [822820] INFO -- dealloc: cuMemFree 32400000 bytes
== CUDA [822820] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045226868736>)
== CUDA [822823] DEBUG -- call driver api: cuCtxGetCurrent()
== CUDA [822823] DEBUG -- call driver api: cuCtxGetDevice()
== CUDA [822823] DEBUG -- call driver api: cuCtxPopCurrent()
set status to failed
**custom thread exception hook caught something
**** handle exception rc
run_old(*args, **kw)
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 639, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
return fn(*args, **kws)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
buffer = current_context().memhostalloc(bytesize)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
pointer = allocator()
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
return driver.cuMemHostAlloc(size, flags)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
========= main process now complete at 2024-05-30 16:50:25.091160.
========= monitor process now complete at 2024-05-30 16:50:25.096414.
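
The traceback bottoms out in numba’s pinned-memory path: pinned_array asks the current CUDA context for a page-locked host buffer via cuMemHostAlloc, and the driver rejects the call with CUDA_ERROR_INVALID_VALUE. For anyone who wants to exercise that same call path outside CryoSPARC, here is a minimal sketch (assuming a working numba CUDA installation; the ~40 MB size is only illustrative, picked to be comparable to the cuMemFree sizes in the log):

# Minimal sketch: request a pinned (page-locked) host buffer through numba's
# cuda.pinned_array, the same API that raised CUDA_ERROR_INVALID_VALUE above.
import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError

nbytes = 40_000_000  # illustrative size only
try:
    buf = cuda.pinned_array(nbytes, dtype=np.uint8)  # backed by cuMemHostAlloc
    print(f"pinned allocation of {buf.nbytes} bytes succeeded")
except CudaAPIError as exc:
    print(f"pinned allocation failed: {exc}")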

Thanks @abrilot.
Is there a cluster workload manager, like gridengine or slurm, used in job submission?
Were other jobs (not necessarily CryoSPARC) running on node hydra when job J46 failed?

No SGE or Slurm. We almost never have other users run jobs on our CryoSPARC servers, and this used to happen often enough that I doubt it was due to other jobs (though I haven’t seen users report this specific issue in some time).

@abrilot Please can you post the outputs of these commands on hydra:

uname -a
grep -v LICENSE_ID /var/home/cryosparc_user/cryosparc_worker/config.sh

Linux hydra 5.4.0-193-generic #213-Ubuntu SMP Fri Aug 2 19:14:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cryosparc_user@hydra:~$ grep -v LICENSE_ID /var/home/cryosparc_user/cryosparc_worker/config.sh

export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda"
export CRYOSPARC_DEVELOP=false
export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1
export CRYOSPARC_NO_PAGELOCK=true
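
(Aside: CRYOSPARC_NO_PAGELOCK=true is presumably what steers the worker away from pinned cuMemHostAlloc buffers and onto ordinary pageable host memory. Purely as an illustration of how such a flag is typically honoured, and not CryoSPARC’s actual code, the allocation choice would look something like this:

# Illustrative sketch only -- not CryoSPARC's implementation. Shows how a flag
# like CRYOSPARC_NO_PAGELOCK could switch host buffers between pinned
# (page-locked) memory and ordinary pageable memory.
import os
import numpy as np
from numba import cuda

def alloc_host_buffer(shape, dtype=np.float32):
    """Return a host array, pinned unless the no-pagelock flag is set."""
    if os.environ.get("CRYOSPARC_NO_PAGELOCK", "false").lower() == "true":
        # Pageable memory: slower host<->device copies, but no cuMemHostAlloc call.
        return np.empty(shape, dtype=dtype)
    # Pinned memory: enables faster asynchronous copies, allocated via cuMemHostAlloc.
    return cuda.pinned_array(shape, dtype=dtype)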

@abrilot Please can you run these commands on the hydra computer:

cd $(mktemp -d)
pwd # make a note so you can find the directory later
tar zcvf worker_config.tgz /var/home/cryosparc_user/cryosparc_worker/config.sh

and send us the worker_config.tgz file. I will let you know the email address via direct messaging.
We would like to examine the file for non-printing characters that may affect the CRYOSPARC_NO_PAGELOCK setting.
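
If it is easier than sending the archive, a quick self-check for non-printing characters can be done with a few lines of Python (a sketch using only the standard library; adjust the path if your config.sh lives elsewhere):

# Flag any byte in config.sh that is not printable ASCII or a tab,
# e.g. smart quotes or non-breaking spaces that could corrupt an export line.
path = "/var/home/cryosparc_user/cryosparc_worker/config.sh"
with open(path, "rb") as fh:
    for lineno, line in enumerate(fh, start=1):
        for col, byte in enumerate(line.rstrip(b"\r\n"), start=1):
            if not (0x20 <= byte <= 0x7E or byte == 0x09):
                print(f"line {lineno}, col {col}: unexpected byte 0x{byte:02x}")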

To be clear, this issue seems resolved for us, so I don’t think there is anything wrong with our config file; I was just posting from files I still have from when the failure occurred.

Let me know if you still want the config file.

Glad to hear that, @abrilot. How did you resolve it?