Heterogeneous Refinement fails in v4.5.1

I haven't had reports of this error in a while, but here's the output from a job I was able to find that exhibited it:

cryosparcm cli "get_job('P825', 'J46', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
{'_id': '6658eb9a68790124e2702a6a', 'errors_run': [{'message': '[CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '218.91GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1a:00'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1b:00'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3d:00'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3e:00'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:88:00'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:89:00'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b1:00'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b2:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 48, 'platform_architecture': 'x86_64', 'platform_node': 'hydra', 'platform_release': '5.4.0-181-generic', 'platform_version': '#201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024', 'total_memory': '251.54GB', 'used_memory': '30.19GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 100}, 'compute_num_gpus': {'value': 1}, 'compute_use_ssd': {'value': False}}, 'project_uid': 'P825', 'status': 'failed', 'uid': 'J46', 'version': 'v4.5.1'}

@abrilot Could you please post the output of the command

cryosparcm joblog P825 J46 | tail -n 40

== CUDA [822814] INFO -- add pending dealloc: cuMemFree 39200000 bytes
== CUDA [822814] INFO -- dealloc: cuMemFree 39200000 bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045562413056>)
== CUDA [822814] INFO -- add pending dealloc: cuMemFreeHost ? bytes
== CUDA [822814] INFO -- dealloc: cuMemFreeHost ? bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFreeHost(140045461749760)
Traceback (most recent call last):
== CUDA [822820] INFO -- add pending dealloc: cuMemFree 32400000 bytes
File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
== CUDA [822820] INFO -- dealloc: cuMemFree 32400000 bytes
== CUDA [822820] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045226868736>)
== CUDA [822823] DEBUG -- call driver api: cuCtxGetCurrent()
== CUDA [822823] DEBUG -- call driver api: cuCtxGetDevice()
== CUDA [822823] DEBUG -- call driver api: cuCtxPopCurrent()
set status to failed
**custom thread exception hook caught something
**** handle exception rc
run_old(*args, **kw)
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 639, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
return fn(*args, **kws)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
buffer = current_context().memhostalloc(bytesize)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
pointer = allocator()
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
return driver.cuMemHostAlloc(size, flags)
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
========= main process now complete at 2024-05-30 16:50:25.091160.
========= monitor process now complete at 2024-05-30 16:50:25.096414.

Thanks @abrilot.
Is a cluster workload manager, such as gridengine or slurm, used for job submission?
Were other jobs (not necessarily CryoSPARC) running on node hydra when job J46 failed?
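If it helps, here is a minimal sketch (my own snippet, not part of CryoSPARC) that exercises the same numba call the traceback goes through, numba.cuda.pinned_array(), to check whether pinned host allocation succeeds on hydra outside a running job. The GPU id and the 39,200,000-byte size are assumptions, the size simply mirroring the buffer sizes in the joblog; run it with the worker's Python interpreter (the cryosparc_worker_env path shown in the traceback above).

import numpy as np
from numba import cuda

# Assumption: test on GPU 0; repeat with other device ids if needed.
cuda.select_device(0)

# Allocate page-locked (pinned) host memory via the same numba API the job used;
# under the hood this calls cuMemHostAlloc, which is what raised
# CUDA_ERROR_INVALID_VALUE in J46.
nbytes = 39_200_000  # example size only, mirroring the sizes seen in the joblog
buf = cuda.pinned_array(nbytes, dtype=np.uint8)
buf[:] = 1
print(f"pinned allocation of {buf.nbytes} bytes succeeded")

If this snippet also fails outside CryoSPARC, that would point at the driver/toolkit setup on the node rather than at the job itself.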

No SGE or Slurm. We almost never have other users running jobs on our CryoSPARC servers, and this used to happen often enough that I doubt it was due to other jobs (though I haven't seen users report this specific issue in some time).